[00:01:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[00:03:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:42] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:05:39] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139577|enwiki and commons: Increase revision-slots cache expiry again (T183490)]] (duration: 13m 45s)
[00:05:44] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:06:31] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2241.codfw.wmnet with reason: Maintenance
[00:06:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[2242-2243].codfw.wmnet with reason: Maintenance
[00:10:24] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 614.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[01:03:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:03:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:04:37] ops-eqiad, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968 (phaultfinder) NEW
[01:19:37] ops-eqiad, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10778723 (phaultfinder)
[01:22:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:50:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:52:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:58:42] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:20] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:00:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:02:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:50] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:08:24] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:27:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:29:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:40:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:41:50] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:49:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:51:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:02:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:50] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:18:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:35:20] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:36:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:59:56] (CR) Arnaudb: [C:+1] "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[04:14:31] ops-codfw, SRE, DC-Ops: codfw:expansion: Console/management wiring - https://phabricator.wikimedia.org/T382383#10778823 (Papaul) Open→Resolved This is complete
[04:15:03] (PS1) Papaul: Add new PDU's in row E and F [puppet] - https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504)
[04:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:31:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:32:52] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:47:20] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:48:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:03:16] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:03:44] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:25] (PS2) Arnaudb: gerrit: switchover to gerrit1003 [dns] - https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833)
[05:09:25] (CR) Arnaudb: "I had to rebase locally due to merge conflicts, lmk if you spot anything weird" [dns] - https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:12:56] (CR) Arnaudb: "nitpick comment added" [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:22:46] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:25:03] FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[05:27:04] (PS1) Arnaudb: gerrit: failover bugfix [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666)
[05:27:04] (CR) Arnaudb: "Prepping for today's switchover I stumbled upon this error" [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[05:29:44] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:03] RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[05:30:16] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:34:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:49:11] Deploying MinT on the staging.
[05:51:41] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:58:22] ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10778904 (VRiley-WMF) Created a Dell service request for this Service Request 209252181.
[05:58:42] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:59:28] (PS1) Marostegui: wmnet: Switchover m1-master [dns] - https://gerrit.wikimedia.org/r/1139983 (https://phabricator.wikimedia.org/T392806)
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0600)
[06:00:44] !log Failover m1-master T392806
[06:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:09] (CR) Marostegui: [C:+2] wmnet: Switchover m1-master [dns] - https://gerrit.wikimedia.org/r/1139983 (https://phabricator.wikimedia.org/T392806) (owner: Marostegui)
[06:01:13] !log marostegui@dns1006 START - running authdns-update
[06:01:27] (CR) Ayounsi: [C:+2] Fastnetmon: bump threshold_pps to 1.75M [puppet] - https://gerrit.wikimedia.org/r/1139503 (owner: Ayounsi)
[06:01:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:03:43] !log marostegui@dns1006 END - running authdns-update
[06:11:48] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:13:33] !log magru: remove novaacore/momentum
[06:13:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:11] (CR) Ayounsi: [C:+2] magru: remove novaacore/momentum [homer/public] - https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) (owner: Ayounsi)
[06:14:45] (Merged) jenkins-bot: magru: remove novaacore/momentum [homer/public] - https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) (owner: Ayounsi)
[06:15:06] (PS1) KartikMistry: cxserver: Use URL instead of mw.Uri [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241)
[06:28:49] ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10778959 (VRiley-WMF)
[06:31:14] ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10778961 (VRiley-WMF)
[06:31:30] ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10778962 (VRiley-WMF) Open→Resolved
[06:32:33] jouncebot: refresh
[06:32:33] I refreshed my knowledge about deployments.
[06:32:38] jouncebot: nowandnext
[06:32:38] For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0600)
[06:32:38] In 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0700)
[06:33:00] bots driven development
[06:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:07] triaging is fun
[06:36:07] [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwiki.wikilambda_zobject_function_join' doesn't exist Function: MediaWiki\Extension\WikiLambda\ZObjectStore::findFirstZImplementationFunction Query: SELECT wlzf_zfunction_zid
[06:36:08] :)
[06:37:30] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10778980 (wiki_willy) @VRiley-WMF & @Jclark-ctr - can you grab a spare from one of the decom'd servers for this? >>! In T392751#10770238, @Marostegui wrote...
[06:41:24] (PS1) Slyngshede: P:idp Default OIDC services to FLAT profile [puppet] - https://gerrit.wikimedia.org/r/1140074
[06:42:20] ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10778984 (VRiley-WMF) Open→Resolved a:VRiley-WMF
[06:42:26] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[06:46:59] (CR) Nikerabbit: cxserver: Use URL instead of mw.Uri (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[06:47:35] (CR) Elukey: [C:+2] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[06:48:38] (CR) KartikMistry: cxserver: Use URL instead of mw.Uri (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[06:49:15] (PS2) KartikMistry: ContentTranslation: Add protocol to cxserver URL [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241)
[06:51:34] (PS1) Jelto: gerrit: add more IP ranges to gerrit_abusers [puppet] - https://gerrit.wikimedia.org/r/1140075 (https://phabricator.wikimedia.org/T392467)
[06:56:54] !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[06:58:40] !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[06:59:27] (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5407/console" [puppet] - https://gerrit.wikimedia.org/r/1140074 (owner: Slyngshede)
[07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0700). Please do the needful.
[07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:57] (PS1) Elukey: Revert^2 "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1140078
[07:01:18] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:20] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:35] I'll go ahead with my config patch..
[07:04:01] (Abandoned) Slyngshede: P:idp Default OIDC services to FLAT profile [puppet] - https://gerrit.wikimedia.org/r/1140074 (owner: Slyngshede)
[07:04:04] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10779031 (VRiley-WMF)
[07:05:06] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10779033 (VRiley-WMF) apus-fe1003 Racked and added into netbox C2 U14
[07:05:47] (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[07:06:35] (Merged) jenkins-bot: ContentTranslation: Add protocol to cxserver URL [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[07:07:38] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1139986|ContentTranslation: Add protocol to cxserver URL (T390241)]]
[07:07:43] T390241: [Request] mediawiki.Uri is deprecated, use URL instead in ContentTranslation - https://phabricator.wikimedia.org/T390241
[07:07:55] (PS1) Slyngshede: Permission management: Add pagination to log [software/bitu] - https://gerrit.wikimedia.org/r/1140080
[07:08:31] (CR) Filippo Giunchedi: "Thank you for the patch, production has switched to Prometheus alerting for Puppet runs. The file was likely left behind as an oversight a" [puppet] - https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) (owner: Dwisehaupt)
[07:08:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1033 and es2033 to es2 masters T391921', diff saved to https://phabricator.wikimedia.org/P75674 and previous config saved to /var/cache/conftool/dbconfig/20250430-070853-marostegui.json
[07:08:59] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[07:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2031 es1026 T391921', diff saved to https://phabricator.wikimedia.org/P75675 and previous config saved to /var/cache/conftool/dbconfig/20250430-070937-marostegui.json
[07:09:48] (CR) Filippo Giunchedi: [C:+1] "+Tiziano since he's working on PDUs too JFYI. LGTM" [puppet] - https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: Papaul)
[07:10:20] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:45] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1026.eqiad.wmnet
[07:11:05] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es1026 - Upgrading es1026.eqiad.wmnet
[07:11:08] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2031.codfw.wmnet
[07:11:12] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1026 - Upgrading es1026.eqiad.wmnet
[07:11:16] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:11:28] (CR) Ilias Sarantopoulos: [C:+1] Revert^2 "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1140078 (owner: Elukey)
[07:11:29] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2031 - Upgrading es2031.codfw.wmnet
[07:11:36] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2031 - Upgrading es2031.codfw.wmnet
[07:12:11] ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10779061 (VRiley-WMF) Added devices into netbox. Need to plan for rack placment.
[07:12:14] (PS1) Marostegui: es1026: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140082 (https://phabricator.wikimedia.org/T391921)
[07:12:52] (PS1) Marostegui: wmnet: Update es2-master CNAME [dns] - https://gerrit.wikimedia.org/r/1140083 (https://phabricator.wikimedia.org/T391921)
[07:13:34] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2031 - Upgrading es2031.codfw.wmnet
[07:13:41] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2031 - Upgrading es2031.codfw.wmnet
[07:14:12] marostegui@cumin1002 upgrade (PID 519111) is awaiting input
[07:14:35] !log kartik@deploy1003 kartik: Backport for [[gerrit:1139986|ContentTranslation: Add protocol to cxserver URL (T390241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:14:39] (CR) Elukey: [C:+2] Revert^2 "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1140078 (owner: Elukey)
[07:14:39] T390241: [Request] mediawiki.Uri is deprecated, use URL instead in ContentTranslation - https://phabricator.wikimedia.org/T390241
[07:15:02] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es1026 - Upgrading es1026.eqiad.wmnet
[07:15:09] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1026 - Upgrading es1026.eqiad.wmnet
[07:15:36] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es1026.eqiad.wmnet
[07:16:27] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es2031.codfw.wmnet
[07:16:48] ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10779083 (VRiley-WMF) a:VRiley-WMF
[07:17:29] !log kartik@deploy1003 kartik: Continuing with sync
[07:18:03] FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[07:18:22] yes yes
[07:18:24] !incidents
[07:18:24] 6070 (UNACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[07:18:25] 6069 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad)
[07:18:25] 6068 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[07:18:29] !ack 6070
[07:18:29] 6070 (ACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[07:18:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:18:44] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2031.codfw.wmnet with reason: Maintenance
[07:18:51] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1026.eqiad.wmnet with reason: Maintenance
[07:19:15] there was an earlier page about another sessionstore host, which then recovered
[07:19:53] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2031.codfw.wmnet with reason: Maintenance
[07:20:14] not sure what the spikes up are in utilization, I'd guess compactions though, cc urandom
[07:20:19] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:20:57] (CR) Marostegui: [C:+2] wmnet: Update es2-master CNAME [dns] - https://gerrit.wikimedia.org/r/1140083 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:21:58] !log marostegui@dns1006 START - running authdns-update
[07:23:03] RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[07:23:58] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139986|ContentTranslation: Add protocol to cxserver URL (T390241)]] (duration: 16m 19s)
[07:24:03] T390241: [Request] mediawiki.Uri is deprecated, use URL instead in ContentTranslation - https://phabricator.wikimedia.org/T390241
[07:24:29] !log marostegui@dns1006 END - running authdns-update
[07:24:38] (CR) Marostegui: [C:+2] es1026: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140082 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:26:35] (PS1) Marostegui: es2031: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140085 (https://phabricator.wikimedia.org/T391921)
[07:27:21] (CR) Marostegui: [C:+2] es2031: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140085 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:27:38] (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[07:27:40] (PS1) Elukey: admin_ng: allow to set seccomp for Knative-based pods on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1140086 (https://phabricator.wikimedia.org/T369493)
[07:29:05] !log Finished migrating es2 to MariaDB 10.11 T391921
[07:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:10] T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[07:29:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75676 and previous config saved to /var/cache/conftool/dbconfig/20250430-072956-root.json
[07:30:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75677 and previous config saved to /var/cache/conftool/dbconfig/20250430-073009-root.json
[07:34:21] (CR) Elukey: [C:+2] admin_ng: allow to set seccomp for Knative-based pods on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1140086 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[07:35:18] !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:35:38] !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:36:15] (CR) Filippo Giunchedi: [C:+1] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866) (owner: Tiziano Fogli)
[07:37:05] (PS2) Klausman: thanos/swift: at pseudo secrets for mint_ro [labs/private] - https://gerrit.wikimedia.org/r/1140112
[07:43:05] (PS8) BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550)
[07:45:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75679 and previous config saved to /var/cache/conftool/dbconfig/20250430-074502-root.json
[07:45:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75680 and previous config saved to /var/cache/conftool/dbconfig/20250430-074515-root.json
[07:47:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet
[07:47:49] (PS1) Klausman: thanos/swift: add user for Mint, with r/o access [puppet] - https://gerrit.wikimedia.org/r/1140118
[07:47:49] (CR) Klausman: [V:+1] "The changes to the pseudo-private and actual-private repos have already been merged." [puppet] - https://gerrit.wikimedia.org/r/1140118 (owner: Klausman)
[07:47:59] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be1088.eqiad.wmnet
[07:48:03] (PS2) Klausman: thanos/swift: add user for Mint, with r/o access [puppet] - https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958)
[07:48:16] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:48:23] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:48:38] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet
[07:49:00] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:49:06] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:50:54] (CR) Brouberol: [C:+1] Update clouddumps contactgroups to reflect shared ownership [puppet] - https://gerrit.wikimedia.org/r/1139804 (owner: Btullis)
[07:51:47] (CR) Brouberol: [C:+1] "Diff looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[07:53:42] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:54:42] (PS1) Elukey: ml-services: enable seccomp defaults for ml-serve-codfw's isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1140120 (https://phabricator.wikimedia.org/T369493)
[07:55:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:55:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:58:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2230.codfw.wmnet,db1176.eqiad.wmnet with reason: Maintenance
[08:00:05] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800)
[08:00:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75681 and previous config saved to /var/cache/conftool/dbconfig/20250430-080007-root.json
[08:00:11] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1088.eqiad.wmnet
[08:00:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75682 and previous config saved to /var/cache/conftool/dbconfig/20250430-080021-root.json
[08:02:54] good morning, this is Antoine your train conductor for the day.
It is sunny outside with no errors, we will scap take off in a short time, please fasten your seat belts and watch your favorite bugs [08:03:03] (03PS1) 10MVernon: Swift: mark ms-be1060 as failed [puppet] - 10https://gerrit.wikimedia.org/r/1140121 (https://phabricator.wikimedia.org/T392796) [08:04:09] (03PS1) 10TrainBranchBot: group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140122 (https://phabricator.wikimedia.org/T386222) [08:04:10] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140122 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [08:04:27] (03CR) 10Elukey: [C:03+1] "LGTM, but please wait for the final sign-off from Data Persistence (added Matthew to the change)." [puppet] - 10https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958) (owner: 10Klausman) [08:04:57] (03CR) 10Marostegui: [C:03+1] Swift: mark ms-be1060 as failed [puppet] - 10https://gerrit.wikimedia.org/r/1140121 (https://phabricator.wikimedia.org/T392796) (owner: 10MVernon) [08:04:59] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140122 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [08:05:41] (03CR) 10MVernon: [C:03+2] Swift: mark ms-be1060 as failed [puppet] - 10https://gerrit.wikimedia.org/r/1140121 (https://phabricator.wikimedia.org/T392796) (owner: 10MVernon) [08:09:04] (03CR) 10Cyndywikime: Growth-Beta: Configure higher Impact Module edit limits for pilot wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: 10Cyndywikime) [08:10:46] 10ops-eqiad, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10779215 (10VRiley-WMF) 05Open→03Resolved Swapped out the drive. 
Checked in with @Marostegui everything seems to be good. Closing this out. [08:13:26] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5410/console" [puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:14:24] (03PS1) 10Majavah: dynamicproxy: Use lua-resty-redis from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) [08:14:30] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: use read-only object storage credentials on replicas [puppet] - 10https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [08:14:34] hashar: seat belts? on trains? [08:15:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75683 and previous config saved to /var/cache/conftool/dbconfig/20250430-081512-root.json [08:15:23] (03CR) 10Alexandros Kosiaris: [C:03+1] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - 10https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: 10Elukey) [08:15:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75684 and previous config saved to /var/cache/conftool/dbconfig/20250430-081526-root.json [08:17:05] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:17:57] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.27 refs T386222 [08:18:02] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [08:18:31] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5411/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:18:46] taavi: we are in QA, safety first!! 
:b [08:19:29] (03PS2) 10Majavah: dynamicproxy: Use lua-resty-redis from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) [08:19:38] MediaWiki\Mail\RecentChangeMailComposer::__construct(): Argument #6 ($timestamp) must be of type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.25/includes/mail/EmailNotification.php on line 222 [08:19:39] oh yeah [08:19:46] so mails are not sent [08:19:52] (I imagine) [08:20:51] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5412/co" [puppet] - 10https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:21:09] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140126 (https://phabricator.wikimedia.org/T386222) [08:21:12] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140126 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [08:21:30] I am rolling back, MediaWiki does not send recent changes email notifications anymore [08:22:08] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140126 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [08:24:41] (03PS1) 10Brouberol: airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) [08:26:55] (03CR) 10Majavah: [V:03+1 C:03+2] dynamicproxy: Use lua-resty-redis from Debian package [puppet] - 10https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:27:13] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot 
environment - https://phabricator.wikimedia.org/T392796#10779309 (10MatthewVernon) Hi, it's crashed again, after about an hour as far as I can tell (23:13:14 UTC).... [08:27:41] filed the sessionstore pages as T392989 [08:27:42] T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989 [08:28:21] (03CR) 10Anzx: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [08:28:31] jouncebot: now and next [08:28:31] For the next 1 hour(s) and 31 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800) [08:29:45] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] thanos: enable auto memlimit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: 10Filippo Giunchedi) [08:29:50] (03CR) 10Arnaudb: [C:03+2] gerrit: failover bugfix [cookbooks] - 10https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666) (owner: 10Arnaudb) [08:29:57] !log Rolled back MediaWiki train from group 1 to group 0 due to T392988 # T386222 [08:30:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:02] T392988: TypeError: MediaWiki\Mail\RecentChangeMailComposer::__construct(): Argument #6 ($timestamp) must be of type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.25/includes/mail/EmailNotification.php on line 222 - https://phabricator.wikimedia.org/T392988 [08:30:03] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [08:30:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75686 and previous config saved to /var/cache/conftool/dbconfig/20250430-083017-root.json [08:30:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 40%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P75687 and previous config saved to /var/cache/conftool/dbconfig/20250430-083032-root.json [08:33:11] !log ms-be1060 T392796 /usr/local/bin/swift_ring_manager -o /var/cache/swift_rings --doit --skip-dispersion-check --skip-replication-check --immediate-only -v [08:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:16] T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796 [08:33:42] RESOLVED: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:34:32] (03CR) 10MVernon: [C:03+2] Swift: drain ms-be2080 (prep for VLAN move) [puppet] - 10https://gerrit.wikimedia.org/r/1138830 (https://phabricator.wikimedia.org/T354872) (owner: 10MVernon) [08:35:24] !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.27 refs T386222 [08:35:29] T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222 [08:37:42] (03PS1) 10MVernon: swift: remove ms-be1060 entirely [puppet] - 10https://gerrit.wikimedia.org/r/1140130 (https://phabricator.wikimedia.org/T392796) [08:41:20] (03PS16) 10Majavah: dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) [08:41:48] (03CR) 10Klausman: [C:03+1] ml-services: enable seccomp defaults for ml-serve-codfw's isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140120 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:45:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75688 and previous config saved to 
/var/cache/conftool/dbconfig/20250430-084523-root.json [08:45:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75689 and previous config saved to /var/cache/conftool/dbconfig/20250430-084537-root.json [08:48:02] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:50:24] (03CR) 10Majavah: [C:03+2] dynamicproxy: Provision AAAA records [puppet] - 10https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: 10Majavah) [08:50:34] (03PS2) 10Anzx: mswikisource: add Karya and Gerbang namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) [08:51:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx) [08:53:15] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [08:54:10] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org [08:54:30] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10779403 (10tappof) Hi @Madalina, While I was checking, I noticed that you've already been added to the group. ` root@...:~#... 
[08:56:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10779409 (10tappof) [08:57:25] (03CR) 10Btullis: [C:03+1] airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) (owner: 10Brouberol) [08:59:34] jmm@cumin2002 drain-node (PID 3548900) is awaiting input [09:00:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org [09:00:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75690 and previous config saved to /var/cache/conftool/dbconfig/20250430-090028-root.json [09:00:30] (03CR) 10Stevemunene: [C:03+1] airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) (owner: 10Brouberol) [09:00:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75691 and previous config saved to /var/cache/conftool/dbconfig/20250430-090041-root.json [09:01:27] jouncebot: nowandnext [09:01:27] For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800) [09:01:27] In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000) [09:01:49] hashar: Any opposition for me to deploy a security patch now? 
[09:02:21] (03CR) 10Brouberol: [C:03+2] airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) (owner: 10Brouberol) [09:02:52] FYI, aux-k8s-etcd2004 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees) [09:02:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [09:03:58] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs1013.eqiad.wmnet} and A:liberica [09:04:18] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs1013.eqiad.wmnet} and A:liberica [09:05:28] PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [09:07:10] PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:09:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [09:10:09] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet [09:10:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [09:10:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast7001.wikimedia.org [09:10:30] RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms [09:12:36] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet [09:13:10] RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:13:42] FIRING: [7x] ProbeDown: Service ganeti2019:1811 has failed probes (tcp_ganeti_noded_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:15:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75692 and previous config saved to /var/cache/conftool/dbconfig/20250430-091534-root.json [09:15:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75693 and previous config saved to /var/cache/conftool/dbconfig/20250430-091547-root.json [09:16:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast7001.wikimedia.org [09:16:31] (03PS2) 10Majavah: dynamicproxy: expose api on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) [09:16:33] (03CR) 10Elukey: [C:03+2] ml-services: enable seccomp defaults for ml-serve-codfw's isvcs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140120 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:17:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [09:17:36] !log manual restart of the waterline service on maps1009 [09:17:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:16] (03CR) 10CI reject: [V:04-1] dynamicproxy: expose api on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) (owner: 10Majavah) [09:18:20] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:19:29] (03PS3) 10Majavah: dynamicproxy: expose api on port 443 [puppet] - 10https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) [09:22:45] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:23:59] jmm@cumin2002 drain-node (PID 3573657) is awaiting input [09:24:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [09:24:44] (03PS1) 10Filippo Giunchedi: thanos: move to native trace sampling 0.1% [puppet] - 10https://gerrit.wikimedia.org/r/1140135 (https://phabricator.wikimedia.org/T392994) [09:25:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10779532 (10fnegri) [09:26:33] (03PS1) 10Brouberol: dse-k8s-eqiad: substantially increase job sidecar controller CPU resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140136 (https://phabricator.wikimedia.org/T392995) [09:26:43] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' . [09:27:56] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' . 
[09:28:44] !log bounce prometheus-statsd-exporter on stat1011 - T389344 [09:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:48] T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344 [09:28:55] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' . [09:29:48] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' . [09:30:29] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [09:30:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75694 and previous config saved to /var/cache/conftool/dbconfig/20250430-093040-root.json [09:30:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75695 and previous config saved to /var/cache/conftool/dbconfig/20250430-093053-root.json [09:31:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [09:31:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [09:31:48] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [09:32:32] (03CR) 10Klausman: [V:03+2 C:03+2] admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - 10https://gerrit.wikimedia.org/r/1139779 (owner: 10Klausman) [09:32:54] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' . 
[09:33:42] FIRING: [7x] ProbeDown: Service ganeti2020:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:44] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' . [09:35:09] !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet [09:35:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10779603 (10fnegri) 05Resolved→03Open Reopening as unfortunately the alert is still flapping. It looks like the whole rack's temperatu... [09:35:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:36:10] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' . [09:36:34] the high latency for mlserve is me, I am deploying a lot of services, going to pause for a sec [09:38:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [09:38:10] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [09:38:47] !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' . 
[09:40:31] (03CR) 10Btullis: [C:03+1] dse-k8s-eqiad: substantially increase job sidecar controller CPU resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140136 (https://phabricator.wikimedia.org/T392995) (owner: 10Brouberol) [09:41:16] (03CR) 10Superpes15: "It seems that you didn't run tox as indicated on logos/README.md! Did you follow the steps provided??" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [09:41:54] !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet [09:41:58] (03PS1) 10Elukey: admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140140 (https://phabricator.wikimedia.org/T369493) [09:42:51] (03CR) 10Muehlenhoff: [C:03+2] Add krb1002 to the list of KDCs presented to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [09:42:58] (03CR) 10Superpes15: [C:04-1] "Please follow logos/README.md when you try to change a logo (you need to use tox)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [09:44:14] (03CR) 10FNegri: [C:03+1] "SGTM. If I understand correctly how contactgroups work, this will only affect Icinga alerts? 
For example we currently have a flapping aler" [puppet] - 10https://gerrit.wikimedia.org/r/1139804 (owner: 10Btullis) [09:44:58] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.clone: speed up pooling in [cookbooks] - 10https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: 10Federico Ceratto) [09:45:00] jmm@cumin2002 drain-node (PID 3595577) is awaiting input [09:45:10] (03CR) 10Arnaudb: [C:03+1] gerrit: add more IP ranges to gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1140075 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [09:45:21] (03CR) 10Federico Ceratto: [V:03+2 C:03+2] sre.mysql.clone: speed up pooling in [cookbooks] - 10https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: 10Federico Ceratto) [09:45:49] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10779631 (10Jelto) I switched the replica to use the read-only credentials but unfortunately I get a `AccessDenied` error when acce... 
[09:45:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:46:28] (03CR) 10Jelto: [C:03+2] gerrit: add more IP ranges to gerrit_abusers [puppet] - 10https://gerrit.wikimedia.org/r/1140075 (https://phabricator.wikimedia.org/T392467) (owner: 10Jelto) [09:48:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [09:50:26] (03CR) 10Vgutierrez: varnish: Replace X-IS-ALT-DOMAIN with variable (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [09:51:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10779640 (10Stevemunene) [09:52:34] !jouncebot nowandnext [09:52:35] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [09:52:48] jouncebot: nowandnext [09:52:48] For the next 0 hour(s) and 7 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800) [09:52:48] In 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000) [09:53:09] hashar: can I deploy a security patch now? [09:54:45] kostajh: yes sure [09:54:50] do note I have rolled back the train this morning [09:54:59] https://versions.toolforge.org/ [09:55:22] so we are mostly still on wmf.25. After lunch I will revisit the blocker and see whether it might have been a red herring [09:55:25] I am off for lunch: [09:55:26] !
[09:55:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [09:55:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [09:55:56] so if you need assistance, we can do it this afternoon :) [09:58:42] FIRING: [7x] ProbeDown: Service ganeti2021:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:58:47] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000) [10:01:03] (03CR) 10FNegri: [C:03+1] P:toolforge: disable_tool: Don't log diffs with secrets [puppet] - 10https://gerrit.wikimedia.org/r/1139443 (owner: 10Majavah) [10:01:26] (03PS1) 10Muehlenhoff: Stop passing krb2002 to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1140142 (https://phabricator.wikimedia.org/T390863) [10:01:28] (03PS1) 10Muehlenhoff: Switch krb2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1140143 (https://phabricator.wikimedia.org/T390863) [10:01:57] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet [10:04:16] hopefully no assistance needed :) [10:04:20] I'm starting the deploy now [10:04:34] (03CR) 10Hnowlan: [C:03+1] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [10:04:38] (03CR) 10FNegri: [C:03+1] "The 
alertmanager team routing is currently at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hie" [puppet] - 10https://gerrit.wikimedia.org/r/1139804 (owner: 10Btullis) [10:04:55] (03PS2) 10Federico Ceratto: sre.mysql.pool: remove connection count [cookbooks] - 10https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) [10:04:55] (03CR) 10Federico Ceratto: "A small cleanup." [cookbooks] - 10https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: 10Federico Ceratto) [10:06:22] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:06:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet [10:07:18] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:13:00] (03CR) 10Majavah: [C:03+2] P:toolforge: disable_tool: Don't log diffs with secrets [puppet] - 10https://gerrit.wikimedia.org/r/1139443 (owner: 10Majavah) [10:13:16] (03PS1) 10Effie Mouzeli: admin: move jiji to ops-limited Bug: T392998 [puppet] - 10https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) [10:13:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet [10:13:42] FIRING: [7x] ProbeDown: Service ganeti2022:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:13:54] (03PS2) 10Effie Mouzeli: admin: move jiji to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) [10:13:56] (03CR) 10CI reject: [V:04-1] admin: move jiji to ops-limited 
[puppet] - 10https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) (owner: 10Effie Mouzeli) [10:13:56] syncing now [10:14:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet [10:15:56] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database nupwiki (T390714) [10:16:00] T390714: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714 [10:16:06] !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database nupwiki (T390714) [10:16:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet [10:17:44] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) (owner: 10Effie Mouzeli) [10:19:24] RECOVERY - MegaRAID on db1171 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:23:47] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet [10:24:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet [10:32:33] (03CR) 10Hnowlan: [C:03+2] mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [10:32:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet [10:33:12] jouncebot: nowandnext [10:33:13] For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000) [10:33:13] In 0 hour(s) and 26 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1100)
[10:33:22] !log installing curl security updates [10:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:41] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10779759 (10MatthewVernon) I think I found the relevant request - was this about 08:33 UTC today (and then 09:07 and 09:27)? ` Apr... [10:37:09] hnowlan: I'm deploying a security patch [10:37:45] hashar: our patch had an issue, so we're making an update to it, and will sync that. Then we'll sync another patch to wmf.27. [10:39:32] kostajh: ack, I don't have any conflicts [10:39:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet [10:40:07] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet [10:40:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet [10:40:17] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2033.codfw.wmnet [10:40:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:41:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet [10:41:48] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:43:42] FIRING: [7x] ProbeDown: Service ganeti2031:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:45:35] jmm@cumin2002 drain-node (PID 3659261) is awaiting input [10:45:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet [10:46:40] (03CR) 10Brouberol: [C:03+2] dse-k8s-eqiad: substantially increase job sidecar controller CPU resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140136 (https://phabricator.wikimedia.org/T392995) (owner: 10Brouberol) [10:46:53] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [10:47:01] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [10:48:20] !log remove cloudcontrol1005 (decom) from eqiad/codfw core routers [10:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:49:33] (03CR) 10MVernon: [C:04-1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958) (owner: 10Klausman) [10:51:05] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:51:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet [10:51:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet [10:51:25] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:51:29] (03PS1) 10Hnowlan: mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) [10:51:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet [10:52:01] (03CR) 10MVernon: [C:04-1] "said change is https://gerrit.wikimedia.org/r/c/labs/private/+/1140112 (I'm just noting this here so I can find it later if needed)." [puppet] - 10https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958) (owner: 10Klausman) [10:52:31] (03PS1) 10Ayounsi: cr3/4-ulsfo: Set et-0/0/0.0 OSPF metric to 100 [homer/public] - 10https://gerrit.wikimedia.org/r/1140153 (https://phabricator.wikimedia.org/T390731) [10:52:45] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [10:53:18] (03CR) 10Ayounsi: [C:03+2] "Self merging as it should result in a NOOP." [homer/public] - 10https://gerrit.wikimedia.org/r/1140153 (https://phabricator.wikimedia.org/T390731) (owner: 10Ayounsi) [10:53:58] (03Merged) 10jenkins-bot: cr3/4-ulsfo: Set et-0/0/0.0 OSPF metric to 100 [homer/public] - 10https://gerrit.wikimedia.org/r/1140153 (https://phabricator.wikimedia.org/T390731) (owner: 10Ayounsi) [10:55:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet [10:59:48] (03CR) 10Vgutierrez: [C:03+2] varnish: Add basic edge uniques handling [puppet] - 10https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [11:00:04] mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1100). 
[11:01:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet [11:01:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet [11:03:31] (03CR) 10Jelto: [C:03+2] make helm3 alternative entry dependent on helm [debs/helm3] - 10https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [11:04:58] syncing the updated patch to wmf.25 [11:07:39] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#10779851 (10ayounsi) Another tiny improvement would be to only prompt for yes/no when there is only 1 target device. [11:08:34] jouncebot: nowandnext [11:08:34] For the next 0 hour(s) and 51 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1100) [11:08:34] In 1 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1300) [11:09:50] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - 10https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [11:10:11] (03CR) 10Ladsgroup: [C:03+1] "I haven't tested it but looks good to me." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: 10Federico Ceratto) [11:10:12] (03Abandoned) 10Hnowlan: mw:maintenance: migrate all updatequerypages_ancientpages jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139438 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [11:16:15] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:16:22] (03PS1) 10Mvolz: Update zotero package-lock.json [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140160 [11:16:24] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:17:09] !log "Imported helm317 3.17.0-2 to bullseye-wikimedia and bookworm-wikimedia - T387548" [11:17:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:14] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [11:18:16] (03CR) 10Mvolz: [C:03+2] Update zotero package-lock.json [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140160 (owner: 10Mvolz) [11:18:21] finished with the sync to wmf.25, moving on to wmf.27 [11:18:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10779867 (10Stevemunene) The Hosts an-worker116[6-8] are verified with puppet disabled, and the steps followed ` stevem... 
[11:18:58] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10779868 (10Stevemunene) [11:19:50] (03Merged) 10jenkins-bot: Update zotero package-lock.json [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140160 (owner: 10Mvolz) [11:19:57] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001 (10ayounsi) 03NEW [11:20:55] !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply [11:21:19] !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply [11:21:53] !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply [11:22:22] !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply [11:23:23] !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply [11:23:52] !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply [11:26:46] (03PS1) 10Ayounsi: gNMIc start collecting data from pfw [puppet] - 10https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052) [11:27:24] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device pfw1a-eqiad [11:29:38] (03PS2) 10Ayounsi: gNMIc start collecting data from pfw [puppet] - 10https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052) [11:29:44] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device pfw1a-eqiad [11:30:29] (03PS1) 10Ayounsi: Enable gNMI on pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1140163 (https://phabricator.wikimedia.org/T390052) [11:31:05] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [11:32:09] syncing to wmf.27 now 
[11:34:28] (03Abandoned) 10Ayounsi: Enable gNMI on pfw devices [homer/public] - 10https://gerrit.wikimedia.org/r/1140163 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [11:34:55] (03CR) 10Federico Ceratto: [C:03+1] sre.mysql.pool: remove connection count [cookbooks] - 10https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: 10Federico Ceratto) [11:34:57] (03CR) 10Federico Ceratto: [C:03+2] sre.mysql.pool: remove connection count [cookbooks] - 10https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: 10Federico Ceratto) [11:36:11] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet [11:37:22] !log enable gnmi on pfw1-eqiad - T390052 [11:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:27] T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 [11:38:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance [11:38:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T392806)', diff saved to https://phabricator.wikimedia.org/P75696 and previous config saved to /var/cache/conftool/dbconfig/20250430-113838-fceratto.json [11:42:14] jmm@cumin2002 drain-node (PID 3714991) is awaiting input [11:43:24] FYI, ml-etcd2002 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees) [11:43:31] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet [11:45:10] done with wmf.27 [11:45:28] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:57] !log Deployed patches for T392976 to wmf.25 and wmf.27 [11:46:00] (03PS1) 10Jelto: helm: remove duplicate alternatives::select entry [puppet] - 10https://gerrit.wikimedia.org/r/1140164 (https://phabricator.wikimedia.org/T387548) 
[11:46:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:47:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T392806)', diff saved to https://phabricator.wikimedia.org/P75697 and previous config saved to /var/cache/conftool/dbconfig/20250430-114734-fceratto.json [11:48:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet [11:48:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet [11:50:30] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms [11:51:06] (03CR) 10Ayounsi: [C:03+2] gNMIc start collecting data from pfw [puppet] - 10https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [11:51:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [11:53:14] hashar: I'm done with the security patches [11:53:52] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:54:50] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:55:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [11:58:18] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [12:01:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet [12:02:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [12:02:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:02:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [12:02:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P75698 and previous config saved to /var/cache/conftool/dbconfig/20250430-120242-fceratto.json [12:03:40] checking [12:03:44] !incidents [12:03:44] 6071 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [12:03:44] 6072 (UNACKED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [12:03:44] 6070 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw) [12:03:45] 6069 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore 
data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad) [12:03:45] 6068 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [12:03:51] !ack 6072 [12:03:52] 6072 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [12:04:30] godog: analytics job [12:04:37] hah! thank you XioNoX [12:04:50] anything actionable atm ? [12:05:11] godog: pinging the person who ran it and asking them to stop ideally [12:05:24] with QoS the impact might be lower, checking [12:05:59] ok I'll look at how to identify analytics jobs [12:06:42] godog: looks like we're dropping "normal" queue packets, so that's not ideal [12:07:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [12:07:31] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [12:07:33] it also spiked and went down, so looks good for now [12:08:22] indeed, might come back I'd guess [12:08:30] jmm@cumin2002 drain-node (PID 3740759) is awaiting input [12:08:32] FWIW what I'm looking at is https://yarn.wikimedia.org/cluster/apps/RUNNING [12:08:49] it is quite opaque to me tho [12:13:11] similarly opaque is https://airflow.wikimedia.org [12:17:28] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:44] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:17:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db1159', diff saved to https://phabricator.wikimedia.org/P75699 and previous config saved to /var/cache/conftool/dbconfig/20250430-121749-fceratto.json [12:18:18] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:34] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:18:42] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:18:44] !log test `host-inbound-traffic system-services any-service` on mr1-ulsfo [12:18:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:27] (03PS3) 10Arnaudb: gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) [12:24:10] (03PS1) 10Majavah: admin: Temporarily remove Taavi's access [puppet] - 10https://gerrit.wikimedia.org/r/1140171 (https://phabricator.wikimedia.org/T393000) [12:24:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet [12:24:59] FYI, aux-k8s-etcd2005 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees) [12:25:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet [12:27:00] PROBLEM - Host aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:28:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet [12:30:28] RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 
30.56 ms [12:30:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet [12:30:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet [12:32:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T392806)', diff saved to https://phabricator.wikimedia.org/P75700 and previous config saved to /var/cache/conftool/dbconfig/20250430-123255-fceratto.json [12:33:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance [12:33:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:33:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T392806)', diff saved to https://phabricator.wikimedia.org/P75701 and previous config saved to /var/cache/conftool/dbconfig/20250430-123327-fceratto.json [12:36:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [12:37:18] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet [12:37:55] (03PS1) 10Majavah: common: Temporarily remove some keys [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) [12:38:09] jouncebot: now and next [12:38:10] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [12:38:36] I'll reboot alert2002 [12:38:45] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org [12:38:46] !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org [12:40:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T392806)', 
diff saved to https://phabricator.wikimedia.org/P75702 and previous config saved to /var/cache/conftool/dbconfig/20250430-124018-fceratto.json [12:41:36] yeah reboot-single doesn't work for alert hosts because they are not in icinga [12:41:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [12:42:00] !log filippo@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on alert2002.wikimedia.org with reason: kernel [12:42:49] jmm@cumin2002 drain-node (PID 3776728) is awaiting input [12:43:01] FYI, kubestagemaster2004 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees) [12:43:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet [12:43:08] !log filippo@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on alert2002.wikimedia.org with reason: new kernel [12:43:31] siiigh ok can't downtime manually even with --force, I'll just do it [12:43:36] moritzm: ack [12:43:44] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [12:43:58] (03CR) 10Btullis: [V:03+1 C:03+2] Update clouddumps contactgroups to reflect shared ownership [puppet] - 10https://gerrit.wikimedia.org/r/1139804 (owner: 10Btullis) [12:44:08] !log reboot alert2002 [12:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:03] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Add private values files to resources deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [12:45:32] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:45:42] (03PS3) 10Btullis: Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - 10https://gerrit.wikimedia.org/r/1135472 [12:46:34] (03Merged) 10jenkins-bot: 
mediawiki-dumps-legacy: Add private values files to resources deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [12:46:53] (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans) [12:47:07] (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - 10https://gerrit.wikimedia.org/r/1135472 (owner: 10Btullis) [12:47:29] (03Abandoned) 10Btullis: Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - 10https://gerrit.wikimedia.org/r/1135472 (owner: 10Btullis) [12:48:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet [12:48:42] FIRING: [8x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:48:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet [12:49:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [12:49:37] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:49:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:49:47] (03CR) 10Jelto: gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:49:57] FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - 
https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:50:07] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet [12:50:28] FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:50:35] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms [12:50:54] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [12:50:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [12:51:12] (03PS5) 10Arnaudb: gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) [12:51:36] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5416/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:51:58] (03CR) 10Arnaudb: "Thanks @jwodstrcil@wikimedia.org for confirming there was something fishy" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:52:41] FIRING: [8x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:53:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet [12:53:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host 
netflow6001.drmrs.wmnet [12:54:05] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5417/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [12:54:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:55:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P75703 and previous config saved to /var/cache/conftool/dbconfig/20250430-125525-fceratto.json [12:55:28] RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:57:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [12:58:48] 06SRE, 06serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10780146 (10Ladsgroup) 05Open→03Resolved Boldly closing: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=mwmaint1002&var-datasource=thanos&var-cluster=misc&from=... [12:58:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet [12:59:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet [12:59:45] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM so far, though there might be a followup later (see my comment on the task)." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1300). [13:00:05] anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:11] o/ [13:00:39] o/ [13:00:46] anzx: do you mind if I update the commit message to add the english namespace aliases as well? makes for a more useful git log imho :) [13:00:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7001.magru.wmnet [13:00:56] (e.g. I just looked for some other commits adding portal namespaces for reference ^^) [13:01:12] anyway, I can deploy [13:01:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet [13:01:36] Lucas_WMDE: sure [13:01:58] (03PS3) 10Lucas Werkmeister (WMDE): mswikisource: add Karya (Work) and Gerbang (Portal) namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx) [13:03:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx) [13:03:08] (03CR) 10Elukey: [C:03+1] Stop passing krb2002 to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1140142 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:03:21] (03CR) 10Elukey: [C:03+1] Switch krb2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1140143 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff) [13:03:52] (03Merged) 10jenkins-bot: mswikisource: add Karya 
(Work) and Gerbang (Portal) namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx) [13:04:16] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1140129|mswikisource: add Karya (Work) and Gerbang (Portal) namespaces (T392984)]] [13:04:21] T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984 [13:04:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7001.magru.wmnet [13:04:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet [13:06:08] (03CR) 10Jelto: [V:03+1] gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:07:34] !log stevemunene@deploy1003 Started deploy [analytics/refinery@ea1cff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ea1cff2c] [13:07:59] !log Deploying Refinery at 1136103: Add mad.wikisource to pageview allowlist | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1136103 T391767 [13:07:59] !log deploying refinery at 1138395: Add rki.wikipedia to pageview allowlist | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1138395 T392499 [13:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:03] T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767 [13:08:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:08] T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499 [13:09:06] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1140129|mswikisource: add Karya (Work) and Gerbang (Portal) namespaces (T392984)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:09:08] Lucas_WMDE: 
checking [13:09:09] !log stevemunene@deploy1003 Finished deploy [analytics/refinery@ea1cff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ea1cff2c] (duration: 01m 35s) [13:09:50] Lucas_WMDE: looks good [13:09:51] !log stevemunene@deploy1003 Started deploy [analytics/refinery@ea1cff2]: Regular analytics weekly train [analytics/refinery@ea1cff2c] [13:09:55] !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync [13:09:57] great, thanks! [13:10:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet [13:10:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet [13:10:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P75704 and previous config saved to /var/cache/conftool/dbconfig/20250430-131032-fceratto.json [13:11:11] !log adjust fundraising NAT policies - T392843 [13:11:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:22] (03PS6) 10Arnaudb: gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) [13:11:56] (03CR) 10Arnaudb: gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:12:21] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:12:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2042.codfw.wmnet [13:13:02] (03PS2) 10Hnowlan: mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) [13:13:17] !log stevemunene@deploy1003 
Finished deploy [analytics/refinery@ea1cff2]: Regular analytics weekly train [analytics/refinery@ea1cff2c] (duration: 03m 25s) [13:13:25] anyone wanna +1 (some of) the changes in https://phabricator.wikimedia.org/T392819? then I could roll those out as well [13:13:42] (03CR) 10Hashar: gerrit: split Gerrit and Gitiles proxy pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [13:14:28] (03PS1) 10Btullis: mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) [13:14:58] (03CR) 10Brouberol: [C:03+1] mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [13:16:21] !log stevemunene@deploy1003 Started deploy [analytics/refinery@ea1cff2] (thin): Regular analytics weekly train THIN [analytics/refinery@ea1cff2c] [13:16:26] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140129|mswikisource: add Karya (Work) and Gerbang (Portal) namespaces (T392984)]] (duration: 12m 10s) [13:16:33] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-cluster [13:16:33] T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984 [13:17:46] !log stevemunene@deploy1003 Finished deploy [analytics/refinery@ea1cff2] (thin): Regular analytics weekly train THIN [analytics/refinery@ea1cff2c] (duration: 01m 24s) [13:17:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet [13:18:05] Lucas_WMDE: thank you for deploying, i will create patch for defaultsearchnamespace if local wiki member says on phab task it's ok to add [13:18:22] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New
Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10780239 (10ArthurPSmith) Hi - has this been done yet? I'm ready to test it on live Wi... [13:20:01] anzx: sounds good to me, thanks! [13:20:14] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: Maintenance [13:20:47] (03CR) 10Bking: [C:03+2] cirrus: re-enable completion index rebuild in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1139518 (owner: 10DCausse) [13:20:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2213.codfw.wmnet with reason: Maintenance [13:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [13:23:03] FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh [13:23:05] !log jnuche@deploy1003 Installing scap version "4.158.0" for 2 host(s) [13:23:07] hello [13:23:09] !incidents [13:23:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet [13:23:09] 6073 (UNACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad) [13:23:10] 6072 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [13:23:10] 6071 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) 
[13:23:10] 6070 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw) [13:23:10] 6069 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad) [13:23:11] 6068 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [13:23:14] !ack 6073 [13:23:14] 6073 (ACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad) [13:23:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet [13:23:33] wow, an actual runbook link! [13:23:39] sukhe: I'm in a meeting, though that paged earlier today too, https://phabricator.wikimedia.org/T392989 [13:23:44] ah thanks godog [13:24:05] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel) [13:24:19] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet [13:24:42] Lucas_WMDE: i forgot, could you run namespacedupes [13:24:54] !log jnuche@deploy1003 Installation of scap version "4.158.0" completed for 2 hosts [13:25:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T392806)', diff saved to https://phabricator.wikimedia.org/P75705 and previous config saved to /var/cache/conftool/dbconfig/20250430-132539-fceratto.json [13:25:55] oh right [13:25:56] one sec [13:25:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance [13:26:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T392806)', diff saved to https://phabricator.wikimedia.org/P75706 and previous config saved to 
/var/cache/conftool/dbconfig/20250430-132604-fceratto.json [13:26:13] (03PS1) 10Muehlenhoff: Add library hint for libcap2 [puppet] - 10https://gerrit.wikimedia.org/r/1140180 [13:26:14] sukhe: are you on it or should I? Don't want to step on each other's toes [13:26:15] (03CR) 10Jelto: [C:03+1] "lgtm now, thanks! I think `ssh_allowed_hosts` can be reduced to the production host only. But that's something we can test after the switc" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:26:21] there are no pages, but just to be safe [13:26:33] (03CR) 10Hashar: gerrit: lower connections to Gitiles from 25 to 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar) [13:26:46] the script says there are four :P [13:26:51] and 106 links [13:26:54] (03CR) 10Tiziano Fogli: [C:03+1] Add new PDU's in row E and F [puppet] - 10https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: 10Papaul) [13:27:39] !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --comment=T392984 --follow -- namespaceDupes mswikisource --fix | tee T392984 [13:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:43] T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984 [13:27:47] Raine: thanks, I am trying to figure out what to do here [13:28:07] sukhe: ack, same here :D [13:28:30] anzx: done, thanks for the reminder! 
[13:28:42] (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [13:29:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [13:29:08] Raine: it's trending downwards at least [13:29:30] Lucas_WMDE: thanks, i didn't check for English names so i thought no pages were present [13:29:37] (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libcap2 [puppet] - 10https://gerrit.wikimedia.org/r/1140180 (owner: 10Muehlenhoff) [13:29:48] (I did a dry run of cleanupTitles just to check but there’s nothing to do there) [13:29:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet [13:31:02] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet [13:31:25] (03CR) 10Bking: [C:03+1] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [13:32:07] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10780327 (10tappof) @wiki_willy, please take a look at {T387866}. This will change how the row label is set and will also fix t...
[13:32:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance [13:32:58] sukhe: yeah, it is right now, though the last 7 days have been a bit higher than before [13:32:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T392806)', diff saved to https://phabricator.wikimedia.org/P75707 and previous config saved to /var/cache/conftool/dbconfig/20250430-133258-fceratto.json [13:33:03] RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh [13:33:08] ok then I guess :) [13:33:11] I will update the task [13:33:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: Maintenance [13:33:15] we've had this 2ish weeks ago and I'm not sure what the followup was [13:33:27] (other than creating the runbook) [13:33:29] yeah godo.g shared the task above [13:33:36] ah, right, thank you [13:34:57] (03CR) 10Tiziano Fogli: [C:03+1] "LGTM. 
While these definitions will no longer be needed for dashboarding purposes after merging https://gerrit.wikimedia.org/r/c/operations" [puppet] - 10https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: 10Papaul) [13:35:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet [13:35:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet [13:35:45] !log invoking `nodetool garbagecollect` on sessionstore1004 — T392989, T390514 [13:35:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:49] T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989 [13:36:32] ah thanks urandom :) [13:36:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet [13:36:49] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet [13:37:53] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan) [13:38:30] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet [13:40:52] sukhe: it's mainly diagnostic, I'm not sure if it will do anything (and this isn't The Way™ even if it does) [13:41:08] well, certainly better you running these vs at least me I guess :) [13:42:51] jmm@cumin2002 drain-node (PID 3838318) is awaiting input [13:43:21] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [13:43:31] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:44:00] indeed, thanks urandom ! 
[13:44:20] inflatador: I see you're reenabling an mw cronjob - if you were feeling adventurous and have more to do, we're currently migrating stuff to mw-cron (https://wikitech.wikimedia.org/wiki/Mw-cron_jobs) [13:44:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet [13:45:34] hnowlan Let me get a ticket started for that. We are migrating everything to OpenSearch, so this might be a good time to evaluate the mw-cron stuff more broadly [13:45:44] inflatador: nice! [13:46:15] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet [13:47:23] FYI, kubestagemaster2003 and ml-etcd2003 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees) [13:47:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet [13:48:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P75708 and previous config saved to /var/cache/conftool/dbconfig/20250430-134805-fceratto.json [13:48:19] (03CR) 10David Caro: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah) [13:48:31] (03CR) 10David Caro: [C:03+1] admin: Temporarily remove Taavi's access [puppet] - 10https://gerrit.wikimedia.org/r/1140171 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah) [13:48:34] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:49:22] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:49:30] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:49:58] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [13:50:05] (03PS1) 10David Caro: admin: temporarily remove dcaro access [puppet] - 10https://gerrit.wikimedia.org/r/1140181 (https://phabricator.wikimedia.org/T393000) [13:50:10] (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [13:50:10] guess I’m not getting a review for those patches in this window [13:50:15] !log UTC afternoon backport+config window done [13:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:34] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms [13:50:46] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms [13:50:57] (03CR) 10Majavah: [C:03+2] common: Temporarily remove some keys [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah) [13:51:11] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10780399 (10Jelto) Thank you @MatthewVernon for digging into the logs. It was a bit tricky for me to find the actual path in the bu... 
[13:51:36] (03Merged) 10jenkins-bot: common: Temporarily remove some keys [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah) [13:51:44] (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis) [13:52:21] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet [13:52:42] FIRING: [8x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:52:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet [13:52:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet [13:54:41] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-cluster [13:55:16] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [13:55:55] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [13:57:07] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10780450 (10Eevans) >>! In T391544#10745829, @Eevans wrote: > > [ ... ] > > The goal would be to make this a... 
[13:57:47] (03PS1) 10Bartosz Dziewoński: EnotifNotifyJob: Forward-compat for wmf.27 jobs [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) [13:57:53] (03CR) 10Papaul: [C:03+2] Add new PDU's in row E and F [puppet] - 10https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: 10Papaul) [13:58:42] FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1400) [14:00:49] !log installing libcap2 security updates [14:00:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:54] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: Maintenance [14:02:02] (03CR) 10Kamila Součková: [C:03+1] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French) [14:02:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet [14:02:58] (03CR) 10Kamila Součková: [C:03+1] helm: remove duplicate alternatives::select entry [puppet] - 10https://gerrit.wikimedia.org/r/1140164 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [14:03:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P75709 and previous config saved to /var/cache/conftool/dbconfig/20250430-140312-fceratto.json [14:04:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 
on db2209.codfw.wmnet with reason: Maintenance [14:06:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet [14:06:38] jouncebot: nowandnext [14:06:38] For the next 0 hour(s) and 53 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1400) [14:06:38] In 2 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700) [14:08:42] FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:09:03] FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh [14:09:24] !incidents [14:09:24] You're not allowed to perform this action. [14:09:24] (03CR) 10Muehlenhoff: [C:03+1] "That looks good, but given that confd is an internal tool, let's maybe also create a task to fix the underlying behaviour? 
Ideally confd s" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway) [14:09:30] oh XD [14:09:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet [14:09:51] I'm in a meeting, though see T392989 [14:09:52] T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989 [14:09:55] !incidents [14:09:56] 6074 (UNACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2006:9100 node /srv codfw) [14:09:56] 6073 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad) [14:09:56] 6072 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqiad.wikimedia.org) [14:09:56] 6071 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet) [14:09:56] 6070 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw) [14:09:57] 6069 (RESOLVED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad) [14:09:57] 6068 (RESOLVED) [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw) [14:10:01] !ack 6074 [14:10:01] 6074 (ACKED) SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2006:9100 node /srv codfw) [14:10:11] cc urandom :( [14:11:27] sorry... 
[14:11:29] working on it [14:11:37] maybe we can create a silence [14:11:51] !log failover Ganeti master in codfw to ganeti2021 [14:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:00] thanks folks [14:12:14] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:12:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet [14:12:24] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:13:52] urandom: I can create the silence, how long do you think? [14:14:03] RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh [14:14:05] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [14:14:30] PROBLEM - ganeti-wconfd running on ganeti2032 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:14:35] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet [14:16:07] (03PS1) 10Hnowlan: mw::maintenance: move mostrevisions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) [14:17:29] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10780516 (10MatthewVernon) >>! In T391544#10749423, @Eevans wrote: >>>! In T391544#10746698, @MatthewVernon wr... 
[14:17:46] (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [14:18:17] (03PS1) 10Ssingh: wikimedia-ech.org: update zone file and add A/AAAA records [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) [14:18:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T392806)', diff saved to https://phabricator.wikimedia.org/P75710 and previous config saved to /var/cache/conftool/dbconfig/20250430-141819-fceratto.json [14:18:38] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance [14:18:42] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:18:46] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T392806)', diff saved to https://phabricator.wikimedia.org/P75711 and previous config saved to /var/cache/conftool/dbconfig/20250430-141845-fceratto.json [14:19:10] * Raine creating a silence for the SessionStore alerts for 12h [14:19:10] (03CR) 10Ssingh: "sigh, wrong file 😞" [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:20:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet [14:21:39] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet [14:22:15] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10780556 (10Jhancock.wm) [14:22:43] (03PS1) 10Máté Szabó: popup: Fix target user 
name for expired temporary account links [extensions/IPInfo] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140191 (https://phabricator.wikimedia.org/T393002) [14:23:05] (03Abandoned) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [14:23:10] (03Abandoned) 10GergesShamon: Change Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [14:23:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140191 (https://phabricator.wikimedia.org/T393002) (owner: 10Máté Szabó) [14:23:54] (03PS2) 10Ssingh: wikimedia-ech.org: update zone file and add A/AAAA records [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) [14:24:38] 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10780574 (10Jhancock.wm) @Andrew i can't use the offline script in netbox. looks like some of the interfaces are a little too complicated for... [14:25:40] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 
3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:26:37] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T392806)', diff saved to https://phabricator.wikimedia.org/P75712 and previous config saved to /var/cache/conftool/dbconfig/20250430-142636-fceratto.json [14:26:44] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:27:15] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet [14:28:29] (03CR) 10Ssingh: "From the durum host:" [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:28:32] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [14:29:33] !log installing ruby2.7 security updates [14:29:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:32] (03PS1) 10GergesShamon: [arwiki] Change logo and tagline with sync wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858) [14:31:44] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015 (10RobH) 03NEW [14:32:14] 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10780630 (10RobH) [14:32:56] (03PS1) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) [14:33:41] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [14:33:47] (03PS1) 10AikoChou: ml-services: add edit-check-cpu isvc for debugging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1140195 [14:34:48] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [14:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:35:34] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon) [14:35:56] (03Merged) 10jenkins-bot: popup: Fix target user name for expired temporary account links [extensions/IPInfo] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140191 (https://phabricator.wikimedia.org/T393002) (owner: 10Máté Szabó) [14:36:24] !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1140191|popup: Fix target user name for expired temporary account links (T393002)]] [14:36:29] T393002: IPInfo: IPInfo popup for expired temporary accounts is not working - https://phabricator.wikimedia.org/T393002 [14:37:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10780656 (10fnegri) [14:39:30] FIRING: Emergency syslog message: Alert for device ssw1-e1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:39:37] (03CR) 10Bking: [C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [14:39:42] (03PS10) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 
10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) [14:39:45] (03CR) 10Bking: [V:03+2 C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [14:39:52] (03CR) 10Federico Ceratto: "Small update." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [14:39:55] (03PS3) 10Ebernhardson: Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) [14:40:00] (03CR) 10Bking: [C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [14:40:02] (03CR) 10Bking: [V:03+2 C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson) [14:40:29] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1236.eqiad.wmnet with reason: Maintenance [14:40:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2218.codfw.wmnet with reason: Maintenance [14:40:52] (03CR) 10Vgutierrez: [C:03+1] "in the long term I'm guessing we should include wikimedia-ech.org as part of our unified cert and serve this from the CDN or as a one-off " [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:40:56] (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: move mostrevisions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [14:41:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance 
db1185', diff saved to https://phabricator.wikimedia.org/P75713 and previous config saved to /var/cache/conftool/dbconfig/20250430-144144-fceratto.json [14:41:52] (03CR) 10Ssingh: "Yes, that's correct. When we get to that, we should certainly include it there." [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:42:04] (03CR) 10Ssingh: [C:03+2] wikimedia-ech.org: update zone file and add A/AAAA records [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [14:42:11] !log sukhe@dns1004 START - running authdns-update [14:42:53] !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1140191|popup: Fix target user name for expired temporary account links (T393002)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:42:58] T393002: IPInfo: IPInfo popup for expired temporary accounts is not working - https://phabricator.wikimedia.org/T393002 [14:44:03] !log mszabo@deploy1003 mszabo: Continuing with sync [14:44:16] (03PS1) 10Bking: cirrussearch: fix typo in systemd timer resource [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) [14:44:27] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) (owner: 10Bking) [14:44:30] RESOLVED: Emergency syslog message: Device ssw1-e1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:44:41] !log sukhe@dns1004 END - running authdns-update [14:45:57] !log invoking `nodetool garbagecollect` on sessionstore2004 — T390514, T392989 [14:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:02] T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989 [14:47:10] (03CR) 10Bking: [C:03+2] opensearch: drop 
minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478 (owner: 10DCausse) [14:47:28] (03CR) 10Ebernhardson: [C:03+1] cirrussearch: fix typo in systemd timer resource [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) (owner: 10Bking) [14:47:30] (03PS1) 10Arturo Borrero Gonzalez: admin: temporarily remove ssh key for aborrero [puppet] - 10https://gerrit.wikimedia.org/r/1140200 [14:47:38] (03CR) 10Bking: [C:03+2] cirrussearch: fix typo in systemd timer resource [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) (owner: 10Bking) [14:48:06] (03CR) 10CI reject: [V:04-1] admin: temporarily remove ssh key for aborrero [puppet] - 10https://gerrit.wikimedia.org/r/1140200 (owner: 10Arturo Borrero Gonzalez) [14:48:16] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10780708 (10RobH) New remote hands entered to get this fixed: Case Order #01053614 > Directions for remote hands to repair our link between cr3 an... 
[14:48:20] (03CR) 10Hnowlan: [C:03+2] mw::maintenance: move mostrevisions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan) [14:49:12] (03PS2) 10Arturo Borrero Gonzalez: admin: temporarily remove ssh key for aborrero [puppet] - 10https://gerrit.wikimedia.org/r/1140200 [14:49:31] FIRING: Emergency syslog message: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [14:50:47] !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140191|popup: Fix target user name for expired temporary account links (T393002)]] (duration: 14m 22s) [14:50:52] T393002: IPInfo: IPInfo popup for expired temporary accounts is not working - https://phabricator.wikimedia.org/T393002 [14:51:50] (03PS2) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) [14:52:45] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:53:41] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:53:53] (03CR) 10Arnaudb: [C:03+1] "Done" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [14:53:58] (03PS4) 10Arnaudb: gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) [14:54:12] !log installing werkzeug security updates [14:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:31] RESOLVED: Emergency syslog message: Device lsw1-e1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message 
[14:55:01] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:55:12] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:55:25] hello, I'll be switching Gerrit over in 20min (15:15 UTC), operation should take a few minutes, apologies for the temporary unavailability, I'll post any relevant update here [14:56:19] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195 (owner: 10AikoChou) [14:56:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P75715 and previous config saved to /var/cache/conftool/dbconfig/20250430-145651-fceratto.json [14:57:17] FIRING: [25x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:43] FIRING: [31x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:58:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: Maintenance [14:59:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: Maintenance [14:59:33] !log invoking `nodetool garbagecollect` on sessionstore[2005-2006].codfw.wmnet,sessionstore[1005-1006].eqiad.wmnet — T390514, T392989 [14:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:38] T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - 
https://phabricator.wikimedia.org/T392989 [15:01:14] (03CR) 10JHathaway: "Yeah, I am not super happy with the less holistic approach in this patch. However, I don't think a confd change is likely, given the seman" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway) [15:01:23] !log installing postgresql-15 security updates [15:01:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:17] FIRING: [46x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:02:24] (03CR) 10Dzahn: [C:03+1] gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:03:21] (03CR) 10Vgutierrez: [C:04-1] haproxykafka: service unit brought by deb package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur) [15:03:42] FIRING: [54x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:03:47] (03CR) 10AikoChou: [C:03+2] ml-services: add edit-check-cpu isvc for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195 (owner: 10AikoChou) [15:05:41] (03Merged) 10jenkins-bot: ml-services: add edit-check-cpu isvc for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195 (owner: 10AikoChou) [15:05:44] (03CR) 10Dzahn: [C:03+1] gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:06:06] (03CR) 
10Herron: [C:03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1140135 (https://phabricator.wikimedia.org/T392994) (owner: 10Filippo Giunchedi) [15:06:43] jouncebot: nowandnext [15:06:44] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [15:06:44] In 1 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700) [15:07:04] (03CR) 10Hashar: [C:03+2] EnotifNotifyJob: Forward-compat for wmf.27 jobs [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński) [15:07:17] FIRING: [68x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:07:34] (03CR) 10Hashar: [C:03+2] "I am +2ing this now to get CI to kick. I will deploy it after Gerrit has been switched over to another server." 
[core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński) [15:08:42] FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:08:42] FIRING: [70x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:36] (03Merged) 10jenkins-bot: EnotifNotifyJob: Forward-compat for wmf.27 jobs [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński) [15:11:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T392806)', diff saved to https://phabricator.wikimedia.org/P75717 and previous config saved to /var/cache/conftool/dbconfig/20250430-151158-fceratto.json [15:12:16] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance [15:12:17] FIRING: [91x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:12:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T392806)', diff saved to https://phabricator.wikimedia.org/P75718 and previous config saved to /var/cache/conftool/dbconfig/20250430-151222-fceratto.json [15:12:50] (03CR) 10Muehlenhoff: [C:03+1] "Ack, thanks for the additional 
context" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway) [15:14:08] Will start gerrit switchover in 1 min [15:15:11] * arnaudb starts [15:15:17] (03CR) 10Arnaudb: [C:03+2] gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:15:30] (03CR) 10Arnaudb: [C:03+2] gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:15:49] !log arnaudb@dns1004 START - running authdns-update [15:16:32] !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit2002.wikimedia.org to gerrit1003.wikimedia.org [15:17:17] FIRING: [115x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:18:32] 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10780806 (10MatthewVernon) Two thoughts - first, sorry, I was rebooting all the things today because of T392804 which //shouldn't//... 
[15:18:43] FIRING: [118x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:36] FIRING: Emergency syslog message: Alert for device lsw1-e3-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:20:06] !log Removed @joaquin (former staff) from https://www.npmjs.com/settings/wikimedia/members [15:20:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T392806)', diff saved to https://phabricator.wikimedia.org/P75719 and previous config saved to /var/cache/conftool/dbconfig/20250430-152007-fceratto.json [15:20:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:24] !log installing ucf security updates [15:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:59] !log Removed @nrayio (former staff [[User:NRay (WMF)]]) from https://www.npmjs.com/settings/wikimedia/members [15:21:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:49] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 
5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:21:54] !log gerrit failover in progress [15:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:17] FIRING: [141x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:22:45] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:23:33] Is gerrit outage planned? [15:23:43] FIRING: [147x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:24:26] HouseOfM: yes, it is planned [15:24:36] RESOLVED: Emergency syslog message: Device lsw1-e3-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:24:43] mutante: Thanks :) [15:25:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:14] any alert relating to "something git pull" is indirect alerting about gerrit failover [15:27:17] FIRING: [153x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:42] FIRING: 
[5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:13] !log arnaudb@dns1004 END - running authdns-update [15:29:26] arnaudb@cumin1002 failover (PID 963080) is awaiting input [15:30:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:30:45] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10780837 (10Jgreen) 05Open→03Resolved >>! In T392579#10777103, @Jhancock.wm wrote: > @Jgreen reseated all the connections to the backplane. server came up. I checked... 
[15:33:18] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10780879 (10MoritzMuehlenhoff) [15:33:45] puppet agent running on hosts, ETA 3 to 5 minutes [15:34:36] FIRING: Emergency syslog message: Alert for device ssw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:35:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P75720 and previous config saved to /var/cache/conftool/dbconfig/20250430-153516-fceratto.json [15:36:45] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:37:30] arnaudb@cumin1002 failover (PID 963080) is awaiting input [15:37:35] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:38:39] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.gerrit.failover (exit_code=97) from gerrit2002.wikimedia.org to gerrit1003.wikimedia.org [15:38:42] RESOLVED: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:39:36] FIRING: [2x] Emergency syslog message: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:40:25] RESOLVED: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[15:43:25] jouncebot: refresh [15:43:26] I refreshed my knowledge about deployments. [15:43:28] jouncebot: nowandnext [15:43:28] No deployments scheduled for the next 1 hour(s) and 16 minute(s) [15:43:28] In 1 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700) [15:43:32] I will deploy I'll merge https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1140184 [15:43:38] for the train [15:44:36] RESOLVED: Emergency syslog message: Device lsw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [15:50:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P75722 and previous config saved to /var/cache/conftool/dbconfig/20250430-155023-fceratto.json [15:50:31] (03PS1) 10Mforns: Add file and filetypes tables to the mediawiki-not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800) [15:52:14] (03CR) 10Bking: "Adding Reuven to list of reviewers per IRC conversation. This is a pretty old patch, and we wanna make sure we don't end up breaking envoy" [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson) [15:53:21] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [15:53:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [15:53:47] (03PS1) 10Elukey: icinga: skip services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) [15:55:09] (03CR) 10Elukey: "I'll wait for Riccardo to review this but the basic functionality should be there!" 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [15:56:29] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10780975 (10elukey) I filed https://gerrit.wikimedia.org/r/c/operati... [15:58:58] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [15:59:58] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:02:47] !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:02:52] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:02:52] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:04:11] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10781006 (10ArthurPSmith) Since it's well after 10:00 UTC I gave it a try - problem is... 
[16:04:36] FIRING: Emergency syslog message: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:05:21] (03CR) 10CI reject: [V:04-1] icinga: skip services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [16:05:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T392806)', diff saved to https://phabricator.wikimedia.org/P75724 and previous config saved to /var/cache/conftool/dbconfig/20250430-160530-fceratto.json [16:05:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance [16:05:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T392806)', diff saved to https://phabricator.wikimedia.org/P75725 and previous config saved to /var/cache/conftool/dbconfig/20250430-160556-fceratto.json [16:07:35] (03PS7) 10Krinkle: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [16:07:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:07:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [16:09:24] (03Merged) 10jenkins-bot: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:09:26] (03Merged) 10jenkins-bot: missing.php: Check for 
auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [16:09:36] RESOLVED: Emergency syslog message: Device lsw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:10:09] (03PS1) 10Hnowlan: mw::maintenance: migrate mostlinked job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140214 (https://phabricator.wikimedia.org/T388534) [16:11:08] Gerrit is working properly, _but_ we had a slight hiccup on replication where we'll have to dig a bit further to see why we have trouble detecting "replica status" efficiently after the switchover. Consequentially, puppet-agent has been disabled on gerrit2002 (the replica), to ensure it stays in a consistent state until we figure this situation [16:11:08] out. Anyway, thank you all for your patience [16:11:19] hashar: "16:09:31 The following are unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.44.0-wmf.25:" [16:11:37] Oops I see your comment now there, "I am +2ing this now to get CI to kick. I will deploy it after Gerrit has been switched over to another server." [16:11:58] yes [16:12:03] I'll roll it out now [16:12:04] sorry I am going to deploy it rightn ow [16:12:10] I was waiting for the Gerrit maintenance to be completed [16:12:19] arnaudb: make sure to `!log` it :) [16:12:25] I have a backport command running with two config patches [16:12:28] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]] [16:12:30] oh you're right! 
sorry [16:12:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T392806)', diff saved to https://phabricator.wikimedia.org/P75726 and previous config saved to /var/cache/conftool/dbconfig/20250430-161234-fceratto.json [16:12:36] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [16:12:36] T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994 [16:12:36] !log Gerrit maintenance over [16:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:51] oh there is another backport [16:12:57] Krinkle: you are deploying right? [16:13:14] I pressed "y" to the unexpected commit yes [16:13:23] I don't know what happens if I press No. [16:13:41] we roll back to last known good version: the perl based wiki software [16:13:43] does it undo that patch or exclude it? or does it abort everything and prompt the next person? or does it forget after one person sees the prompt? [16:13:54] I don't know to be fair. I guess it will just stop there [16:14:05] leave the unexpected commit in place for human to investiate [16:14:27] until I guess someone pull the patch [16:14:38] well I don't know [16:14:44] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10781063 (10MMiller_WMF) I am Madalina's manager and I approve her access to this data and these tools. [16:14:56] hm.. well that depends on how it discovers it. it's not obvious to me that once it pulls it down it will know next time that it is still new/undeployed. [16:15:22] This could use better documentation and/or explicit prompt what it will do. [16:16:05] If you answer no, the backport is cancelled. 
[16:16:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:17:19] And, indeed, if you run another operation after that, the new operation won't re-complain about the prior unexpected commit. [16:18:37] if it uses git fetch and git rebase (which the manual steps used to recommend, instead of git pull) then it would presumably be able to discovre it next time as well. [16:18:43] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:18:53] !log krinkle@deploy1003 krinkle, pppery: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [16:18:56] i.e. 
it will never have applied it to mediawiki-staging until after saying 'y' yes [16:18:57] (03PS1) 10Hnowlan: mw::maintenance: migrate all remaining general updatequerypages jobs [puppet] - 10https://gerrit.wikimedia.org/r/1140216 (https://phabricator.wikimedia.org/T388534) [16:18:59] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [16:19:00] T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994 [16:19:31] (03CR) 10VolkerE: [C:03+1] missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:19:36] FIRING: Emergency syslog message: Alert for device lsw1-f3-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:19:39] dancy: assuming it doesn't work that way, how does it work? I guess based on the security patch logic, it rebuilds the tree somewhere in a temporary space, but then how does it find that something is new? [16:21:09] (03PS1) 10AOkoth: aphlict: revert eqiad host to active [puppet] - 10https://gerrit.wikimedia.org/r/1140217 (https://phabricator.wikimedia.org/T392128) [16:22:18] !log krinkle@deploy1003 krinkle, pppery: Continuing with sync [16:23:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [16:23:32] Krinkle: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/backport.py?ref_type=heads#L1024 is the code that does the checking. Looks like it still does use `git fetch`, so I think we're still good. 
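The exchange above turns on one point: a check that runs after `git fetch` compares commits that were downloaded but not yet applied to the checkout, so it can be repeated on the next run. Below is a minimal sketch of that comparison. It is not scap's code; the function names and the commit ids are hypothetical, and the input is assumed to be the text output of `git rev-list HEAD..origin/<branch>`.

```python
def pending_commits(rev_list_output):
    """Split `git rev-list HEAD..origin/<branch>` output into commit ids.

    Assumption: one commit id per line, newest first, blank lines ignored.
    """
    return [line.strip() for line in rev_list_output.splitlines() if line.strip()]


def unexpected_commits(pending, expected):
    """Return the pending commits the operator did not ask to deploy.

    `pending` is the list from pending_commits(); `expected` is any iterable
    of commit ids belonging to the requested backports. Order is preserved.
    """
    expected = set(expected)
    return [sha for sha in pending if sha not in expected]


# Hypothetical ids: two requested backports plus one commit nobody listed.
pending = pending_commits("aaa111\nbbb222\n\nccc333\n")
extra = unexpected_commits(pending, ["aaa111", "bbb222"])
if extra:
    print("unexpected commits pulled from origin:", ", ".join(extra))
```

Because fetching does not move `HEAD`, answering "no" would leave the same commits pending, and a later run of this comparison would report them again unless they are passed in as expected.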
[16:23:50] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [16:24:36] RESOLVED: Emergency syslog message: Device lsw1-f3-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:24:49] dancy: hm.. so maybe it will prompt the next person as well! [16:24:56] (03PS1) 10AOkoth: wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) [16:25:06] Yes, I would expect so after reevaluating. [16:25:09] unless they pass that previous-unexpected change to scap-backport as argument, I guess. [16:25:15] Right [16:25:29] I might try that sometime. [16:26:11] (03CR) 10Hashar: [C:03+2] "There was a Gerrit maintenance that started immediately after the patch got merged. It is now being deployed by Timo as part of another d" [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński) [16:26:56] 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10781132 (10Papaul) [16:27:34] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029 (10RobH) 03NEW [16:27:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P75727 and previous config saved to /var/cache/conftool/dbconfig/20250430-162741-fceratto.json [16:27:53] (03PS1) 10Lucas Werkmeister (WMDE): Temporarily remove lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1140219 [16:28:04] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10781146 (10RobH) a:03BTullis Please note the workflow for racking tasks has changed this 
fiscal year, and we now require the puppet updates from the sub-team receiving the new se... [16:28:09] Krinkle: I'll also test in train-dev later and let you know what the results are. [16:28:17] ack [16:28:21] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10781155 (10RobH) [16:28:55] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]] (duration: 16m 26s) [16:29:01] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [16:29:01] T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994 [16:32:36] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030 (10RobH) 03NEW [16:32:59] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10781214 (10RobH) a:03BTullis Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the ne... [16:34:56] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:35:33] (03CR) 10Hashar: [C:03+1] "Thank you for using the `primary` / `replica` semantic. 
The license looks good to me, at least it is not re-licensing to Apache 2.0 :]" [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn) [16:35:54] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [16:38:16] Krinkle: let me know when the patch are deployed and I will proceed with the train [16:38:25] hashar: it's done. [16:38:31] awesome [16:39:12] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10781244 (10RobH) [16:41:31] (03CR) 10Tiziano Fogli: [C:03+2] admin: temporarily remove dcaro access [puppet] - 10https://gerrit.wikimedia.org/r/1140181 (https://phabricator.wikimedia.org/T393000) (owner: 10David Caro) [16:41:47] (03CR) 10Tiziano Fogli: [C:03+2] admin: Temporarily remove Taavi's access [puppet] - 10https://gerrit.wikimedia.org/r/1140171 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah) [16:42:01] (03CR) 10Tiziano Fogli: [C:03+2] admin: move jiji to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) (owner: 10Effie Mouzeli) [16:42:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P75728 and previous config saved to /var/cache/conftool/dbconfig/20250430-164248-fceratto.json [16:42:58] I am runinng the train [16:43:21] (03PS4) 10TrainBranchBot: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:43:21] (03CR) 10TrainBranchBot: [C:03+2] missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:43:21] (03PS1) 10TrainBranchBot: 
group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) [16:43:22] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [16:44:13] that's weird. https://gerrit.wikimedia.org/r/1138922 was already deployed? [16:44:58] (03CR) 10Bking: [C:03+2] refactor(opensearch): use Netbox to get rack / row information [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel) [16:45:22] hashar: oh, gerrit lost some events during hte switch I guess? [16:45:32] hmmm maybe [16:45:36] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138508 is back to how it was an hour ago [16:45:39] missing the latest rebase and merge [16:45:57] but it is merged? [16:45:57] 06SRE, 06serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10781287 (10Dzahn) imho we should have something that effectively notifies a team (automatic task, email) so next time we don't need to rely on manually created tickets by users [16:45:58] what does that mean for the underlying git-repo? [16:46:12] (03Merged) 10jenkins-bot: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [16:46:14] (03Merged) 10jenkins-bot: group1 to 1.44.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: 10TrainBranchBot) [16:46:33] it was merged yes, but then the server switched to a version that is in the past [16:46:39] so now prod and gerrit are forked [16:46:48] my local also deviates in its remote reflection [16:46:51] holy shit [16:47:26] yikes [16:47:27] was gerrit meant to be in read-only mode during this maintenance? 
Or was it meant to catch up afterward? [16:47:28] mutante: arnaudb: sobanski: thcipriani: so looks like the Gerrit switchover caused repos to roll back in time [16:48:09] hashar: blarg. [16:48:14] both git repos and gerrit db are back in time, i.e. comments and votes also missing [16:49:22] this has happened before. The last time it happened we didn't have the --delete flag for rsync which caused git to look at loose refs rather than packed refs [16:49:36] FIRING: Emergency syslog message: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:49:46] that is why I was asking why we did an rsync of the git repos :) [16:49:48] but then [16:49:49] Krinkle: do you have an example for investigation? [16:50:01] if a change is merged on the primary, it should be replicated to the secondary [16:50:06] err to the replica [16:50:10] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138508 [16:50:19] thanks [16:50:20] this change was rebased by me, and then +2'ed/merged by train bot [16:50:32] irc backscroll has the receipts of these events [16:50:51] apparently part of the database says it is merged i.e. the relation chain [16:51:10] but Gitiles and the change page are back in time [16:51:49] we also have some kind of audit trail in /var/log/zuul/zuul.log.* on contint1002.wikimedia.org [16:52:17] confirmed [16:52:25] well wait [16:52:33] there is no gate-and-submit there [16:52:50] zuul.log.2025-04-24:2025-04-24 20:51:37,503 INFO zuul.IndependentPipelineManager: Reporting item in test-prio>, actions: [] [16:53:07] which matches the last comment on Gerrit [16:53:24] GitHub mirrors have also stopped updating [16:53:24] somewhere above I see: Change depends on changes [, ] [16:54:04] and that is about it [16:54:17] so I don't think that change ever got a +2 or a merge. 
At least according to CI logs [16:54:36] RESOLVED: Emergency syslog message: Device lsw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:54:39] ok, now confirmed: git for-each-ref shows fad64230b40174d5f90f2095e1eb0f8561c96421 commit refs/changes/08/1138508/meta same as refs/changes/08/1138508/meta file on disk that is from 2025-04-24 [16:54:43] (03CR) 10BCornwall: [C:03+1] wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) (owner: 10AOkoth) [16:55:10] meanwhile grep 1138508 packed-refs -> 90167f46357593f19e0a5ad8fea8469b0a66a018 refs/changes/08/1138508/meta [16:55:15] and that's a change from today [16:56:09] this is the same as: https://phabricator.wikimedia.org/T236114 [16:56:45] :-( [16:56:51] * hashar has PTSD [16:56:54] :B [16:57:25] the problem with that one is it went on for a day before we found the issue, this has been going on an hour. If we stop everything now we could lose an hour of work. [16:57:39] or at least we'd have less to correct if we can correct it easily [16:57:49] ^ arnaudb mutante [16:57:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T392806)', diff saved to https://phabricator.wikimedia.org/P75729 and previous config saved to /var/cache/conftool/dbconfig/20250430-165754-fceratto.json [16:58:05] so we stop Gerrit to prevent further diverting? 
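The manual check above compared `git for-each-ref` (which resolved the stale loose ref file) against `grep ... packed-refs` (which held today's commit). Git prefers a loose ref file over the entry of the same name in `packed-refs`, so a leftover loose file hides the packed value. The sketch below automates that comparison for a bare repository. It is an illustration only: the function names are made up here, and a mismatch is only a candidate for inspection, since in a healthy repository a loose ref is normally the newer of the two.

```python
from pathlib import Path


def read_packed_refs(git_dir):
    """Parse <git_dir>/packed-refs into {refname: sha}."""
    refs = {}
    packed = Path(git_dir) / "packed-refs"
    if not packed.exists():
        return refs
    for line in packed.read_text().splitlines():
        # Skip the header comment, blank lines and peeled-tag lines ("^sha").
        if not line or line.startswith("#") or line.startswith("^"):
            continue
        sha, _, name = line.partition(" ")
        refs[name] = sha
    return refs


def read_loose_refs(git_dir):
    """Collect loose ref files under <git_dir>/refs into {refname: sha}."""
    git_dir = Path(git_dir)
    refs = {}
    base = git_dir / "refs"
    if not base.is_dir():
        return refs
    for path in base.rglob("*"):
        if not path.is_file():
            continue
        content = path.read_text().strip()
        if content.startswith("ref:"):
            continue  # symbolic ref, not a commit id
        refs[path.relative_to(git_dir).as_posix()] = content
    return refs


def shadowed_refs(git_dir):
    """Return {refname: (loose_sha, packed_sha)} where the two disagree."""
    packed = read_packed_refs(git_dir)
    loose = read_loose_refs(git_dir)
    return {
        name: (loose[name], packed[name])
        for name in sorted(loose.keys() & packed.keys())
        if loose[name] != packed[name]
    }


# Usage, with the repository path mentioned in this log:
#   shadowed_refs("/srv/gerrit/git/operations/mediawiki-config.git")
```

For each name it returns, the commit dates of both ids still have to be compared by hand, as was done above for refs/changes/08/1138508/meta.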
[16:58:15] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance [16:59:26] Catching up on the scroll back [16:59:45] the thing I don't get is Timo claims 1138508 got merged but I don't see those events in the Zuul logs [16:59:55] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise and row/rack awareness T391392 T390100 - bking@cumin2002 - T390100 [17:00:01] T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392 [17:00:01] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700) [17:00:18] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery) [17:00:19] (Merged) jenkins-bot: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery) [17:00:24] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]] [17:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:29] T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114 [17:00:30] but I do see the backports in https://sal.toolforge.org/production?p=0&q=1138508&d= [17:00:30] T391994: Auth.wikimedia.org displays wrong message about host 
header - https://phabricator.wikimedia.org/T391994 [17:00:33] ah ok [17:00:44] * hashar digs Zuul logs more [17:01:06] (03PS1) 10AOkoth: vrts: add junk queue count and remove mobile queue [puppet] - 10https://gerrit.wikimedia.org/r/1140207 [17:01:17] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10781355 (10ArthurPSmith) Hmm, it seems to have resolved now. Maybe I'll try another o... [17:01:18] 1138508,7> in gate-and-submit>, actions: [] [17:01:22] coming back [17:01:34] if you do: sudo -u gerrit2 git log 90167f46357593f19e0a5ad8fea8469b0a66a018 [17:01:39] because I was grepping zuul.log.2025-04* and not zuul.log [17:01:49] in /srv/gerrit/git/operations/mediawiki-config.git on gerrit1003 [17:01:53] Krinkle: data loss confirmed, thank you :) [17:02:03] you can see "Change has been successfully rebased and submitted" [17:02:06] for that change [17:02:17] thcipriani: do we shut down Gerrit right now? [17:02:24] arnaudb: can you confirm if the rsync we ran had the --delete flag? 
[17:02:45] if not, yes, we should shut down gerrit [17:03:50] I'd say let's shut down anyway and we can dig into it afterwards [17:04:00] ^ sounds good, let's do it [17:04:19] and disable Puppet [17:04:34] yes thcipriani [17:04:38] confirmed [17:04:53] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:05:11] both instances are shutting down [17:05:24] we'll investigate from here, lets jump back on the call we were in if you want to [17:05:29] sounds good [17:05:31] cause Puppet will bring the systemd unit back up [17:05:39] there is some ensure => running [17:05:48] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:09:25] FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:10:42] FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:13:19] !log gerrit incident following switchover https://phabricator.wikimedia.org/T393034 [17:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:14:42] We have shutdown Gerrit to prevent further issues [17:15:42] FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:16:31] Krinkle: funnily I have a browser tab that shows the parent change and it shows the change you mentioned as merged [17:16:36] so I did not even had to look in the logs [17:19:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:00] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply [17:22:21] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply [17:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [17:24:38] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:27:51] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise and row/rack awareness T391392 T390100 - bking@cumin2002 - T390100 [17:27:57] T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392 [17:27:57] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 
[17:31:57] !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-f1-codfw.mgmt.codfw.wmnet [17:32:00] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [17:32:57] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise pkg as I forgot last time T390100 - bking@cumin2002 - T390100 [17:33:01] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [17:34:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:34:34] on-calls are standing by if we can help. [17:34:36] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:39:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:42:53] thanks sukhe investigation is still in progress to figure out the root cause [17:43:02] feel free to highlight me directly for live update [17:43:06] 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10781527 (10ArthurPSmith) Nope - new property frozen also as soon as I added an exampl... 
[17:46:15] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 10Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10781572 (10Tgr) MediaWiki version: {T393038} [17:49:15] 10ops-codfw, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042 (10RobH) 03NEW [17:49:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:49:36] 10ops-codfw, 06DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10781610 (10RobH) [17:50:50] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10781616 (10wiki_willy) Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh... [17:53:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:54:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:56:12] where can one follow along the gerrit incident? 
the usual IRC channels I'd guessed are all relatively silent [17:57:00] mping you the meet room [17:57:26] https://docs.google.com/document/d/1kh6vYGLdGIEpN-EsUaXb6u82gNW5TvBkoI_yCPjB6_8/edit?tab=t.0 [17:57:35] here is the doc [17:57:48] we're still investigating around the root cause [18:00:04] hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1800) [18:00:19] the train is blocked [18:00:27] due to Gerrit being frozen [18:00:35] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet [18:00:37] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise pkg as I forgot last time T390100 - bking@cumin2002 - T390100 [18:00:45] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [18:02:40] ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044 (RobH) NEW [18:03:01] ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10781683 (RobH) [18:07:22] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet [18:14:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:15:38] Small update: we're still narrowing down what happened to make sure service interruption won't occur again after it is considered fixed. 
[18:16:39] ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045 (RobH) NEW [18:17:00] ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10781718 (RobH) [18:18:19] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10781722 (Madalina) @tappof I had an access is denied error before. Everything seems ok now, thank you! [18:19:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:31:49] ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10781750 (RobH) [18:33:34] What happened to Gerrit? 
[18:33:44] there is an incident in progress [18:34:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:34:40] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:34:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:02] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10781758 (10RobH) Please note we have two open procurement requests for this host. Please do NOT discuss pric... 
[18:39:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:48:48] FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:49:43] ^ a bunch of stuff is failing because of the Gerrit thing
[18:49:52] https://puppetboard.wikimedia.org/nodes?status=failed
[18:50:39] so nothing to worry as such, it's expected
[18:50:56] Yes, I've silenced alerts for the o11y hosts.
[18:51:01] thanks!
[18:51:07] sukhe: Do you think I should extend the silence for all hosts?
[18:51:18] Puppet is going to fail on many hosts that can't communicate with Gerrit.
[18:51:31] denisse: I would say no I think, since they are not paging plus we miss some other related alerts in case we don't remove the silence
[18:51:46] if something pages (I doubt anything does?) we can do it
[18:52:00] SGTM, yes, the alert is not paging.
[18:52:52] I only silenced the Puppet Failure alerts for the o11y hosts.
[18:53:15] yeah
[18:53:48] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:39] ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10781822 (wiki_willy) Thanks @tappof, that sounds good! >>! In T387231#10780327, @tappof wrote: > @wiki_willy, please take...
[19:06:40] pt1979@cumin2002 provision (PID 4091458) is awaiting input [19:14:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:16:58] PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit [19:18:31] FIRING: [4x] ProbeDown: Service gerrit2002:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:19:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:19:44] !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit[1003,2002-2003].wikimedia.org with reason: Debugging [19:22:13] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10781850 (10ABran-WMF) [19:28:11] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1015.eqiad.wmnet with OS bullseye [19:28:22] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye 
[19:28:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:31:00] 10SRE-swift-storage, 06Commons: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10781886 (10Pppery) [19:43:42] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1015.eqiad.wmnet with reason: host reimage [19:47:07] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1015.eqiad.wmnet with reason: host reimage [19:51:26] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1016.eqiad.wmnet with OS bullseye [19:51:36] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1016.eqiad.wmnet with OS bullseye [19:52:06] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781922 (10VRiley-WMF) [19:57:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-f1-codfw.mgmt.codfw.wmnet [20:00:04] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T2000). [20:00:04] _Gerges: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:56] <_Gerges> Here
[20:03:14] i believe the current gerrit outage means that the window is cancelled
[20:03:56] yes sorry we can not deploy currently
[20:04:20] we had some data issue with our Gerrit instance and operations/mediawiki-config has been hit
[20:04:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:27] _Gerges: ^
[20:04:54] _Gerges: it is better to schedule later. I am not sure whether many people will be around tomorrow though due to May 1st
[20:05:26] <_Gerges> OK
[20:07:19] !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1016.eqiad.wmnet with reason: host reimage
[20:09:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:10:20] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:10:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1016.eqiad.wmnet with reason: host reimage
[20:11:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:11:04] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1015.eqiad.wmnet with OS bullseye
[20:11:11] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781949 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye completed: - ms-fe1015 (**WARN**... [20:14:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:43] FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:19:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:24:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:28:47] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:28:48] FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:29:25] FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:30:39] PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [20:30:42] RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:31:52] vriley@cumin1002 reimage (PID 1240354) is awaiting input [20:33:08] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002" [20:33:09] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1016.eqiad.wmnet with OS bullseye [20:33:15] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781993 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-fe1016.eqiad.wmnet with OS bullseye completed: - ms-fe1016 (**PASS**... 
[20:33:48] RESOLVED: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:34:25] RESOLVED: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:36:17] ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781997 (VRiley-WMF) Open→Resolved This is complete.
[20:43:19] (PS1) Bvibber: Fix localization for validation errors checking tabular data [extensions/Chart] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140228 (https://phabricator.wikimedia.org/T389126)
[20:43:45] ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in codfw - https://phabricator.wikimedia.org/T393053 (RobH) NEW
[20:44:17] (PS1) Bvibber: Check for content validity before extracting license [extensions/JsonConfig] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140229 (https://phabricator.wikimedia.org/T389125)
[20:45:37] ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in codfw - https://phabricator.wikimedia.org/T393053#10782038 (RobH) a:akosiaris Alex, We didn't get racking details on the ordering task T392715, so we need to get them from you before the hosts arrive. Please populate the task de...
[20:45:40] ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in codfw - https://phabricator.wikimedia.org/T393053#10782042 (RobH)
[20:47:38] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/JsonConfig] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140229 (https://phabricator.wikimedia.org/T389125) (owner: Bvibber)
[20:47:55] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Chart] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140228 (https://phabricator.wikimedia.org/T389126) (owner: Bvibber)
[20:52:29] ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-eqiad - https://phabricator.wikimedia.org/T393053#10782066 (RobH)
[20:53:15] ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054 (RobH) NEW
[20:53:19] ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054#10782084 (RobH)
[20:53:57] ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054#10782085 (RobH) a:akosiaris Alex, We didn't get racking details on the ordering task T392714, so we need to get them from you before the hosts arrive. Please populate the tas...
[21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T2100) [21:11:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new knn plugin - bking@cumin2002 - T390100 [21:11:12] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new knn plugin - bking@cumin2002 - T390100 [21:11:15] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100 [21:12:20] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100 [21:14:39] (03PS7) 10Hashar: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery) [21:15:12] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100 [21:15:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100 [21:16:15] (03PS1) 10Hashar: Review access change [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1140241 [21:16:54] (03PS2) 10Hashar: Allow force push to reconstruct repo [mediawiki-config] (refs/meta/config) - 10https://gerrit.wikimedia.org/r/1140241 
(https://phabricator.wikimedia.org/T393034)
[21:17:04] (CR) Hashar: [V:+2 C:+2] Allow force push to reconstruct repo [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140241 (https://phabricator.wikimedia.org/T393034) (owner: Hashar)
[21:18:51] (PS1) Hashar: Revert "Allow force push to reconstruct repo" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140242 (https://phabricator.wikimedia.org/T393034)
[21:18:52] (CR) Pppery: "(Noting for the record: this change was approved and deployed by Krinkle using scap backport about 5 hours ago, however the data about tha" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery)
[21:19:01] (CR) Hashar: [V:+2 C:+2] Revert "Allow force push to reconstruct repo" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140242 (https://phabricator.wikimedia.org/T393034) (owner: Hashar)
[21:22:08] (CR) Hashar: "Due to a split brain between Gerrit instances (T393034) this commit was merged against a wrong version of the branch but has never been de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[21:22:47] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:23:27] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10782143 (RobH) a:RobH→cmooney @cmooney : > "Created by: mmariscalmata The following has been completed: > > Retrieve package #1...
[21:27:07] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10782147 (cmooney) Thanks @RobH. It looks good so far, this is the graph we need to keep an eye on: https://grafana.wikimedia.org/goto/SVEEkIbHR...
[21:27:28] !log Deployment server: reseted /srv/mediawiki-staging to 7a3327588 / https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138508 # T393034
[21:27:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:33] T393034: Investigate out of date refs following gerrit switchover - https://phabricator.wikimedia.org/T393034
[21:28:06] the other deployment server might need a sync
[21:29:07] (PS1) Hashar: (DO NOT SUBMIT) test CI [mediawiki-config] - https://gerrit.wikimedia.org/r/1140247
[21:30:55] so zuul-merger seems happy hopefully
[21:31:59] (Abandoned) Hashar: (DO NOT SUBMIT) test CI [mediawiki-config] - https://gerrit.wikimedia.org/r/1140247 (owner: Hashar)
[21:36:23] (PS1) Dzahn: gerrit: remove gerrit2002 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250
[21:37:03] (CR) Thcipriani: [C:+1] gerrit: remove gerrit2002 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (owner: Dzahn)
[21:37:50] (PS2) Dzahn: gerrit: remove gerrit2002 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (https://phabricator.wikimedia.org/T236114)
[21:38:55] (PS3) Dzahn: gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (https://phabricator.wikimedia.org/T236114)
[21:39:11] (CR) Dzahn: [C:+2] gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (https://phabricator.wikimedia.org/T236114) (owner: Dzahn)
[21:39:17] (CR) Cwhite: [C:+2] statsd: remove ferm rule for statsd port 8125 [puppet] 
- https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[21:40:05] cwhite: we have a merge conflict
[21:40:09] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100
[21:40:15] T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[21:40:27] uh oh
[21:40:52] can you merge both?
[21:40:57] or just yours? either is fine
[21:41:06] mine just completed
[21:41:15] cool, I see it. thanks
[21:41:24] <3
[21:46:57] PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:47:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:48:59] RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[21:51:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:52:08] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:52:19] (03CR) 10Umherirrender: "recheck after gerrit failover" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey) [21:52:24] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:52:30] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:52:35] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:52:46] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:53:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:53:07] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:53:31] RESOLVED: [2x] ProbeDown: Service gerrit2002:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit2002:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:53:42] FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:57:03] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:57:08] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:57:19] RESOLVED: [2x] 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:57:30] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:57:35] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:57:46] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:57:57] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:58:07] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from 
the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:58:17] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T2200) [22:01:58] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:02:09] RESOLVED: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:18:46] (03CR) 10BCornwall: [C:03+2] Temporarily remove lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1140219 (owner: 10Lucas Werkmeister (WMDE)) [22:18:49] (03PS1) 10Dzahn: Revert "gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts" [puppet] - 10https://gerrit.wikimedia.org/r/1140251 [22:35:27] FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:12] SRE, SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T393066 (SCampos-WMF) NEW
[23:16:25] RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:43] FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:05] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1140260
[23:41:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1140260 (owner: TrainBranchBot)
[23:51:49] (CR) Scott French: [C:+1] mw::maintenance: migrate mostlinked job to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1140214 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[23:52:28] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1140260 (owner: TrainBranchBot)
[23:53:14] (CR) Scott French: [C:+1] mw::maintenance: migrate all remaining general updatequerypages jobs [puppet] - https://gerrit.wikimedia.org/r/1140216 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)