[00:01:21] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[00:03:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:42] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[00:05:39] <logmsgbot>	 !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139577|enwiki and commons: Increase revision-slots cache expiry again (T183490)]] (duration: 13m 45s)
[00:05:44] <stashbot>	 T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490
[00:06:31] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2241.codfw.wmnet with reason: Maintenance
[00:06:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db[2242-2243].codfw.wmnet with reason: Maintenance
[00:10:24] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 614.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[01:03:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:03:42] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[01:04:37] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968 (phaultfinder) NEW
[01:19:37] <wikibugs>	 ops-eqiad, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T392968#10778723 (phaultfinder)
[01:22:46] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[01:50:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:52:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[01:58:42] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:20] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:00:16] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:02:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:50] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:08:24] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[02:27:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:29:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:40:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:41:50] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:49:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[02:51:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:02:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web-next_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:50] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:18:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[03:35:20] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:36:16] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:59:56] <wikibugs>	 (CR) Arnaudb: [C:+1] "looks good to me!" [puppet] - https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: Dzahn)
[04:14:31] <wikibugs>	 ops-codfw, SRE, DC-Ops: codfw:expansion: Console/management  wiring - https://phabricator.wikimedia.org/T382383#10778823 (Papaul) Open→Resolved This is complete
[04:15:03] <wikibugs>	 (PS1) Papaul: Add new PDU's in row E and F [puppet] - https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504)
[04:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[04:31:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:32:52] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:47:20] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:48:16] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:03:16] <icinga-wm>	 PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 129, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:03:44] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:51] <jinxer-wm>	 FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:06:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:09:25] <wikibugs>	 (PS2) Arnaudb: gerrit: switchover to gerrit1003 [dns] - https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833)
[05:09:25] <wikibugs>	 (CR) Arnaudb: "I had to rebase locally due to merge conflicts, lmk if you spot anything weird" [dns] - https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:12:56] <wikibugs>	 (CR) Arnaudb: "nitpick comment added" [puppet] - https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: Arnaudb)
[05:22:46] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[05:25:03] <jinxer-wm>	 FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[05:27:04] <wikibugs>	 (PS1) Arnaudb: gerrit: failover bugfix [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666)
[05:27:04] <wikibugs>	 (CR) Arnaudb: "Prepping for today's switchover I stumbled upon this error" [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[05:29:44] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:03] <jinxer-wm>	 RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[05:30:16] <icinga-wm>	 RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 130, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:34:51] <jinxer-wm>	 RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down  - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:49:11] <kart_>	 Deploying MinT on the staging.
[05:51:41] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[05:58:22] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10778904 (VRiley-WMF) Created a Dell service request for this Service Request 209252181.
[05:58:42] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:59:28] <wikibugs>	 (PS1) Marostegui: wmnet: Switchover m1-master [dns] - https://gerrit.wikimedia.org/r/1139983 (https://phabricator.wikimedia.org/T392806)
[06:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0600)
[06:00:44] <marostegui>	 !log Failover m1-master T392806
[06:00:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:01:09] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Switchover m1-master [dns] - https://gerrit.wikimedia.org/r/1139983 (https://phabricator.wikimedia.org/T392806) (owner: Marostegui)
[06:01:13] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[06:01:27] <wikibugs>	 (CR) Ayounsi: [C:+2] Fastnetmon: bump threshold_pps to 1.75M [puppet] - https://gerrit.wikimedia.org/r/1139503 (owner: Ayounsi)
[06:01:42] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:03:43] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[06:11:48] <logmsgbot>	 !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:13:33] <XioNoX>	 !log magru: remove novaacore/momentum
[06:13:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:14:11] <wikibugs>	 (CR) Ayounsi: [C:+2] magru: remove novaacore/momentum [homer/public] - https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) (owner: Ayounsi)
[06:14:45] <wikibugs>	 (Merged) jenkins-bot: magru: remove novaacore/momentum [homer/public] - https://gerrit.wikimedia.org/r/1136152 (https://phabricator.wikimedia.org/T381913) (owner: Ayounsi)
[06:15:06] <wikibugs>	 (PS1) KartikMistry: cxserver: Use URL instead of mw.Uri [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241)
[06:28:49] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10778959 (VRiley-WMF)
[06:31:14] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10778961 (VRiley-WMF)
[06:31:30] <wikibugs>	 ops-eqiad, SRE, DC-Ops, decommission-hardware, Patch-For-Review: decommission mw13[49-51], mw1407 - https://phabricator.wikimedia.org/T383226#10778962 (VRiley-WMF) Open→Resolved
[06:32:33] <hashar>	 jouncebot: refresh
[06:32:33] <jouncebot>	 I refreshed my knowledge about deployments.
[06:32:38] <hashar>	 jouncebot: nowandnext
[06:32:38] <jouncebot>	 For the next 0 hour(s) and 27 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0600)
[06:32:38] <jouncebot>	 In 0 hour(s) and 27 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0700)
[06:33:00] <hashar>	 bots driven development
[06:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:36:07] <hashar>	 triaging is fun 
[06:36:07] <hashar>	 [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'testwiki.wikilambda_zobject_function_join' doesn't exist Function: MediaWiki\Extension\WikiLambda\ZObjectStore::findFirstZImplementationFunction Query: SELECT wlzf_zfunction_zid
[06:36:08] <hashar>	 :)
[06:37:30] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10778980 (wiki_willy) @VRiley-WMF & @Jclark-ctr - can you grab a spare from one of the decom'd servers for this?  >>! In T392751#10770238, @Marostegui wrote...
[06:41:24] <wikibugs>	 (PS1) Slyngshede: P:idp Default OIDC services to FLAT profile [puppet] - https://gerrit.wikimedia.org/r/1140074
[06:42:20] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on cloudcephmon1004 - https://phabricator.wikimedia.org/T392424#10778984 (VRiley-WMF) Open→Resolved a:VRiley-WMF
[06:42:26] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 30 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[06:46:59] <wikibugs>	 (CR) Nikerabbit: cxserver: Use URL instead of mw.Uri (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[06:47:35] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: Update Knative on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1139865 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[06:48:38] <wikibugs>	 (CR) KartikMistry: cxserver: Use URL instead of mw.Uri (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[06:49:15] <wikibugs>	 (PS2) KartikMistry: ContentTranslation: Add protocol to cxserver URL [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241)
[06:51:34] <wikibugs>	 (PS1) Jelto: gerrit: add more IP ranges to gerrit_abusers [puppet] - https://gerrit.wikimedia.org/r/1140075 (https://phabricator.wikimedia.org/T392467)
[06:56:54] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[06:58:40] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[06:59:27] <wikibugs>	 (CR) Slyngshede: [V:+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5407/console" [puppet] - https://gerrit.wikimedia.org/r/1140074 (owner: Slyngshede)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0700). Please do the needful.
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:57] <wikibugs>	 (PS1) Elukey: Revert^2 "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1140078
[07:01:18] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:20] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:02:35] <kart_>	 I'll go ahead with my config patch..
[07:04:01] <wikibugs>	 (Abandoned) Slyngshede: P:idp Default OIDC services to FLAT profile [puppet] - https://gerrit.wikimedia.org/r/1140074 (owner: Slyngshede)
[07:04:04] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10779031 (VRiley-WMF)
[07:05:06] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe1003 - https://phabricator.wikimedia.org/T389632#10779033 (VRiley-WMF) apus-fe1003 Racked and added into netbox  C2 U14
[07:05:47] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[07:06:35] <wikibugs>	 (Merged) jenkins-bot: ContentTranslation: Add protocol to cxserver URL [mediawiki-config] - https://gerrit.wikimedia.org/r/1139986 (https://phabricator.wikimedia.org/T390241) (owner: KartikMistry)
[07:07:38] <logmsgbot>	 !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1139986|ContentTranslation: Add protocol to cxserver URL (T390241)]]
[07:07:43] <stashbot>	 T390241: [Request] mediawiki.Uri is deprecated, use URL instead in ContentTranslation - https://phabricator.wikimedia.org/T390241
[07:07:55] <wikibugs>	 (PS1) Slyngshede: Permission management: Add pagination to log [software/bitu] - https://gerrit.wikimedia.org/r/1140080
[07:08:31] <wikibugs>	 (CR) Filippo Giunchedi: "Thank you for the patch, production has switched to Prometheus alerting for Puppet runs. The file was likely left behind as an oversight a" [puppet] - https://gerrit.wikimedia.org/r/1139930 (https://phabricator.wikimedia.org/T392961) (owner: Dwisehaupt)
[07:08:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1033 and es2033 to es2 masters T391921', diff saved to https://phabricator.wikimedia.org/P75674 and previous config saved to /var/cache/conftool/dbconfig/20250430-070853-marostegui.json
[07:08:59] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[07:09:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2031 es1026 T391921', diff saved to https://phabricator.wikimedia.org/P75675 and previous config saved to /var/cache/conftool/dbconfig/20250430-070937-marostegui.json
[07:09:48] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] "+Tiziano since he's working on PDUs too JFYI. LGTM" [puppet] - https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: Papaul)
[07:10:20] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:10:45] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es1026.eqiad.wmnet
[07:11:05] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es1026 - Upgrading es1026.eqiad.wmnet
[07:11:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2031.codfw.wmnet
[07:11:12] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1026 - Upgrading es1026.eqiad.wmnet
[07:11:16] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:11:28] <wikibugs>	 (CR) Ilias Sarantopoulos: [C:+1] Revert^2 "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1140078 (owner: Elukey)
[07:11:29] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2031 - Upgrading es2031.codfw.wmnet
[07:11:36] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2031 - Upgrading es2031.codfw.wmnet
[07:12:11] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe100[5-7] - https://phabricator.wikimedia.org/T389635#10779061 (VRiley-WMF) Added devices into netbox. Need to plan for rack placement.
[07:12:14] <wikibugs>	 (PS1) Marostegui: es1026: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140082 (https://phabricator.wikimedia.org/T391921)
[07:12:52] <wikibugs>	 (PS1) Marostegui: wmnet: Update es2-master CNAME [dns] - https://gerrit.wikimedia.org/r/1140083 (https://phabricator.wikimedia.org/T391921)
[07:13:34] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2031 - Upgrading es2031.codfw.wmnet
[07:13:41] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es2031 - Upgrading es2031.codfw.wmnet
[07:14:12] <logmsgbot>	 marostegui@cumin1002 upgrade (PID 519111) is awaiting input
[07:14:35] <logmsgbot>	 !log kartik@deploy1003 kartik: Backport for [[gerrit:1139986|ContentTranslation: Add protocol to cxserver URL (T390241)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:14:39] <wikibugs>	 (CR) Elukey: [C:+2] Revert^2 "ml-services: add seccomp profile to editquality-reverted in codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/1140078 (owner: Elukey)
[07:14:39] <stashbot>	 T390241: [Request] mediawiki.Uri is deprecated, use URL instead in ContentTranslation - https://phabricator.wikimedia.org/T390241
[07:15:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es1026 - Upgrading es1026.eqiad.wmnet
[07:15:09] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1026 - Upgrading es1026.eqiad.wmnet
[07:15:36] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es1026.eqiad.wmnet
[07:16:27] <logmsgbot>	 !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es2031.codfw.wmnet
[07:16:48] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10779083 (VRiley-WMF) a:VRiley-WMF
[07:17:29] <logmsgbot>	 !log kartik@deploy1003 kartik: Continuing with sync
[07:18:03] <jinxer-wm>	 FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[07:18:22] <godog>	 yes yes
[07:18:24] <godog>	 !incidents
[07:18:24] <sirenbot>	 6070 (UNACKED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[07:18:25] <sirenbot>	 6069 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad)
[07:18:25] <sirenbot>	 6068 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[07:18:29] <godog>	 !ack 6070
[07:18:29] <sirenbot>	 6070 (ACKED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[07:18:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[07:18:44] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2031.codfw.wmnet with reason: Maintenance
[07:18:51] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1026.eqiad.wmnet with reason: Maintenance
[07:19:15] <godog>	 there was an earlier page about another sessionstore host, which then recovered
[07:19:53] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es2031.codfw.wmnet with reason: Maintenance
[07:20:14] <godog>	 not sure what the spikes up are in utilization, I'd guess compactions though, cc urandom 
[07:20:19] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[07:20:57] <wikibugs>	 (CR) Marostegui: [C:+2] wmnet: Update es2-master CNAME [dns] - https://gerrit.wikimedia.org/r/1140083 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:21:58] <logmsgbot>	 !log marostegui@dns1006 START - running authdns-update
[07:23:03] <jinxer-wm>	 RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[07:23:58] <logmsgbot>	 !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1139986|ContentTranslation: Add protocol to cxserver URL (T390241)]] (duration: 16m 19s)
[07:24:03] <stashbot>	 T390241: [Request] mediawiki.Uri is deprecated, use URL instead in ContentTranslation - https://phabricator.wikimedia.org/T390241
[07:24:29] <logmsgbot>	 !log marostegui@dns1006 END - running authdns-update
[07:24:38] <wikibugs>	 (CR) Marostegui: [C:+2] es1026: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140082 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:26:35] <wikibugs>	 (PS1) Marostegui: es2031: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140085 (https://phabricator.wikimedia.org/T391921)
[07:27:21] <wikibugs>	 (CR) Marostegui: [C:+2] es2031: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1140085 (https://phabricator.wikimedia.org/T391921) (owner: Marostegui)
[07:27:38] <wikibugs>	 (CR) Jelto: [C:+1] "lgtm" [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[07:27:40] <wikibugs>	 (PS1) Elukey: admin_ng: allow to set seccomp for Knative-based pods on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1140086 (https://phabricator.wikimedia.org/T369493)
[07:29:05] <marostegui>	 !log Finished migrating es2 to MariaDB 10.11 T391921
[07:29:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:10] <stashbot>	 T391921: Migrate read only external store to MariaDB 10.11 - https://phabricator.wikimedia.org/T391921
[07:29:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75676 and previous config saved to /var/cache/conftool/dbconfig/20250430-072956-root.json
[07:30:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P75677 and previous config saved to /var/cache/conftool/dbconfig/20250430-073009-root.json
[07:34:21] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: allow to set seccomp for Knative-based pods on ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1140086 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[07:35:18] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[07:35:38] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[07:36:15] <wikibugs>	 (CR) Filippo Giunchedi: [C:+1] pdu_config_netbox: also fetch older PDUs from netbox [puppet] - https://gerrit.wikimedia.org/r/1135022 (https://phabricator.wikimedia.org/T387866) (owner: Tiziano Fogli)
[07:37:05] <wikibugs>	 (PS2) Klausman: thanos/swift: at pseudo secrets for mint_ro [labs/private] - https://gerrit.wikimedia.org/r/1140112
[07:43:05] <wikibugs>	 (PS8) BCornwall: varnish: Replace X-IS-ALT-DOMAIN with variable [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550)
[07:45:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75679 and previous config saved to /var/cache/conftool/dbconfig/20250430-074502-root.json
[07:45:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P75680 and previous config saved to /var/cache/conftool/dbconfig/20250430-074515-root.json
[07:47:44] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet
[07:47:49] <wikibugs>	 (PS1) Klausman: thanos/swift: add user for Mint, with r/o access [puppet] - https://gerrit.wikimedia.org/r/1140118
[07:47:49] <wikibugs>	 (CR) Klausman: [V:+1] "The changes to the pseudo-private and actual-private repos have already been merged." [puppet] - https://gerrit.wikimedia.org/r/1140118 (owner: Klausman)
[07:47:59] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host ms-be1088.eqiad.wmnet
[07:48:03] <wikibugs>	 (PS2) Klausman: thanos/swift: add user for Mint, with r/o access [puppet] - https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958)
[07:48:16] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:48:23] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:48:38] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-be1088.eqiad.wmnet
[07:49:00] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[07:49:06] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[07:50:54] <wikibugs>	 (CR) Brouberol: [C:+1] Update clouddumps contactgroups to reflect shared ownership [puppet] - https://gerrit.wikimedia.org/r/1139804 (owner: Btullis)
[07:51:47] <wikibugs>	 (CR) Brouberol: [C:+1] "Diff looks good" [deployment-charts] - https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) (owner: Btullis)
[07:53:42] <jinxer-wm>	 FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:54:42] <wikibugs>	 (PS1) Elukey: ml-services: enable seccomp defaults for ml-serve-codfw's isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1140120 (https://phabricator.wikimedia.org/T369493)
[07:55:28] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:55:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:18] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53799 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.192 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:58:47] <logmsgbot>	 !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2230.codfw.wmnet,db1176.eqiad.wmnet with reason: Maintenance
[08:00:05] <jouncebot>	 hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800)
[08:00:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75681 and previous config saved to /var/cache/conftool/dbconfig/20250430-080007-root.json
[08:00:11] <logmsgbot>	 !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host ms-be1088.eqiad.wmnet
[08:00:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P75682 and previous config saved to /var/cache/conftool/dbconfig/20250430-080021-root.json
[08:02:54] <hashar>	 good morning, this is Antoine your train conductor for the day.  It is sunny outside with no errors, we will scap take off in a short time, please fasten your seat belts and watch your favorite bugs
[08:03:03] <wikibugs>	 (PS1) MVernon: Swift: mark ms-be1060 as failed [puppet] - https://gerrit.wikimedia.org/r/1140121 (https://phabricator.wikimedia.org/T392796)
[08:04:09] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140122 (https://phabricator.wikimedia.org/T386222)
[08:04:10] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140122 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[08:04:27] <wikibugs>	 (CR) Elukey: [C:+1] "LGTM, but please wait for the final sign-off from Data Persistence (added Matthew to the change)." [puppet] - https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958) (owner: Klausman)
[08:04:57] <wikibugs>	 (CR) Marostegui: [C:+1] Swift: mark ms-be1060 as failed [puppet] - https://gerrit.wikimedia.org/r/1140121 (https://phabricator.wikimedia.org/T392796) (owner: MVernon)
[08:04:59] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140122 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[08:05:41] <wikibugs>	 (CR) MVernon: [C:+2] Swift: mark ms-be1060 as failed [puppet] - https://gerrit.wikimedia.org/r/1140121 (https://phabricator.wikimedia.org/T392796) (owner: MVernon)
[08:09:04] <wikibugs>	 (CR) Cyndywikime: Growth-Beta: Configure higher Impact Module edit limits for pilot wikis (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1136986 (https://phabricator.wikimedia.org/T341599) (owner: Cyndywikime)
[08:10:46] <wikibugs>	 ops-eqiad, SRE, Data-Persistence, Data-Persistence-Backup, and 2 others: Degraded RAID on db1171 - https://phabricator.wikimedia.org/T392751#10779215 (VRiley-WMF) Open→Resolved Swapped out the drive. Checked in with @Marostegui everything seems to be good. Closing this out.
[08:13:26] <wikibugs>	 (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5410/console" [puppet] - https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:14:24] <wikibugs>	 (PS1) Majavah: dynamicproxy: Use lua-resty-redis from Debian package [puppet] - https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175)
[08:14:30] <wikibugs>	 (CR) Jelto: [V:+1 C:+2] gitlab: use read-only object storage credentials on replicas [puppet] - https://gerrit.wikimedia.org/r/1139005 (https://phabricator.wikimedia.org/T378922) (owner: Jelto)
[08:14:34] <taavi>	 hashar: seat belts? on trains?
[08:15:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75683 and previous config saved to /var/cache/conftool/dbconfig/20250430-081512-root.json
[08:15:23] <wikibugs>	 (CR) Alexandros Kosiaris: [C:+1] k8s: rename V1beta1Eviction to support future upgrades [software/spicerack] - https://gerrit.wikimedia.org/r/1139851 (https://phabricator.wikimedia.org/T390857) (owner: Elukey)
[08:15:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 30%: Repooling', diff saved to https://phabricator.wikimedia.org/P75684 and previous config saved to /var/cache/conftool/dbconfig/20250430-081526-root.json
[08:17:05] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[08:17:57] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.44.0-wmf.27  refs T386222
[08:18:02] <stashbot>	 T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222
[08:18:31] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5411/co" [puppet] - https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[08:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[08:18:46] <hashar>	 taavi: we are in QA, safety first!! :b
[08:19:29] <wikibugs>	 (PS2) Majavah: dynamicproxy: Use lua-resty-redis from Debian package [puppet] - https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175)
[08:19:38] <hashar>	 MediaWiki\Mail\RecentChangeMailComposer::__construct(): Argument #6 ($timestamp) must be of type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.25/includes/mail/EmailNotification.php on line 222
[08:19:39] <hashar>	 oh yeah
[08:19:46] <hashar>	 so mails are not sent
[08:19:52] <hashar>	 (I imagine)
[08:20:51] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5412/co" [puppet] - https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[08:21:09] <wikibugs>	 (PS1) TrainBranchBot: group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140126 (https://phabricator.wikimedia.org/T386222)
[08:21:12] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140126 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[08:21:30] <hashar>	 I am rolling back, MediaWiki does not send recent changes email notifications anymore
[08:22:08] <wikibugs>	 (Merged) jenkins-bot: group0 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140126 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[08:24:41] <wikibugs>	 (PS1) Brouberol: airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987)
[08:26:55] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] dynamicproxy: Use lua-resty-redis from Debian package [puppet] - https://gerrit.wikimedia.org/r/1140125 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[08:27:13] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10779309 (MatthewVernon) Hi,  it's crashed again, after about an hour as far as I can tell (23:13:14 UTC)....
[08:27:41] <godog>	 filed the sessionstore pages as T392989
[08:27:42] <stashbot>	 T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989
[08:28:21] <wikibugs>	 (CR) Anzx: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[08:28:31] <godog>	 jouncebot: now and next
[08:28:31] <jouncebot>	 For the next 1 hour(s) and 31 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800)
[08:29:45] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1 C:+2] thanos: enable auto memlimit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1139852 (https://phabricator.wikimedia.org/T383966) (owner: Filippo Giunchedi)
[08:29:50] <wikibugs>	 (CR) Arnaudb: [C:+2] gerrit: failover bugfix [cookbooks] - https://gerrit.wikimedia.org/r/1139977 (https://phabricator.wikimedia.org/T260666) (owner: Arnaudb)
[08:29:57] <hashar>	 !log Rolled back MediaWiki train from group 1 to group 0 due to T392988  # T386222
[08:30:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:02] <stashbot>	 T392988: TypeError: MediaWiki\Mail\RecentChangeMailComposer::__construct(): Argument #6 ($timestamp) must be of type string, null given, called in /srv/mediawiki/php-1.44.0-wmf.25/includes/mail/EmailNotification.php on line 222 - https://phabricator.wikimedia.org/T392988
[08:30:03] <stashbot>	 T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222
[08:30:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75686 and previous config saved to /var/cache/conftool/dbconfig/20250430-083017-root.json
[08:30:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P75687 and previous config saved to /var/cache/conftool/dbconfig/20250430-083032-root.json
[08:33:11] <Emperor>	 !log ms-be1060 T392796 /usr/local/bin/swift_ring_manager -o /var/cache/swift_rings --doit --skip-dispersion-check --skip-replication-check --immediate-only -v 
[08:33:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:16] <stashbot>	 T392796: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796
[08:33:42] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:34:32] <wikibugs>	 (CR) MVernon: [C:+2] Swift: drain ms-be2080 (prep for VLAN move) [puppet] - https://gerrit.wikimedia.org/r/1138830 (https://phabricator.wikimedia.org/T354872) (owner: MVernon)
[08:35:24] <logmsgbot>	 !log hashar@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.27  refs T386222
[08:35:29] <stashbot>	 T386222: 1.44.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T386222
[08:37:42] <wikibugs>	 (PS1) MVernon: swift: remove ms-be1060 entirely [puppet] - https://gerrit.wikimedia.org/r/1140130 (https://phabricator.wikimedia.org/T392796)
[08:41:20] <wikibugs>	 (PS16) Majavah: dynamicproxy: Provision AAAA records [puppet] - https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175)
[08:41:48] <wikibugs>	 (CR) Klausman: [C:+1] ml-services: enable seccomp defaults for ml-serve-codfw's isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1140120 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[08:45:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75688 and previous config saved to /var/cache/conftool/dbconfig/20250430-084523-root.json
[08:45:37] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P75689 and previous config saved to /var/cache/conftool/dbconfig/20250430-084537-root.json
[08:48:02] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[08:50:24] <wikibugs>	 (CR) Majavah: [C:+2] dynamicproxy: Provision AAAA records [puppet] - https://gerrit.wikimedia.org/r/1088338 (https://phabricator.wikimedia.org/T379175) (owner: Majavah)
[08:50:34] <wikibugs>	 (PS2) Anzx: mswikisource: add Karya and Gerbang namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984)
[08:51:03] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: Anzx)
[08:53:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[08:54:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org
[08:54:30] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10779403 (tappof) Hi @Madalina, While I was checking, I noticed that you've already been added to the group.  ` root@...:~#...
[08:56:36] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10779409 (tappof)
[08:57:25] <wikibugs>	 (CR) Btullis: [C:+1] airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) (owner: Brouberol)
[08:59:34] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3548900) is awaiting input
[09:00:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org
[09:00:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75690 and previous config saved to /var/cache/conftool/dbconfig/20250430-090028-root.json
[09:00:30] <wikibugs>	 (CR) Stevemunene: [C:+1] airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) (owner: Brouberol)
[09:00:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P75691 and previous config saved to /var/cache/conftool/dbconfig/20250430-090041-root.json
[09:01:27] <Dreamy_Jazz>	 jouncebot: nowandnext
[09:01:27] <jouncebot>	 For the next 0 hour(s) and 58 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800)
[09:01:27] <jouncebot>	 In 0 hour(s) and 58 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000)
[09:01:49] <Dreamy_Jazz>	 hashar: Any opposition for me to deploy a security patch now?
[09:02:21] <wikibugs>	 (CR) Brouberol: [C:+2] airflow-test-k8s: double the request/limit memory of the webserver [deployment-charts] - https://gerrit.wikimedia.org/r/1140128 (https://phabricator.wikimedia.org/T392987) (owner: Brouberol)
[09:02:52] <moritzm>	 FYI, aux-k8s-etcd2004 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees)
[09:02:59] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
[09:03:58] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs1013.eqiad.wmnet} and A:liberica
[09:04:18] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs1013.eqiad.wmnet} and A:liberica
[09:05:28] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[09:07:10] <icinga-wm>	 PROBLEM - BGP status on lsw1-e1-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64600/IPv4: Active - PyBal https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:09:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
[09:10:09] <logmsgbot>	 !log vgutierrez@cumin1002 START - Cookbook sre.hosts.reboot-single for host lvs1013.eqiad.wmnet
[09:10:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[09:10:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast7001.wikimedia.org
[09:10:30] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.69 ms
[09:12:36] <logmsgbot>	 !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lvs1013.eqiad.wmnet
[09:13:10] <icinga-wm>	 RECOVERY - BGP status on lsw1-e1-eqiad.mgmt is OK: BGP OK - up: 9, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:13:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2019:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:15:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75692 and previous config saved to /var/cache/conftool/dbconfig/20250430-091534-root.json
[09:15:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P75693 and previous config saved to /var/cache/conftool/dbconfig/20250430-091547-root.json
[09:16:27] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast7001.wikimedia.org
[09:16:31] <wikibugs>	 (PS2) Majavah: dynamicproxy: expose api on port 443 [puppet] - https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453)
[09:16:33] <wikibugs>	 (CR) Elukey: [C:+2] ml-services: enable seccomp defaults for ml-serve-codfw's isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/1140120 (https://phabricator.wikimedia.org/T369493) (owner: Elukey)
[09:17:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[09:17:36] <elukey>	 !log manual restart of the waterline service on maps1009
[09:17:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:18:16] <wikibugs>	 (CR) CI reject: [V:-1] dynamicproxy: expose api on port 443 [puppet] - https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453) (owner: Majavah)
[09:18:20] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:19:29] <wikibugs>	 (PS3) Majavah: dynamicproxy: expose api on port 443 [puppet] - https://gerrit.wikimedia.org/r/781950 (https://phabricator.wikimedia.org/T305453)
[09:22:45] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:22:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[09:23:59] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3573657) is awaiting input
[09:24:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[09:24:44] <wikibugs>	 (PS1) Filippo Giunchedi: thanos: move to native trace sampling 0.1% [puppet] - https://gerrit.wikimedia.org/r/1140135 (https://phabricator.wikimedia.org/T392994)
[09:25:12] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10779532 (fnegri)
[09:26:33] <wikibugs>	 (PS1) Brouberol: dse-k8s-eqiad: substantially increase job sidecar controller CPU resources [deployment-charts] - https://gerrit.wikimedia.org/r/1140136 (https://phabricator.wikimedia.org/T392995)
[09:26:43] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:27:56] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:28:44] <godog>	 !log bounce prometheus-statsd-exporter on stat1011 - T389344
[09:28:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:48] <stashbot>	 T389344: analytics/wmde/scripts Graphite to Prometheus migration - https://phabricator.wikimedia.org/T389344
[09:28:55] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:29:48] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:30:29] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' .
[09:30:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es1026 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75694 and previous config saved to /var/cache/conftool/dbconfig/20250430-093040-root.json
[09:30:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'es2031 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P75695 and previous config saved to /var/cache/conftool/dbconfig/20250430-093053-root.json
[09:31:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[09:31:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[09:31:48] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[09:32:32] <wikibugs>	 (CR) Klausman: [V:+2 C:+2] admin/data.yaml: Add dr0ptp4kt (Adam Baso) to users of ml-lab100x [puppet] - https://gerrit.wikimedia.org/r/1139779 (owner: Klausman)
[09:32:54] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'llm' for release 'main' .
[09:33:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2020:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:34:44] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'logo-detection' for release 'main' .
[09:35:09] <logmsgbot>	 !log klausman@cumin1002 START - Cookbook sre.hosts.reboot-single for host ml-lab1002.eqiad.wmnet
[09:35:53] <wikibugs>	 ops-eqiad, SRE, DC-Ops, cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10779603 (fnegri) Resolved→Open Reopening as unfortunately the alert is still flapping. It looks like the whole rack's temperatu...
[09:35:53] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:36:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'readability' for release 'main' .
[09:36:34] <elukey>	 the high latency for mlserve is me, I am deploying a lot of services, going to pause for a sec
[09:38:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet
[09:38:10] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' .
[09:38:47] <logmsgbot>	 !log elukey@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revision-models' for release 'main' .
[09:40:31] <wikibugs>	 (CR) Btullis: [C:+1] dse-k8s-eqiad: substantially increase job sidecar controller CPU resources [deployment-charts] - https://gerrit.wikimedia.org/r/1140136 (https://phabricator.wikimedia.org/T392995) (owner: Brouberol)
[09:41:16] <wikibugs>	 (CR) Superpes15: "It seems that you didn't run tox as indicated on logos/README.md! Did you follow the steps provided??" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[09:41:54] <logmsgbot>	 !log klausman@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-lab1002.eqiad.wmnet
[09:41:58] <wikibugs>	 (PS1) Elukey: admin_ng: enable Knative's secure-pod-defaults for ml-serve-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1140140 (https://phabricator.wikimedia.org/T369493)
[09:42:51] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Add krb1002 to the list of KDCs presented to Kerberos clients [puppet] - https://gerrit.wikimedia.org/r/1139850 (https://phabricator.wikimedia.org/T390863) (owner: Muehlenhoff)
[09:42:58] <wikibugs>	 (CR) Superpes15: [C:-1] "Please follow logos/README.md when you try to change a logo (you need to use tox)" [mediawiki-config] - https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: GergesShamon)
[09:44:14] <wikibugs>	 (CR) FNegri: [C:+1] "SGTM. If I understand correctly how contactgroups work, this will only affect Icinga alerts? For example we currently have a flapping aler" [puppet] - https://gerrit.wikimedia.org/r/1139804 (owner: Btullis)
[09:44:58] <wikibugs>	 (CR) Federico Ceratto: [C:+2] sre.mysql.clone: speed up pooling in [cookbooks] - https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: Federico Ceratto)
[09:45:00] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3595577) is awaiting input
[09:45:10] <wikibugs>	 (CR) Arnaudb: [C:+1] gerrit: add more IP ranges to gerrit_abusers [puppet] - https://gerrit.wikimedia.org/r/1140075 (https://phabricator.wikimedia.org/T392467) (owner: Jelto)
[09:45:21] <wikibugs>	 (CR) Federico Ceratto: [V:+2 C:+2] sre.mysql.clone: speed up pooling in [cookbooks] - https://gerrit.wikimedia.org/r/1139799 (https://phabricator.wikimedia.org/T392883) (owner: Federico Ceratto)
[09:45:49] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10779631 (Jelto) I switched the replica to use the read-only credentials but unfortunately I get a `AccessDenied` error when acce...
[09:45:53] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlserve&var-latency_percentile=0.95&var-verb=PATCH - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[09:46:28] <wikibugs>	 (CR) Jelto: [C:+2] gerrit: add more IP ranges to gerrit_abusers [puppet] - https://gerrit.wikimedia.org/r/1140075 (https://phabricator.wikimedia.org/T392467) (owner: Jelto)
[09:48:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
[09:50:26] <wikibugs>	 (CR) Vgutierrez: varnish: Replace X-IS-ALT-DOMAIN with variable (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1068085 (https://phabricator.wikimedia.org/T373550) (owner: BCornwall)
[09:51:51] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10779640 (Stevemunene)
[09:52:34] <kostajh>	 !jouncebot nowandnext
[09:52:35] <wm-bot>	 a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot
[09:52:48] <kostajh>	 jouncebot: nowandnext
[09:52:48] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): MediaWiki train - Utc-0+Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T0800)
[09:52:48] <jouncebot>	 In 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000)
[09:53:09] <kostajh>	 hashar: can I deploy a security patch now? 
[09:54:45] <hashar>	 kostajh: yes sure
[09:54:50] <hashar>	 do note I have rolled back the train this morning
[09:54:59] <hashar>	 https://versions.toolforge.org/
[09:55:22] <hashar>	 so we are mostly still on wmf.25. After lunch I will revisit the blocker and see whether it might have been a red herring
[09:55:25] <hashar>	 I am off for lunch:
[09:55:26] <hashar>	 !
[09:55:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
[09:55:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet
[09:55:56] <hashar>	 so if you need assistance, we can do it this afternoon :)
[09:58:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2021:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:58:47] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000)
[10:01:03] <wikibugs>	 (CR) FNegri: [C:+1] P:toolforge: disable_tool: Don't log diffs with secrets [puppet] - https://gerrit.wikimedia.org/r/1139443 (owner: Majavah)
[10:01:26] <wikibugs>	 (PS1) Muehlenhoff: Stop passing krb2002 to Kerberos clients [puppet] - https://gerrit.wikimedia.org/r/1140142 (https://phabricator.wikimedia.org/T390863)
[10:01:28] <wikibugs>	 (PS1) Muehlenhoff: Switch krb2002 to nftables [puppet] - https://gerrit.wikimedia.org/r/1140143 (https://phabricator.wikimedia.org/T390863)
[10:01:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2022.codfw.wmnet
[10:04:16] <kostajh>	 hopefully no assistance needed :) 
[10:04:20] <kostajh>	 I'm starting the deploy now 
[10:04:34] <wikibugs>	 (CR) Hnowlan: [C:+1] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536) (owner: Scott French)
[10:04:38] <wikibugs>	 (CR) FNegri: [C:+1] "The alertmanager team routing is currently at https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hie" [puppet] - https://gerrit.wikimedia.org/r/1139804 (owner: Btullis)
[10:04:55] <wikibugs>	 (PS2) Federico Ceratto: sre.mysql.pool: remove connection count [cookbooks] - https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981)
[10:04:55] <wikibugs>	 (CR) Federico Ceratto: "A small cleanup." [cookbooks] - https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: Federico Ceratto)
[10:06:22] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:06:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2022.codfw.wmnet
[10:07:18] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:13:00] <wikibugs>	 (CR) Majavah: [C:+2] P:toolforge: disable_tool: Don't log diffs with secrets [puppet] - https://gerrit.wikimedia.org/r/1139443 (owner: Majavah)
[10:13:16] <wikibugs>	 (PS1) Effie Mouzeli: admin: move jiji to ops-limited Bug: T392998 [puppet] - https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998)
[10:13:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2022.codfw.wmnet
[10:13:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2022:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:13:54] <wikibugs>	 (PS2) Effie Mouzeli: admin: move jiji to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998)
[10:13:56] <wikibugs>	 (CR) CI reject: [V:-1] admin: move jiji to ops-limited [puppet] - https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) (owner: Effie Mouzeli)
[10:13:56] <kostajh>	 syncing now 
[10:14:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2022.codfw.wmnet
[10:15:56] <logmsgbot>	 !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.add-wiki for database nupwiki (T390714)
[10:16:00] <stashbot>	 T390714: [wikireplicas] Create views for new wiki nupwiki - https://phabricator.wikimedia.org/T390714
[10:16:06] <logmsgbot>	 !log fnegri@cumin1002 END (FAIL) - Cookbook sre.wikireplicas.add-wiki (exit_code=99) for database nupwiki (T390714)
[10:16:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2030.codfw.wmnet
[10:17:44] <wikibugs>	 (CR) Slyngshede: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) (owner: Effie Mouzeli)
[10:19:24] <icinga-wm>	 RECOVERY - MegaRAID on db1171 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:23:47] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2030.codfw.wmnet
[10:24:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2031.codfw.wmnet
[10:32:33] <wikibugs>	 (CR) Hnowlan: [C:+2] mw:maintenance:updatequerypages: move all deadendpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139432 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[10:32:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2031.codfw.wmnet
[10:33:12] <hnowlan>	 jouncebot: nowandnext
[10:33:13] <jouncebot>	 For the next 0 hour(s) and 26 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1000)
[10:33:13] <jouncebot>	 In 0 hour(s) and 26 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1100)
[10:33:22] <moritzm>	 !log installing curl security updates
[10:33:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:36:41] <wikibugs>	 SRE, SRE-swift-storage, Ceph, collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10779759 (MatthewVernon) I think I found the relevant request - was this about 08:33 UTC today (and then 09:07 and 09:27)?  ` Apr...
[10:37:09] <kostajh>	 hnowlan: I'm deploying a security patch 
[10:37:45] <kostajh>	 hashar: our patch had an issue, so we're making an update to it, and will sync that. Then we'll sync another patch to wmf.27. 
[10:39:32] <hnowlan>	 kostajh: ack, I don't have any conflicts
[10:39:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2031.codfw.wmnet
[10:40:07] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2031.codfw.wmnet
[10:40:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2033.codfw.wmnet
[10:40:17] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2033.codfw.wmnet
[10:40:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:41:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2035.codfw.wmnet
[10:41:48] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[10:43:42] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service ganeti2031:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:45:35] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3659261) is awaiting input
[10:45:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2035.codfw.wmnet
[10:46:40] <wikibugs>	 (CR) Brouberol: [C:+2] dse-k8s-eqiad: substantially increase job sidecar controller CPU resources [deployment-charts] - https://gerrit.wikimedia.org/r/1140136 (https://phabricator.wikimedia.org/T392995) (owner: Brouberol)
[10:46:53] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[10:47:01] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[10:48:20] <XioNoX>	 !log remove cloudcontrol1005 (decom) from eqiad/codfw core routers 
[10:48:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:49:33] <wikibugs>	 (CR) MVernon: [C:-1] "Hi," [puppet] - https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958) (owner: Klausman)
[10:51:05] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[10:51:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2035.codfw.wmnet
[10:51:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2035.codfw.wmnet
[10:51:25] <logmsgbot>	 !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[10:51:29] <wikibugs>	 (PS1) Hnowlan: mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534)
[10:51:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2036.codfw.wmnet
[10:52:01] <wikibugs>	 (CR) MVernon: [C:-1] "said change is https://gerrit.wikimedia.org/r/c/labs/private/+/1140112 (I'm just noting this here so I can find it later if needed)." [puppet] - https://gerrit.wikimedia.org/r/1140118 (https://phabricator.wikimedia.org/T391958) (owner: Klausman)
[10:52:31] <wikibugs>	 (PS1) Ayounsi: cr3/4-ulsfo: Set et-0/0/0.0 OSPF metric to 100 [homer/public] - https://gerrit.wikimedia.org/r/1140153 (https://phabricator.wikimedia.org/T390731)
[10:52:45] <wikibugs>	 (CR) Hnowlan: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[10:53:18] <wikibugs>	 (CR) Ayounsi: [C:+2] "Self merging as it should result in a NOOP." [homer/public] - https://gerrit.wikimedia.org/r/1140153 (https://phabricator.wikimedia.org/T390731) (owner: Ayounsi)
[10:53:58] <wikibugs>	 (Merged) jenkins-bot: cr3/4-ulsfo: Set et-0/0/0.0 OSPF metric to 100 [homer/public] - https://gerrit.wikimedia.org/r/1140153 (https://phabricator.wikimedia.org/T390731) (owner: Ayounsi)
[10:55:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2036.codfw.wmnet
[10:59:48] <wikibugs>	 (CR) Vgutierrez: [C:+2] varnish: Add basic edge uniques handling [puppet] - https://gerrit.wikimedia.org/r/1136999 (https://phabricator.wikimedia.org/T391411) (owner: Vgutierrez)
[11:00:04] <jouncebot>	 mvolz: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1100).
[11:01:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2036.codfw.wmnet
[11:01:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2036.codfw.wmnet
[11:03:31] <wikibugs>	 (CR) Jelto: [C:+2] make helm3 alternative entry dependent on helm [debs/helm3] - https://gerrit.wikimedia.org/r/1137010 (https://phabricator.wikimedia.org/T387548) (owner: Jelto)
[11:04:58] <kostajh>	 syncing the updated patch to wmf.25 
[11:07:39] <wikibugs>	 SRE-tools, homer, Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#10779851 (ayounsi) Another tiny improvement would be to only prompt for yes/no when there is only 1 target device.
[11:08:34] <Mvolz>	 jouncebot: nowandnext
[11:08:34] <jouncebot>	 For the next 0 hour(s) and 51 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1100)
[11:08:34] <jouncebot>	 In 1 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1300)
[11:09:50] <wikibugs>	 (CR) Hnowlan: [C:+2] mw::maintenance: migrate a single updatequerypages_ancientpages shard [puppet] - https://gerrit.wikimedia.org/r/1139437 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[11:10:11] <wikibugs>	 (CR) Ladsgroup: [C:+1] "I haven't tested it but looks good to me." [cookbooks] - https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: Federico Ceratto)
[11:10:12] <wikibugs>	 (Abandoned) Hnowlan: mw:maintenance: migrate all updatequerypages_ancientpages jobs to k8s [puppet] - https://gerrit.wikimedia.org/r/1139438 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[11:16:15] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[11:16:22] <wikibugs>	 (PS1) Mvolz: Update zotero package-lock.json [deployment-charts] - https://gerrit.wikimedia.org/r/1140160
[11:16:24] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[11:17:09] <jelto>	 !log "Imported helm317 3.17.0-2  to bullseye-wikimedia and bookworm-wikimedia - T387548"
[11:17:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:17:14] <stashbot>	 T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548
[11:18:16] <wikibugs>	 (CR) Mvolz: [C:+2] Update zotero package-lock.json [deployment-charts] - https://gerrit.wikimedia.org/r/1140160 (owner: Mvolz)
[11:18:21] <kostajh>	 finished with the sync to wmf.25, moving on to wmf.27
[11:18:48] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10779867 (Stevemunene) The Hosts an-worker116[6-8] are verified with puppet disabled, and the steps followed  ` stevem...
[11:18:58] <wikibugs>	 ops-eqiad, SRE, DC-Ops, Data-Platform-SRE (2025-04-12 - 2025-05-02): Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170#10779868 (Stevemunene)
[11:19:50] <wikibugs>	 (Merged) jenkins-bot: Update zotero package-lock.json [deployment-charts] - https://gerrit.wikimedia.org/r/1140160 (owner: Mvolz)
[11:19:57] <wikibugs>	 ops-codfw, DC-Ops, Infrastructure-Foundations, netops: Upgrade codfw E/F Juniper equipment to Junos 23.x - https://phabricator.wikimedia.org/T393001 (ayounsi) NEW
[11:20:55] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] START helmfile.d/services/zotero: apply
[11:21:19] <logmsgbot>	 !log mvolz@deploy1003 helmfile [staging] DONE helmfile.d/services/zotero: apply
[11:21:53] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] START helmfile.d/services/zotero: apply
[11:22:22] <logmsgbot>	 !log mvolz@deploy1003 helmfile [codfw] DONE helmfile.d/services/zotero: apply
[11:23:23] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] START helmfile.d/services/zotero: apply
[11:23:52] <logmsgbot>	 !log mvolz@deploy1003 helmfile [eqiad] DONE helmfile.d/services/zotero: apply
[11:26:46] <wikibugs>	 (PS1) Ayounsi: gNMIc start collecting data from pfw [puppet] - https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052)
[11:27:24] <logmsgbot>	 !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device pfw1a-eqiad
[11:29:38] <wikibugs>	 (PS2) Ayounsi: gNMIc start collecting data from pfw [puppet] - https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052)
[11:29:44] <logmsgbot>	 !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device pfw1a-eqiad
[11:30:29] <wikibugs>	 (PS1) Ayounsi: Enable gNMI on pfw devices [homer/public] - https://gerrit.wikimedia.org/r/1140163 (https://phabricator.wikimedia.org/T390052)
[11:31:05] <wikibugs>	 (CR) Ayounsi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052) (owner: Ayounsi)
[11:32:09] <kostajh>	 syncing to wmf.27 now 
[11:34:28] <wikibugs>	 (Abandoned) Ayounsi: Enable gNMI on pfw devices [homer/public] - https://gerrit.wikimedia.org/r/1140163 (https://phabricator.wikimedia.org/T390052) (owner: Ayounsi)
[11:34:55] <wikibugs>	 (CR) Federico Ceratto: [C:+1] sre.mysql.pool: remove connection count [cookbooks] - https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: Federico Ceratto)
[11:34:57] <wikibugs>	 (CR) Federico Ceratto: [C:+2] sre.mysql.pool: remove connection count [cookbooks] - https://gerrit.wikimedia.org/r/1140138 (https://phabricator.wikimedia.org/T392981) (owner: Federico Ceratto)
[11:36:11] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2037.codfw.wmnet
[11:37:22] <XioNoX>	 !log enable gnmi on pfw1-eqiad - T390052
[11:37:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:37:27] <stashbot>	 T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052
[11:38:32] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1159.eqiad.wmnet with reason: Maintenance
[11:38:39] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1159 (T392806)', diff saved to https://phabricator.wikimedia.org/P75696 and previous config saved to /var/cache/conftool/dbconfig/20250430-113838-fceratto.json
[11:42:14] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3714991) is awaiting input
[11:43:24] <moritzm>	 FYI, ml-etcd2002 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees)
[11:43:31] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2037.codfw.wmnet
[11:45:10] <kostajh>	 done with wmf.27
[11:45:28] <icinga-wm>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:57] <kostajh>	 !log Deployed patches for T392976 to wmf.25 and wmf.27
[11:46:00] <wikibugs>	 (PS1) Jelto: helm: remove duplicate alternatives::select entry [puppet] - https://gerrit.wikimedia.org/r/1140164 (https://phabricator.wikimedia.org/T387548)
[11:46:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T392806)', diff saved to https://phabricator.wikimedia.org/P75697 and previous config saved to /var/cache/conftool/dbconfig/20250430-114734-fceratto.json
[11:48:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2037.codfw.wmnet
[11:48:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2037.codfw.wmnet
[11:50:30] <icinga-wm>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms
[11:51:06] <wikibugs>	 (CR) Ayounsi: [C:+2] gNMIc start collecting data from pfw [puppet] - https://gerrit.wikimedia.org/r/1140162 (https://phabricator.wikimedia.org/T390052) (owner: Ayounsi)
[11:51:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet
[11:53:14] <kostajh>	 hashar: I'm done with the security patches 
[11:53:52] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:54:50] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:55:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet
[11:58:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet
[12:01:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2038.codfw.wmnet
[12:02:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet
[12:02:31] <jinxer-wm>	 FIRING: Primary outbound port utilisation over 80%  #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[12:02:31] <jinxer-wm>	 FIRING: Primary inbound port utilisation over 80%  #page: Alert for device cr2-eqiad.wikimedia.org - Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[12:02:43] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P75698 and previous config saved to /var/cache/conftool/dbconfig/20250430-120242-fceratto.json
[12:03:40] <godog>	 checking
[12:03:44] <godog>	 !incidents
[12:03:44] <sirenbot>	 6071 (ACKED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[12:03:44] <sirenbot>	 6072 (UNACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[12:03:44] <sirenbot>	 6070 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[12:03:45] <sirenbot>	 6069 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad)
[12:03:45] <sirenbot>	 6068 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[12:03:51] <godog>	 !ack 6072
[12:03:52] <sirenbot>	 6072 (ACKED)  Primary inbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[12:04:30] <XioNoX>	 godog: analytics job
[12:04:37] <godog>	 hah! thank you XioNoX 
[12:04:50] <godog>	 anything actionable atm ?
[12:05:11] <XioNoX>	 godog: pinging the person who ran it and asking them to stop ideally
[12:05:24] <XioNoX>	 with QoS the impact might be lower, checking
[12:05:59] <godog>	 ok I'll look at how to identify analytics jobs
[12:06:42] <XioNoX>	 godog: looks like we're dropping "normal" queue packets, so that's not ideal
[12:07:31] <jinxer-wm>	 RESOLVED: Primary outbound port utilisation over 80%  #page: Device asw2-c-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[12:07:31] <jinxer-wm>	 RESOLVED: Primary inbound port utilisation over 80%  #page: Device cr2-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80%  #page   - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[12:07:33] <XioNoX>	 it also spiked and went down, so looks good for now
[12:08:22] <godog>	 indeed, might come back I'd guess
[12:08:30] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3740759) is awaiting input
[12:08:32] <godog>	 FWIW what I'm looking at is https://yarn.wikimedia.org/cluster/apps/RUNNING
[12:08:49] <godog>	 it is quite opaque to me tho
[12:13:11] <godog>	 similarly opaque is https://airflow.wikimedia.org
[12:17:28] <icinga-wm>	 PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:44] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:17:49] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P75699 and previous config saved to /var/cache/conftool/dbconfig/20250430-121749-fceratto.json
[12:18:18] <icinga-wm>	 RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53800 bytes in 0.119 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:34] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[12:18:42] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[12:18:44] <XioNoX>	 !log test `host-inbound-traffic system-services any-service` on mr1-ulsfo
[12:18:47] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:27] <wikibugs>	 (03PS3) 10Arnaudb: gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833)
[12:24:10] <wikibugs>	 (03PS1) 10Majavah: admin: Temporarily remove Taavi's access [puppet] - 10https://gerrit.wikimedia.org/r/1140171 (https://phabricator.wikimedia.org/T393000)
[12:24:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet
[12:24:59] <moritzm>	 FYI, aux-k8s-etcd2005 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees)
[12:25:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2038.codfw.wmnet
[12:27:00] <icinga-wm>	 PROBLEM - Host aux-k8s-etcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[12:28:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet
[12:30:28] <icinga-wm>	 RECOVERY - Host aux-k8s-etcd2005 is UP: PING OK - Packet loss = 0%, RTA = 30.56 ms
[12:30:36] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2038.codfw.wmnet
[12:30:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2038.codfw.wmnet
[12:32:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T392806)', diff saved to https://phabricator.wikimedia.org/P75700 and previous config saved to /var/cache/conftool/dbconfig/20250430-123255-fceratto.json
[12:33:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1161.eqiad.wmnet with reason: Maintenance
[12:33:21] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[12:33:28] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1161 (T392806)', diff saved to https://phabricator.wikimedia.org/P75701 and previous config saved to /var/cache/conftool/dbconfig/20250430-123327-fceratto.json
[12:36:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet
[12:37:18] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2039.codfw.wmnet
[12:37:55] <wikibugs>	 (03PS1) 10Majavah: common: Temporarily remove some keys [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000)
[12:38:09] <godog>	 jouncebot: now and next
[12:38:10] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 21 minute(s)
[12:38:36] <godog>	 I'll reboot alert2002
[12:38:45] <logmsgbot>	 !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host alert2002.wikimedia.org
[12:38:46] <logmsgbot>	 !log filippo@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2002.wikimedia.org
[12:40:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T392806)', diff saved to https://phabricator.wikimedia.org/P75702 and previous config saved to /var/cache/conftool/dbconfig/20250430-124018-fceratto.json
[12:41:36] <godog>	 yeah reboot-single doesn't work for alert hosts because they are not in icinga
[12:41:46] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet
[12:42:00] <logmsgbot>	 !log filippo@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on alert2002.wikimedia.org with reason: kernel
[12:42:49] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3776728) is awaiting input
[12:43:01] <moritzm>	 FYI, kubestagemaster2004 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees)
[12:43:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2039.codfw.wmnet
[12:43:08] <logmsgbot>	 !log filippo@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on alert2002.wikimedia.org with reason: new kernel
[12:43:31] <godog>	 siiigh ok can't downtime manually even with --force, I'll just do it
[12:43:36] <godog>	 moritzm: ack
[12:43:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet
[12:43:58] <wikibugs>	 (03CR) 10Btullis: [V:03+1 C:03+2] Update clouddumps contactgroups to reflect shared ownership [puppet] - 10https://gerrit.wikimedia.org/r/1139804 (owner: 10Btullis)
[12:44:08] <godog>	 !log reboot alert2002
[12:44:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:45:03] <wikibugs>	 (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Add private values files to resources deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[12:45:32] <icinga-wm>	 PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100%
[12:45:42] <wikibugs>	 (03PS3) 10Btullis: Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - 10https://gerrit.wikimedia.org/r/1135472
[12:46:34] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Add private values files to resources deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1139906 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[12:46:53] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me. Thanks." [cookbooks] - 10https://gerrit.wikimedia.org/r/1136836 (owner: 10Volans)
[12:47:07] <wikibugs>	 (03CR) 10Btullis: [V:03+2 C:03+2] Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - 10https://gerrit.wikimedia.org/r/1135472 (owner: 10Btullis)
[12:47:29] <wikibugs>	 (03Abandoned) 10Btullis: Revert "Add a dummy password for rsyncing mediawiki-dumps-legacy" [labs/private] - 10https://gerrit.wikimedia.org/r/1135472 (owner: 10Btullis)
[12:48:39] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2039.codfw.wmnet
[12:48:42] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:48:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2039.codfw.wmnet
[12:49:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet
[12:49:37] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[12:49:43] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[12:49:47] <wikibugs>	 (03CR) 10Jelto: gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:49:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:50:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2040.codfw.wmnet
[12:50:28] <jinxer-wm>	 FIRING: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:50:35] <icinga-wm>	 RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms
[12:50:54] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[12:50:58] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[12:51:12] <wikibugs>	 (03PS5) 10Arnaudb: gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833)
[12:51:36] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5416/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:51:58] <wikibugs>	 (03CR) 10Arnaudb: "Thanks @jwodstrcil@wikimedia.org for confirming there was something fishy" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:52:41] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service kubestagemaster2004:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:53:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2040.codfw.wmnet
[12:53:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet
[12:54:05] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5417/co" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[12:54:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster2004.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2004.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[12:55:26] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P75703 and previous config saved to /var/cache/conftool/dbconfig/20250430-125525-fceratto.json
[12:55:28] <jinxer-wm>	 RESOLVED: KeyholderUnarmed: 1 unarmed Keyholder key(s) on alert2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:57:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet
[12:58:48] <wikibugs>	 06SRE, 06serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10780146 (10Ladsgroup) 05Open→03Resolved Boldly closing: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=mwmaint1002&var-datasource=thanos&var-cluster=misc&from=...
[12:58:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2040.codfw.wmnet
[12:59:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2040.codfw.wmnet
[12:59:45] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM so far, though there might be a followup later (see my comment on the task)." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx)
[13:00:05] <jouncebot>	 Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I ♥ Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1300).
[13:00:05] <jouncebot>	 anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:11] <Lucas_WMDE>	 o/
[13:00:39] <anzx>	 o/
[13:00:46] <Lucas_WMDE>	 anzx: do you mind if I update the commit message to add the english namespace aliases as well? makes for a more useful git log imho :)
[13:00:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow7001.magru.wmnet
[13:00:56] <Lucas_WMDE>	 (e.g. I just looked for some other commits adding portal namespaces for reference ^^)
[13:01:12] <Lucas_WMDE>	 anyway, I can deploy
[13:01:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2041.codfw.wmnet
[13:01:36] <anzx>	 Lucas_WMDE: sure
[13:01:58] <wikibugs>	 (03PS3) 10Lucas Werkmeister (WMDE): mswikisource: add Karya (Work) and Gerbang (Portal) namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx)
[13:03:06] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx)
[13:03:08] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Stop passing krb2002 to Kerberos clients [puppet] - 10https://gerrit.wikimedia.org/r/1140142 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[13:03:21] <wikibugs>	 (03CR) 10Elukey: [C:03+1] Switch krb2002 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1140143 (https://phabricator.wikimedia.org/T390863) (owner: 10Muehlenhoff)
[13:03:52] <wikibugs>	 (03Merged) 10jenkins-bot: mswikisource: add Karya (Work) and Gerbang (Portal) namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140129 (https://phabricator.wikimedia.org/T392984) (owner: 10Anzx)
[13:04:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1140129|mswikisource: add Karya (Work) and Gerbang (Portal) namespaces (T392984)]]
[13:04:21] <stashbot>	 T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984
[13:04:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow7001.magru.wmnet
[13:04:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2041.codfw.wmnet
[13:06:08] <wikibugs>	 (03CR) 10Jelto: [V:03+1] gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:07:34] <logmsgbot>	 !log stevemunene@deploy1003 Started deploy [analytics/refinery@ea1cff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ea1cff2c]
[13:07:59] <stevemunene>	 !log Deploying Refinery at 1136103: Add mad.wikisource to pageview allowlist | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1136103 T391767
[13:07:59] <stevemunene>	 !log deploying refinery at 1138395: Add rki.wikipedia to pageview allowlist | https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1138395 T392499
[13:08:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:03] <stashbot>	 T391767: Post-creation work for madwikisource - https://phabricator.wikimedia.org/T391767
[13:08:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:08:08] <stashbot>	 T392499: Post-creation work for rkiwiki - https://phabricator.wikimedia.org/T392499
[13:09:06] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Backport for [[gerrit:1140129|mswikisource: add Karya (Work) and Gerbang (Portal) namespaces (T392984)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:09:08] <anzx>	 Lucas_WMDE: checking
[13:09:09] <logmsgbot>	 !log stevemunene@deploy1003 Finished deploy [analytics/refinery@ea1cff2] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@ea1cff2c] (duration: 01m 35s)
[13:09:50] <anzx>	 Lucas_WMDE: looks good
[13:09:51] <logmsgbot>	 !log stevemunene@deploy1003 Started deploy [analytics/refinery@ea1cff2]: Regular analytics weekly train [analytics/refinery@ea1cff2c]
[13:09:55] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 anzx, lucaswerkmeister-wmde: Continuing with sync
[13:09:57] <Lucas_WMDE>	 great, thanks!
[13:10:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2041.codfw.wmnet
[13:10:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2041.codfw.wmnet
[13:10:33] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P75704 and previous config saved to /var/cache/conftool/dbconfig/20250430-131032-fceratto.json
[13:11:11] <XioNoX>	 !log adjust fundraising NAT policies - T392843
[13:11:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:22] <wikibugs>	 (03PS6) 10Arnaudb: gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833)
[13:11:56] <wikibugs>	 (03CR) 10Arnaudb: gerrit: switchover to gerrit1003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:12:21] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:12:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2042.codfw.wmnet
[13:13:02] <wikibugs>	 (03PS2) 10Hnowlan: mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534)
[13:13:17] <logmsgbot>	 !log stevemunene@deploy1003 Finished deploy [analytics/refinery@ea1cff2]: Regular analytics weekly train [analytics/refinery@ea1cff2c] (duration: 03m 25s)
[13:13:25] <Lucas_WMDE>	 anyone wanna +1 (some of) the changes in https://phabricator.wikimedia.org/T392819? then I could roll those out as well
[13:13:42] <wikibugs>	 (03CR) 10Hashar: gerrit: split Gerrit and Gitiles proxy pools (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139806 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[13:14:28] <wikibugs>	 (03PS1) 10Btullis: mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738)
[13:14:58] <wikibugs>	 (03CR) 10Brouberol: [C:03+1] mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[13:16:21] <logmsgbot>	 !log stevemunene@deploy1003 Started deploy [analytics/refinery@ea1cff2] (thin): Regular analytics weekly train THIN [analytics/refinery@ea1cff2c]
[13:16:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140129|mswikisource: add Karya (Work) and Gerbang (Portal) namespaces (T392984)]] (duration: 12m 10s)
[13:16:33] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-cluster
[13:16:33] <stashbot>	 T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984
[13:17:46] <logmsgbot>	 !log stevemunene@deploy1003 Finished deploy [analytics/refinery@ea1cff2] (thin): Regular analytics weekly train THIN [analytics/refinery@ea1cff2c] (duration: 01m 24s)
[13:17:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2042.codfw.wmnet
[13:18:05] <anzx>	 Lucas_WMDE: thank you for deploying, i will create a patch for defaultsearchnamespace if a local wiki member says on the phab task it's ok to add
[13:18:22] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10780239 (10ArthurPSmith) Hi - has this been done yet? I'm ready to test it on live Wi...
[13:20:01] <Lucas_WMDE>	 anzx: sounds good to me, thanks!
[13:20:14] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1230.eqiad.wmnet with reason: Maintenance
[13:20:47] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrus: re-enable completion index rebuild in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1139518 (owner: 10DCausse)
[13:20:50] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2213.codfw.wmnet with reason: Maintenance
[13:22:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:23:03] <jinxer-wm>	 FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[13:23:05] <logmsgbot>	 !log jnuche@deploy1003 Installing scap version "4.158.0" for 2 host(s)
[13:23:07] <sukhe>	 hello 
[13:23:09] <sukhe>	 !incidents
[13:23:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2042.codfw.wmnet
[13:23:09] <sirenbot>	 6073 (UNACKED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad)
[13:23:10] <sirenbot>	 6072 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[13:23:10] <sirenbot>	 6071 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[13:23:10] <sirenbot>	 6070 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[13:23:10] <sirenbot>	 6069 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad)
[13:23:11] <sirenbot>	 6068 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[13:23:14] <sukhe>	 !ack 6073
[13:23:14] <sirenbot>	 6073 (ACKED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad)
[13:23:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2042.codfw.wmnet
[13:23:33] <sukhe>	 wow, an actual runbook link!
[13:23:39] <godog>	 sukhe: I'm in a meeting, though that paged earlier today too, https://phabricator.wikimedia.org/T392989
[13:23:44] <sukhe>	 ah thanks godog 
[13:24:05] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: 10Gehel)
[13:24:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2043.codfw.wmnet
[13:24:42] <anzx>	 Lucas_WMDE: i forgot, could you run namespacedupes 
[13:24:54] <logmsgbot>	 !log jnuche@deploy1003 Installation of scap version "4.158.0" completed for 2 hosts
[13:25:40] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T392806)', diff saved to https://phabricator.wikimedia.org/P75705 and previous config saved to /var/cache/conftool/dbconfig/20250430-132539-fceratto.json
[13:25:55] <Lucas_WMDE>	 oh right
[13:25:56] <Lucas_WMDE>	 one sec
[13:25:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1183.eqiad.wmnet with reason: Maintenance
[13:26:04] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1183 (T392806)', diff saved to https://phabricator.wikimedia.org/P75706 and previous config saved to /var/cache/conftool/dbconfig/20250430-132604-fceratto.json
[13:26:13] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libcap2 [puppet] - 10https://gerrit.wikimedia.org/r/1140180
[13:26:14] <Raine>	 sukhe: are you on it or should I? Don't want to step on each other's toes 
[13:26:15] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm now, thanks! I think `ssh_allowed_hosts` can be reduced to the production host only. But that's something we can test after the switc" [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:26:21] <anzx>	 there are no pages, but just to be safe
[13:26:33] <wikibugs>	 (03CR) 10Hashar: gerrit: lower connections to Gitiles from 25 to 4 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1139807 (https://phabricator.wikimedia.org/T392467) (owner: 10Hashar)
[13:26:46] <Lucas_WMDE>	 the script says there are four :P
[13:26:51] <Lucas_WMDE>	 and 106 links
[13:26:54] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] Add new PDU's in row E and F [puppet] - 10https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: 10Papaul)
[13:27:39] <Lucas_WMDE>	 !log lucaswerkmeister-wmde@deploy1003 ~ $ mwscript-k8s --comment=T392984 --follow -- namespaceDupes mswikisource --fix | tee T392984
[13:27:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:27:43] <stashbot>	 T392984: Add new namespaces to Malay Wikisource - https://phabricator.wikimedia.org/T392984
[13:27:47] <sukhe>	 Raine: thanks, I am trying to figure out what to do here
[13:28:07] <Raine>	 sukhe: ack, same here :D 
[13:28:30] <Lucas_WMDE>	 anzx: done, thanks for the reminder!
[13:28:42] <wikibugs>	 (03CR) 10Jelto: [C:03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[13:29:04] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[13:29:08] <sukhe>	 Raine: it's trending downwards at least
[13:29:30] <anzx>	 Lucas_WMDE: thanks, i didn't check for English names so i thought no pages were present
[13:29:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+2] Add library hint for libcap2 [puppet] - 10https://gerrit.wikimedia.org/r/1140180 (owner: 10Muehlenhoff)
[13:29:48] <Lucas_WMDE>	 (I did a dry run of cleanupTitles just to check but there’s nothing to do there)
[13:29:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2043.codfw.wmnet
[13:31:02] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet
[13:31:25] <wikibugs>	 (03CR) 10Bking: [C:03+1] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson)
[13:32:07] <wikibugs>	 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10780327 (10tappof) @wiki_willy, please take a look at {T387866}. This will change how the row label is set and will also fix t...
[13:32:58] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[13:32:58] <Raine>	 sukhe: yeah, it is right now, though the last 7 days have been a bit higher than before
[13:32:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T392806)', diff saved to https://phabricator.wikimedia.org/P75707 and previous config saved to /var/cache/conftool/dbconfig/20250430-133258-fceratto.json
[13:33:03] <jinxer-wm>	 RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore1005 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore1005 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[13:33:08] <sukhe>	 ok then I guess :)
[13:33:11] <sukhe>	 I will update the task
[13:33:13] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2229.codfw.wmnet with reason: Maintenance
[13:33:15] <Raine>	 we've had this 2ish weeks ago and I'm not sure what the followup was
[13:33:27] <Raine>	 (other than creating the runbook)
[13:33:29] <sukhe>	 yeah godo.g shared the task above
[13:33:36] <Raine>	 ah, right, thank you
[13:34:57] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+1] "LGTM. While these definitions will no longer be needed for dashboarding purposes after merging https://gerrit.wikimedia.org/r/c/operations" [puppet] - 10https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: 10Papaul)
[13:35:11] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2043.codfw.wmnet
[13:35:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2043.codfw.wmnet
[13:35:45] <urandom>	 !log invoking `nodetool garbagecollect` on sessionstore1004 — T392989, T390514
[13:35:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:49] <stashbot>	 T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989
[13:36:32] <sukhe>	 ah thanks urandom :)
[13:36:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2044.codfw.wmnet
[13:36:49] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet
[13:37:53] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate mediamoderation-hourlyScan to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139415 (https://phabricator.wikimedia.org/T385799) (owner: 10Hnowlan)
[13:38:30] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet
[13:40:52] <urandom>	 sukhe: it's mainly diagnostic, I'm not sure if it will do anything (and this isn't The Way™ even if it does)
[13:41:08] <sukhe>	 well, certainly better you running these vs at least me I guess :)
[13:42:51] <logmsgbot>	 jmm@cumin2002 drain-node (PID 3838318) is awaiting input
[13:43:21] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[13:43:31] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[13:44:00] <godog>	 indeed, thanks urandom !
[13:44:20] <hnowlan>	 inflatador: I see you're reenabling an mw cronjob - if you were feeling adventurous and have more to do, we're currently migrating stuff to mw-cron (https://wikitech.wikimedia.org/wiki/Mw-cron_jobs)
[13:44:25] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet
[13:45:34] <inflatador>	 hnowlan Let me get a ticket started for that. We are migrating everything to OpenSearch, so this might be a good time to evaluate the mw-cron stuff more broadly
[13:45:44] <hnowlan>	 inflatador: nice! 
[13:46:15] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet
[13:47:23] <moritzm>	 FYI, kubestagemaster2003 and ml-etcd2003 will briefly go down for a Ganeti reboot (but with no impact given etcd redundancy guarantees)
[13:47:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2044.codfw.wmnet
[13:48:06] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P75708 and previous config saved to /var/cache/conftool/dbconfig/20250430-134805-fceratto.json
[13:48:19] <wikibugs>	 (03CR) 10David Caro: [C:03+1] "LGTM" [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah)
[13:48:31] <wikibugs>	 (03CR) 10David Caro: [C:03+1] admin: Temporarily remove Taavi's access [puppet] - 10https://gerrit.wikimedia.org/r/1140171 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah)
[13:48:34] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:49:22] <icinga-wm>	 PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:49:30] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[13:49:58] <icinga-wm>	 PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100%
[13:50:05] <wikibugs>	 (03PS1) 10David Caro: admin: temporarily remove dcaro access [puppet] - 10https://gerrit.wikimedia.org/r/1140181 (https://phabricator.wikimedia.org/T393000)
[13:50:10] <wikibugs>	 (03CR) 10Btullis: [C:03+2] mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[13:50:10] <Lucas_WMDE>	 guess I’m not getting a review for those patches in this window
[13:50:15] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[13:50:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:50:34] <icinga-wm>	 RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms
[13:50:46] <icinga-wm>	 RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.67 ms
[13:50:57] <wikibugs>	 (03CR) 10Majavah: [C:03+2] common: Temporarily remove some keys [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah)
[13:51:11] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10780399 (10Jelto) Thank you @MatthewVernon for digging into the logs. It was a bit tricky for me to find the actual path in the bu...
[13:51:36] <wikibugs>	 (03Merged) 10jenkins-bot: common: Temporarily remove some keys [homer/public] - 10https://gerrit.wikimedia.org/r/1140173 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah)
[13:51:44] <wikibugs>	 (03Merged) 10jenkins-bot: mediawiki-dumps-legacy: Rename the ssh private key secret [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140178 (https://phabricator.wikimedia.org/T390738) (owner: 10Btullis)
[13:52:21] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet
[13:52:42] <jinxer-wm>	 FIRING: [8x] ProbeDown: Service kubestagemaster2003:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:52:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2044.codfw.wmnet
[13:52:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2044.codfw.wmnet
[13:54:41] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-cluster
[13:55:16] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan)
[13:55:55] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: migrate ancientpages jobs to Kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140151 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan)
[13:57:07] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10780450 (10Eevans) >>! In T391544#10745829, @Eevans wrote: > >  [ ... ] > > The goal would be to make this a...
[13:57:47] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: EnotifNotifyJob: Forward-compat for wmf.27 jobs [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988)
[13:57:53] <wikibugs>	 (03CR) 10Papaul: [C:03+2] Add new PDU's in row E and F [puppet] - 10https://gerrit.wikimedia.org/r/1139966 (https://phabricator.wikimedia.org/T387504) (owner: 10Papaul)
[13:58:42] <jinxer-wm>	 FIRING: [10x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1400)
[14:00:49] <moritzm>	 !log installing libcap2 security updates
[14:00:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:01:54] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[14:02:02] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] P:mediawiki::maintenance::purge_loginnotify: migrate to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1139923 (https://phabricator.wikimedia.org/T388536) (owner: 10Scott French)
[14:02:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki2003.codfw.wmnet
[14:02:58] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] helm: remove duplicate alternatives::select entry [puppet] - 10https://gerrit.wikimedia.org/r/1140164 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto)
[14:03:12] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183', diff saved to https://phabricator.wikimedia.org/P75709 and previous config saved to /var/cache/conftool/dbconfig/20250430-140312-fceratto.json
[14:04:12] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2209.codfw.wmnet with reason: Maintenance
[14:06:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki2003.codfw.wmnet
[14:06:38] <mszabo>	 jouncebot: nowandnext
[14:06:38] <jouncebot>	 For the next 0 hour(s) and 53 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1400)
[14:06:38] <jouncebot>	 In 2 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700)
[14:08:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:09:03] <jinxer-wm>	 FIRING: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[14:09:24] <Raine>	 !incidents
[14:09:24] <sirenbot>	 You're not allowed to perform this action.
[14:09:24] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "That looks good, but given that confd is an internal tool, let's maybe also create a task to fix the underlying behaviour? Ideally confd s" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway)
[14:09:30] <Raine>	 oh XD
[14:09:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host rpki1001.eqiad.wmnet
[14:09:51] <godog>	 I'm in a meeting, though see T392989
[14:09:52] <stashbot>	 T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989
[14:09:55] <godog>	 !incidents
[14:09:56] <sirenbot>	 6074 (UNACKED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2006:9100 node /srv codfw)
[14:09:56] <sirenbot>	 6073 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1005:9100 node /srv eqiad)
[14:09:56] <sirenbot>	 6072 (RESOLVED)  Primary inbound port utilisation over 80%  (paged) network noc (cr2-eqiad.wikimedia.org)
[14:09:56] <sirenbot>	 6071 (RESOLVED)  Primary outbound port utilisation over 80%  (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[14:09:56] <sirenbot>	 6070 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2005:9100 node /srv codfw)
[14:09:57] <sirenbot>	 6069 (RESOLVED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore1006:9100 node /srv eqiad)
[14:09:57] <sirenbot>	 6068 (RESOLVED)  [2x] ProbeDown sre (ml-serve-ctrl2001:6443 probes/custom codfw)
[14:10:01] <godog>	 !ack 6074
[14:10:01] <sirenbot>	 6074 (ACKED)  SessionStoreDiskSpaceUtilizationTooHigh sessionstore data-persistence (/dev/mapper/vg0-srv ext4 sessionstore2006:9100 node /srv codfw)
[14:10:11] <godog>	 cc urandom :(
[14:11:27] <urandom>	 sorry...
[14:11:29] <urandom>	 working on it
[14:11:37] <urandom>	 maybe we can create a silence
[14:11:51] <moritzm>	 !log failover Ganeti master in codfw to ganeti2021
[14:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:12:00] <sukhe>	 thanks folks
[14:12:14] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:12:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rpki1001.eqiad.wmnet
[14:12:24] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:13:52] <Raine>	 urandom: I can create the silence, how long do you think?
[14:14:03] <jinxer-wm>	 RESOLVED: SessionStoreDiskSpaceUtilizationTooHigh: Session storage disk space utilization on sessionstore2006 is too high #page - https://wikitech.wikimedia.org/wiki/SessionStorage/Runbook#High_Storage_Utilization - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=sessionstore2006 - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreDiskSpaceUtilizationTooHigh
[14:14:05] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0)
[14:14:30] <icinga-wm>	 PROBLEM - ganeti-wconfd running on ganeti2032 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 110 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[14:14:35] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be2001.codfw.wmnet
[14:16:07] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: move mostrevisions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534)
[14:17:29] <wikibugs>	 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10780516 (10MatthewVernon) >>! In T391544#10749423, @Eevans wrote: >>>! In T391544#10746698, @MatthewVernon wr...
[14:17:46] <wikibugs>	 (03CR) 10Hnowlan: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan)
[14:18:17] <wikibugs>	 (03PS1) 10Ssingh: wikimedia-ech.org: update zone file and add A/AAAA records [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378)
[14:18:19] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1183 (T392806)', diff saved to https://phabricator.wikimedia.org/P75710 and previous config saved to /var/cache/conftool/dbconfig/20250430-141819-fceratto.json
[14:18:38] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1185.eqiad.wmnet with reason: Maintenance
[14:18:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:18:46] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1185 (T392806)', diff saved to https://phabricator.wikimedia.org/P75711 and previous config saved to /var/cache/conftool/dbconfig/20250430-141845-fceratto.json
[14:19:10] * Raine creating a silence for the SessionStore alerts for 12h
[14:19:10] <wikibugs>	 (03CR) 10Ssingh: "sigh, wrong file 😞" [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:20:30] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2001.codfw.wmnet
[14:21:39] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be2002.codfw.wmnet
[14:22:15] <wikibugs>	 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10780556 (10Jhancock.wm)
[14:22:43] <wikibugs>	 (03PS1) 10Máté Szabó: popup: Fix target user name for expired temporary account links [extensions/IPInfo] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140191 (https://phabricator.wikimedia.org/T393002)
[14:23:05] <wikibugs>	 (03Abandoned) 10GergesShamon: Change Arabic Wikipedia tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139912 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon)
[14:23:10] <wikibugs>	 (03Abandoned) 10GergesShamon: Change Arabic Wikipedia logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1139932 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon)
[14:23:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by mszabo@deploy1003 using scap backport" [extensions/IPInfo] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140191 (https://phabricator.wikimedia.org/T393002) (owner: 10Máté Szabó)
[14:23:54] <wikibugs>	 (03PS2) 10Ssingh: wikimedia-ech.org: update zone file and add A/AAAA records [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378)
[14:24:38] <wikibugs>	 10ops-codfw, 06SRE, 06cloud-services-team, 06DC-Ops, 10decommission-hardware: decommission cloudlb2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T392686#10780574 (10Jhancock.wm) @Andrew i can't use the offline script in netbox. looks like some of the interfaces are a little too complicated for...
[14:25:40] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:26:37] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T392806)', diff saved to https://phabricator.wikimedia.org/P75712 and previous config saved to /var/cache/conftool/dbconfig/20250430-142636-fceratto.json
[14:26:44] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:27:15] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2002.codfw.wmnet
[14:28:29] <wikibugs>	 (03CR) 10Ssingh: "From the durum host:" [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:28:32] <logmsgbot>	 !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet
[14:29:33] <moritzm>	 !log installing ruby2.7 security updates
[14:29:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:31:32] <wikibugs>	 (03PS1) 10GergesShamon: [arwiki] Change logo and tagline with sync wordmark [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858)
[14:31:44] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015 (10RobH) 03NEW
[14:32:14] <wikibugs>	 10ops-codfw, 06DC-Ops, 06serviceops: Q4:rack/setup/install build2003.codfw.wmnet - https://phabricator.wikimedia.org/T393015#10780630 (10RobH)
[14:32:56] <wikibugs>	 (03PS1) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016)
[14:33:41] <wikibugs>	 (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[14:33:47] <wikibugs>	 (03PS1) 10AikoChou: ml-services: add edit-check-cpu isvc for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195
[14:34:48] <logmsgbot>	 !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet
[14:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:35:34] <wikibugs>	 (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 30 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140193 (https://phabricator.wikimedia.org/T392858) (owner: 10GergesShamon)
[14:35:56] <wikibugs>	 (03Merged) 10jenkins-bot: popup: Fix target user name for expired temporary account links [extensions/IPInfo] (wmf/1.44.0-wmf.27) - 10https://gerrit.wikimedia.org/r/1140191 (https://phabricator.wikimedia.org/T393002) (owner: 10Máté Szabó)
[14:36:24] <logmsgbot>	 !log mszabo@deploy1003 Started scap sync-world: Backport for [[gerrit:1140191|popup: Fix target user name for expired temporary account links (T393002)]]
[14:36:29] <stashbot>	 T393002: IPInfo: IPInfo popup for expired temporary accounts is not working - https://phabricator.wikimedia.org/T393002
[14:37:09] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10780656 (10fnegri)
[14:39:30] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device ssw1-e1-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[14:39:37] <wikibugs>	 (03CR) 10Bking: [C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson)
[14:39:42] <wikibugs>	 (03PS10) 10Federico Ceratto: Add switchover cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810)
[14:39:45] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson)
[14:39:52] <wikibugs>	 (03CR) 10Federico Ceratto: "Small update." [cookbooks] - 10https://gerrit.wikimedia.org/r/1129904 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto)
[14:39:55] <wikibugs>	 (03PS3) 10Ebernhardson: Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592)
[14:40:00] <wikibugs>	 (03CR) 10Bking: [C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson)
[14:40:02] <wikibugs>	 (03CR) 10Bking: [V:03+2 C:03+2] Revert^2 "Update opensearch-madvise call for version 0.2" [puppet] - 10https://gerrit.wikimedia.org/r/1139888 (https://phabricator.wikimedia.org/T390592) (owner: 10Ebernhardson)
[14:40:29] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1236.eqiad.wmnet with reason: Maintenance
[14:40:47] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2218.codfw.wmnet with reason: Maintenance
[14:40:52] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] "in the long term I'm guessing we should include wikimedia-ech.org as part of our unified cert and serve this from the CDN or as a one-off " [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:40:56] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] mw::maintenance: move mostrevisions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan)
[14:41:44] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P75713 and previous config saved to /var/cache/conftool/dbconfig/20250430-144144-fceratto.json
[14:41:52] <wikibugs>	 (03CR) 10Ssingh: "Yes, that's correct. When we get to that, we should certainly include it there." [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:42:04] <wikibugs>	 (03CR) 10Ssingh: [C:03+2] wikimedia-ech.org: update zone file and add A/AAAA records [dns] - 10https://gerrit.wikimedia.org/r/1140189 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh)
[14:42:11] <logmsgbot>	 !log sukhe@dns1004 START - running authdns-update
[14:42:53] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Backport for [[gerrit:1140191|popup: Fix target user name for expired temporary account links (T393002)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:42:58] <stashbot>	 T393002: IPInfo: IPInfo popup for expired temporary accounts is not working - https://phabricator.wikimedia.org/T393002
[14:44:03] <logmsgbot>	 !log mszabo@deploy1003 mszabo: Continuing with sync
[14:44:16] <wikibugs>	 (03PS1) 10Bking: cirrussearch: fix typo in systemd timer resource [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592)
[14:44:27] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) (owner: 10Bking)
[14:44:30] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device ssw1-e1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[14:44:41] <logmsgbot>	 !log sukhe@dns1004 END - running authdns-update
[14:45:57] <urandom>	 !log invoking `nodetool garbagecollect` on sessionstore2004 — T390514, T392989
[14:46:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:46:02] <stashbot>	 T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989
[14:47:10] <wikibugs>	 (03CR) 10Bking: [C:03+2] opensearch: drop minimum_master_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1125478 (owner: 10DCausse)
[14:47:28] <wikibugs>	 (03CR) 10Ebernhardson: [C:03+1] cirrussearch: fix typo in systemd timer resource [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) (owner: 10Bking)
[14:47:30] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: admin: temporarily remove ssh key for aborrero [puppet] - 10https://gerrit.wikimedia.org/r/1140200
[14:47:38] <wikibugs>	 (03CR) 10Bking: [C:03+2] cirrussearch: fix typo in systemd timer resource [puppet] - 10https://gerrit.wikimedia.org/r/1140199 (https://phabricator.wikimedia.org/T390592) (owner: 10Bking)
[14:48:06] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: temporarily remove ssh key for aborrero [puppet] - 10https://gerrit.wikimedia.org/r/1140200 (owner: 10Arturo Borrero Gonzalez)
[14:48:16] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10780708 (10RobH) New remote hands entered to get this fixed: Case Order #01053614    > Directions for remote hands to repair our link between cr3 an...
[14:48:20] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] mw::maintenance: move mostrevisions jobs to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1140188 (https://phabricator.wikimedia.org/T388534) (owner: 10Hnowlan)
[14:49:12] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: admin: temporarily remove ssh key for aborrero [puppet] - 10https://gerrit.wikimedia.org/r/1140200
[14:49:31] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device lsw1-e1-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[14:50:47] <logmsgbot>	 !log mszabo@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140191|popup: Fix target user name for expired temporary account links (T393002)]] (duration: 14m 22s)
[14:50:52] <stashbot>	 T393002: IPInfo: IPInfo popup for expired temporary accounts is not working - https://phabricator.wikimedia.org/T393002
[14:51:50] <wikibugs>	 (03PS2) 10Fabfur: haproxykafka: service unit brought by deb package [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016)
[14:52:45] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:53:41] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[14:53:53] <wikibugs>	 (03CR) 10Arnaudb: [C:03+1] "Done" [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[14:53:58] <wikibugs>	 (03PS4) 10Arnaudb: gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833)
[14:54:12] <moritzm>	 !log installing werkzeug security updates
[14:54:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:54:31] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device lsw1-e1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[14:55:01] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply
[14:55:12] <logmsgbot>	 !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply
[14:55:25] <arnaudb>	 hello, I'll be switching Gerrit over in 20min (15:15 UTC), operation should take a few minutes, apologies for the temporary unavailability, I'll post any relevant update here
[14:56:19] <wikibugs>	 (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195 (owner: 10AikoChou)
[14:56:52] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P75715 and previous config saved to /var/cache/conftool/dbconfig/20250430-145651-fceratto.json
[14:57:17] <jinxer-wm>	 FIRING: [25x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:43] <jinxer-wm>	 FIRING: [31x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:46] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1162.eqiad.wmnet with reason: Maintenance
[14:59:04] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: Maintenance
[14:59:33] <urandom>	 !log invoking `nodetool garbagecollect` on sessionstore[2005-2006].codfw.wmnet,sessionstore[1005-1006].eqiad.wmnet — T390514, T392989
[14:59:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:59:38] <stashbot>	 T392989: SessionStoreDiskSpaceUtilizationTooHigh brief spike - https://phabricator.wikimedia.org/T392989
[15:01:14] <wikibugs>	 (03CR) 10JHathaway: "Yeah, I am not super happy with the less holistic approach in this patch. However, I don't think a confd change is likely, given the seman" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway)
[15:01:23] <moritzm>	 !log installing postgresql-15 security updates
[15:01:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:02:17] <jinxer-wm>	 FIRING: [46x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:02:24] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[15:03:21] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] haproxykafka: service unit brought by deb package (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1140194 (https://phabricator.wikimedia.org/T393016) (owner: 10Fabfur)
[15:03:42] <jinxer-wm>	 FIRING: [54x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:03:47] <wikibugs>	 (03CR) 10AikoChou: [C:03+2] ml-services: add edit-check-cpu isvc for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195 (owner: 10AikoChou)
[15:05:41] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: add edit-check-cpu isvc for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1140195 (owner: 10AikoChou)
[15:05:44] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[15:06:06] <wikibugs>	 (03CR) 10Herron: [C:03+1] "nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1140135 (https://phabricator.wikimedia.org/T392994) (owner: 10Filippo Giunchedi)
[15:06:43] <hashar>	 jouncebot: nowandnext
[15:06:44] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 53 minute(s)
[15:06:44] <jouncebot>	 In 1 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700)
[15:07:04] <wikibugs>	 (03CR) 10Hashar: [C:03+2] EnotifNotifyJob: Forward-compat for wmf.27 jobs [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński)
[15:07:17] <jinxer-wm>	 FIRING: [68x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:07:34] <wikibugs>	 (03CR) 10Hashar: [C:03+2] "I am +2ing this now to get CI to kick. I will deploy it after Gerrit has been switched over to another server." [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński)
[15:08:42] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job routinator in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:08:42] <jinxer-wm>	 FIRING: [70x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:11:36] <wikibugs>	 (03Merged) 10jenkins-bot: EnotifNotifyJob: Forward-compat for wmf.27 jobs [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński)
[15:11:59] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T392806)', diff saved to https://phabricator.wikimedia.org/P75717 and previous config saved to /var/cache/conftool/dbconfig/20250430-151158-fceratto.json
[15:12:16] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1200.eqiad.wmnet with reason: Maintenance
[15:12:17] <jinxer-wm>	 FIRING: [91x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:12:23] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1200 (T392806)', diff saved to https://phabricator.wikimedia.org/P75718 and previous config saved to /var/cache/conftool/dbconfig/20250430-151222-fceratto.json
[15:12:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Ack, thanks for the additional context" [puppet] - 10https://gerrit.wikimedia.org/r/1139893 (owner: 10JHathaway)
[15:14:08] <arnaudb>	 Will start gerrit switchover in 1 min
[15:15:11] * arnaudb starts
[15:15:17] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: switchover to gerrit1003 [dns] - 10https://gerrit.wikimedia.org/r/1137106 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[15:15:30] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] gerrit: switchover to gerrit1003 [puppet] - 10https://gerrit.wikimedia.org/r/1137107 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb)
[15:15:49] <logmsgbot>	 !log arnaudb@dns1004 START - running authdns-update
[15:16:32] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.gerrit.failover from gerrit2002.wikimedia.org to gerrit1003.wikimedia.org
[15:17:17] <jinxer-wm>	 FIRING: [115x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:32] <wikibugs>	 06SRE, 10SRE-swift-storage, 10Ceph, 06collaboration-services, and 2 others: Migrate gitlab storage to apus (also: backups from S3?) - https://phabricator.wikimedia.org/T378922#10780806 (10MatthewVernon) Two thoughts - first, sorry, I was rebooting all the things today because of T392804 which //shouldn't//...
[15:18:43] <jinxer-wm>	 FIRING: [118x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:19:36] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device lsw1-e3-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:20:06] <Krinkle>	 !log Removed @joaquin (former staff) from https://www.npmjs.com/settings/wikimedia/members
[15:20:09] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T392806)', diff saved to https://phabricator.wikimedia.org/P75719 and previous config saved to /var/cache/conftool/dbconfig/20250430-152007-fceratto.json
[15:20:10] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:24] <moritzm>	 !log installing ucf security updates
[15:20:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:20:59] <Krinkle>	 !log Removed @nrayio (former staff [[User:NRay (WMF)]]) from https://www.npmjs.com/settings/wikimedia/members
[15:21:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:49] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:21:54] <mutante>	 !log gerrit failover in progress
[15:21:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:22:17] <jinxer-wm>	 FIRING: [141x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:22:45] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:23:33] <HouseOfM>	 Is gerrit outage planned?
[15:23:43] <jinxer-wm>	 FIRING: [147x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:24:26] <mutante>	 HouseOfM: yes, it is planned
[15:24:36] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device lsw1-e3-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:24:43] <HouseOfM>	 mutante: Thanks :)
[15:25:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:27:14] <mutante>	 any alert relating to "something git pull" is indirect alerting about gerrit failover
[15:27:17] <jinxer-wm>	 FIRING: [153x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:28:42] <jinxer-wm>	 FIRING: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:29:13] <logmsgbot>	 !log arnaudb@dns1004 END - running authdns-update
[15:29:26] <logmsgbot>	 arnaudb@cumin1002 failover (PID 963080) is awaiting input
[15:30:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:30:45] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Possible frdb2004 hardware failure. - https://phabricator.wikimedia.org/T392579#10780837 (10Jgreen) 05Open→03Resolved >>! In T392579#10777103, @Jhancock.wm wrote: > @Jgreen reseated all the connections to the backplane. server came up. I checked...
[15:33:18] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.10 point update - https://phabricator.wikimedia.org/T389034#10780879 (10MoritzMuehlenhoff)
[15:33:45] <arnaudb>	 puppet agent running on hosts, ETA 3 to 5 minutes
[15:34:36] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device ssw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:35:16] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P75720 and previous config saved to /var/cache/conftool/dbconfig/20250430-153516-fceratto.json
[15:36:45] <icinga-wm>	 PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:37:30] <logmsgbot>	 arnaudb@cumin1002 failover (PID 963080) is awaiting input
[15:37:35] <icinga-wm>	 RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.197 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:38:39] <logmsgbot>	 !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.gerrit.failover (exit_code=97) from gerrit2002.wikimedia.org to gerrit1003.wikimedia.org
[15:38:42] <jinxer-wm>	 RESOLVED: [5x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:39:36] <jinxer-wm>	 FIRING: [2x] Emergency syslog message: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:40:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:43:25] <hashar>	 jouncebot: refresh
[15:43:26] <jouncebot>	 I refreshed my knowledge about deployments.
[15:43:28] <hashar>	 jouncebot: nowandnext
[15:43:28] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 16 minute(s)
[15:43:28] <jouncebot>	 In 1 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700)
[15:43:32] <hashar>	 I will deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1140184
[15:43:38] <hashar>	 for the train
[15:44:36] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device lsw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[15:50:24] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P75722 and previous config saved to /var/cache/conftool/dbconfig/20250430-155023-fceratto.json
[15:50:31] <wikibugs>	 (03PS1) 10Mforns: Add file and filetypes tables to the mediawiki-not-history sqoop [puppet] - 10https://gerrit.wikimedia.org/r/1139115 (https://phabricator.wikimedia.org/T389800)
[15:52:14] <wikibugs>	 (03CR) 10Bking: "Adding Reuven to list of reviewers per IRC conversation. This is a pretty old patch, and we wanna make sure we don't end up breaking envoy" [puppet] - 10https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: 10Ebernhardson)
[15:53:21] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply
[15:53:34] <logmsgbot>	 !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply
[15:53:47] <wikibugs>	 (03PS1) 10Elukey: icinga: skip services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848)
[15:55:09] <wikibugs>	 (03CR) 10Elukey: "I'll wait for Riccardo to review this but the basic functionality should be there!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey)
[15:56:29] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack, 06Traffic, 13Patch-For-Review: Spicerack's Icinga module should provide a way to skip specific services in sub-optimal but desired state - https://phabricator.wikimedia.org/T392848#10780975 (10elukey) I filed https://gerrit.wikimedia.org/r/c/operati...
[15:58:58] <icinga-wm>	 PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 6/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[15:59:58] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:02:47] <logmsgbot>	 !log aikochou@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[16:02:52] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:02:52] <icinga-wm>	 RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:04:11] <wikibugs>	 06SRE, 06[Archived]Wikidata Dev Team, 10Prod-Kubernetes, 06Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10781006 (10ArthurPSmith) Since it's well after 10:00 UTC I gave it a try - problem is...
[16:04:36] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:05:21] <wikibugs>	 (03CR) 10CI reject: [V:04-1] icinga: skip services in wait_for_optimal if needed [software/spicerack] - 10https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: 10Elukey)
[16:05:31] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T392806)', diff saved to https://phabricator.wikimedia.org/P75724 and previous config saved to /var/cache/conftool/dbconfig/20250430-160530-fceratto.json
[16:05:49] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1210.eqiad.wmnet with reason: Maintenance
[16:05:56] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1210 (T392806)', diff saved to https://phabricator.wikimedia.org/P75725 and previous config saved to /var/cache/conftool/dbconfig/20250430-160556-fceratto.json
[16:07:35] <wikibugs>	 (03PS7) 10Krinkle: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery)
[16:07:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[16:07:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery)
[16:09:24] <wikibugs>	 (03Merged) 10jenkins-bot: missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[16:09:26] <wikibugs>	 (03Merged) 10jenkins-bot: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: 10Pppery)
[16:09:36] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device lsw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:10:09] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate mostlinked job to kubernetes [puppet] - 10https://gerrit.wikimedia.org/r/1140214 (https://phabricator.wikimedia.org/T388534)
[16:11:08] <arnaudb>	 Gerrit is working properly, _but_ we had a slight hiccup on replication where we'll have to dig a bit further to see why we have trouble detecting "replica status" efficiently after the switchover. Consequently, puppet-agent has been disabled on gerrit2002 (the replica), to ensure it stays in a consistent state until we figure this situation
[16:11:08] <arnaudb>	 out. Anyway, thank you all for your patience
[16:11:19] <Krinkle>	 hashar: "16:09:31 The following are unexpected commits pulled from origin for /srv/mediawiki-staging/php-1.44.0-wmf.25:"
[16:11:37] <Krinkle>	 Oops I see your comment now there, "I am +2ing this now to get CI to kick. I will deploy it after Gerrit has been switched over to another server."
[16:11:58] <hashar>	 yes
[16:12:03] <Krinkle>	 I'll roll it out now
[16:12:04] <hashar>	 sorry I am going to deploy it right now
[16:12:10] <hashar>	 I was waiting for the Gerrit maintenance to be completed
[16:12:19] <hashar>	 arnaudb: make sure to `!log` it :)
[16:12:25] <Krinkle>	 I have a backport command running with two config patches
[16:12:28] <logmsgbot>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]]
[16:12:30] <arnaudb>	 oh you're right! sorry
[16:12:35] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T392806)', diff saved to https://phabricator.wikimedia.org/P75726 and previous config saved to /var/cache/conftool/dbconfig/20250430-161234-fceratto.json
[16:12:36] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[16:12:36] <stashbot>	 T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994
[16:12:36] <arnaudb>	 !log Gerrit maintenance over
[16:12:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:12:51] <hashar>	 oh there is another backport
[16:12:57] <hashar>	 Krinkle: you are deploying right?
[16:13:14] <Krinkle>	 I pressed "y" to the unexpected commit yes
[16:13:23] <Krinkle>	 I don't know what happens if I press No.
[16:13:41] <hashar>	 we roll back to last known good version: the perl based wiki software
[16:13:43] <Krinkle>	 does it undo that patch or exclude it? or does it abort everything and prompt the next person? or does it forget after one person sees the prompt?
[16:13:54] <hashar>	 I don't know to be fair. I guess it will just stop there
[16:14:05] <hashar>	 leave the unexpected commit in place for a human to investigate
[16:14:27] <hashar>	 until I guess someone pulls the patch
[16:14:38] <hashar>	 well I don't know
[16:14:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10781063 (10MMiller_WMF) I am Madalina's manager and I approve her access to this data and these tools.
[16:14:56] <Krinkle>	 hm.. well that depends on how it discovers it. it's not obvious to me that once it pulls it down it will know next time that it is still new/undeployed.
[16:15:22] <Krinkle>	 This could use better documentation and/or explicit prompt what it will do.
[16:16:05] <dancy>	 If you answer no, the backport is cancelled.
[16:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:17:19] <dancy>	 And, indeed, if you run another operation after that, the new operation won't re-complain about the prior unexpected commit.
[16:18:37] <Krinkle>	 if it uses git fetch and git rebase (which the manual steps used to recommend, instead of git pull) then it would presumably be able to discover it next time as well.
[16:18:43] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[16:18:53] <logmsgbot>	 !log krinkle@deploy1003 krinkle, pppery: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[16:18:56] <Krinkle>	 i.e. it will never have applied it to mediawiki-staging until after saying 'y' yes
[16:18:57] <wikibugs>	 (03PS1) 10Hnowlan: mw::maintenance: migrate all remaining general updatequerypages jobs [puppet] - 10https://gerrit.wikimedia.org/r/1140216 (https://phabricator.wikimedia.org/T388534)
[16:18:59] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[16:19:00] <stashbot>	 T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994
[16:19:31] <wikibugs>	 (03CR) 10VolkerE: [C:03+1] missing.php: Redesign to match current error pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[16:19:36] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device lsw1-f3-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:19:39] <Krinkle>	 dancy: assuming it doesn't work that way, how does it work? I guess based on the security patch logic, it rebuilds the tree somewhere in a temporary space, but then how does it find that something is new?
[16:21:09] <wikibugs>	 (03PS1) 10AOkoth: aphlict: revert eqiad host to active [puppet] - 10https://gerrit.wikimedia.org/r/1140217 (https://phabricator.wikimedia.org/T392128)
[16:22:18] <logmsgbot>	 !log krinkle@deploy1003 krinkle, pppery: Continuing with sync
[16:23:12] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[16:23:32] <dancy>	 Krinkle: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/backport.py?ref_type=heads#L1024 is the code that does the checking.  Looks like it still does use `git fetch`, so I think we're still good.
[16:23:50] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[16:24:36] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device lsw1-f3-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:24:49] <Krinkle>	 dancy: hm.. so maybe it will prompt the next person as well!
[16:24:56] <wikibugs>	 (03PS1) 10AOkoth: wmnet: revert active aphlict host to eqiad [dns] - 10https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128)
[16:25:06] <dancy>	 Yes, I would expect so after reevaluating.  
[16:25:09] <Krinkle>	 unless they pass that previous-unexpected change to scap-backport as argument, I guess.
[16:25:15] <dancy>	 Right
[16:25:29] <Krinkle>	 I might try that sometime.
[16:26:11] <wikibugs>	 (03CR) 10Hashar: [C:03+2] "There was a Gerrit maintenance that started immediately after the patch got merged.  It is now being deployed by Timo as part of another d" [core] (wmf/1.44.0-wmf.25) - 10https://gerrit.wikimedia.org/r/1140184 (https://phabricator.wikimedia.org/T392988) (owner: 10Bartosz Dziewoński)
[16:26:56] <wikibugs>	 10ops-codfw, 06SRE, 06DC-Ops: codfw expansion infrastructure racking task - https://phabricator.wikimedia.org/T387504#10781132 (10Papaul)
[16:27:34] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029 (10RobH) 03NEW
[16:27:42] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P75727 and previous config saved to /var/cache/conftool/dbconfig/20250430-162741-fceratto.json
[16:27:53] <wikibugs>	 (03PS1) 10Lucas Werkmeister (WMDE): Temporarily remove lucaswerkmeister-wmde SSH key [puppet] - 10https://gerrit.wikimedia.org/r/1140219
[16:28:04] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10781146 (10RobH) a:03BTullis Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the new se...
[16:28:09] <dancy>	 Krinkle: I'll also test in train-dev later and let you know what the results are.
[16:28:17] <Krinkle>	 ack
[16:28:21] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T393029#10781155 (10RobH)
[16:28:55] <logmsgbot>	 !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]] (duration: 16m 26s)
[16:29:01] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[16:29:01] <stashbot>	 T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994
[16:32:36] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030 (10RobH) 03NEW
[16:32:59] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10781214 (10RobH) a:03BTullis Please note the workflow for racking tasks has changed this fiscal year, and we now require the puppet updates from the sub-team receiving the ne...
[16:34:56] <icinga-wm>	 PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/3 UP : 4 v2 P2P interfaces vs. 3 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:35:33] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "Thank you for using the `primary` / `replica` semantic.  The license looks good to me, at least it is not re-licensing to Apache 2.0 :]" [puppet] - 10https://gerrit.wikimedia.org/r/1137840 (https://phabricator.wikimedia.org/T392212) (owner: 10Dzahn)
[16:35:54] <icinga-wm>	 RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[16:38:16] <hashar>	 Krinkle: let me know when the patches are deployed and I will proceed with the train
[16:38:25] <Krinkle>	 hashar: it's done.
[16:38:31] <hashar>	 awesome
[16:39:12] <wikibugs>	 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Q4:rack/setup/install an-test-master100[34] - https://phabricator.wikimedia.org/T393030#10781244 (10RobH)
[16:41:31] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] admin: temporarily remove dcaro access [puppet] - 10https://gerrit.wikimedia.org/r/1140181 (https://phabricator.wikimedia.org/T393000) (owner: 10David Caro)
[16:41:47] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] admin: Temporarily remove Taavi's access [puppet] - 10https://gerrit.wikimedia.org/r/1140171 (https://phabricator.wikimedia.org/T393000) (owner: 10Majavah)
[16:42:01] <wikibugs>	 (03CR) 10Tiziano Fogli: [C:03+2] admin: move jiji to ops-limited [puppet] - 10https://gerrit.wikimedia.org/r/1140145 (https://phabricator.wikimedia.org/T392998) (owner: 10Effie Mouzeli)
[16:42:48] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210', diff saved to https://phabricator.wikimedia.org/P75728 and previous config saved to /var/cache/conftool/dbconfig/20250430-164248-fceratto.json
[16:42:58] <hashar>	 I am running the train
[16:43:21] <wikibugs>	 (PS4) TrainBranchBot: missing.php: Redesign to match current error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[16:43:21] <wikibugs>	 (CR) TrainBranchBot: [C:+2] missing.php: Redesign to match current error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[16:43:21] <wikibugs>	 (PS1) TrainBranchBot: group1 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222)
[16:43:22] <wikibugs>	 (CR) TrainBranchBot: [C:+2] group1 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[16:44:13] <Krinkle>	 that's weird. https://gerrit.wikimedia.org/r/1138922 was already deployed?
[16:44:58] <wikibugs>	 (CR) Bking: [C:+2] refactor(opensearch): use Netbox to get rack / row information [puppet] - https://gerrit.wikimedia.org/r/1137272 (https://phabricator.wikimedia.org/T391392) (owner: Gehel)
[16:45:22] <Krinkle>	 hashar: oh, gerrit lost some events during the switch I guess?
[16:45:32] <hashar>	 hmmm maybe
[16:45:36] <Krinkle>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138508 is back to how it was an hour ago
[16:45:39] <Krinkle>	 missing the latest rebase and merge
[16:45:57] <hashar>	 but it is merged?
[16:45:57] <wikibugs>	 SRE, serviceops-radar: mwmaint1002 is out of disk space - https://phabricator.wikimedia.org/T392834#10781287 (Dzahn) imho we should have something that effectively notifies a team (automatic task, email) so next time we don't need to rely on manually created tickets by users
[16:45:58] <Krinkle>	 what does that mean for the underlying git-repo?
[16:46:12] <wikibugs>	 (Merged) jenkins-bot: missing.php: Redesign to match current error pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1138922 (https://phabricator.wikimedia.org/T113114) (owner: Krinkle)
[16:46:14] <wikibugs>	 (Merged) jenkins-bot: group1 to 1.44.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[16:46:33] <Krinkle>	 it was merged yes, but then the server switched to a version that is in the past
[16:46:39] <Krinkle>	 so now prod and gerrit are forked
[16:46:48] <Krinkle>	 my local also deviates in its remote reflection
[16:46:51] <hashar>	 holy shit 
[16:47:26] <dancy>	 yikes
[16:47:27] <Krinkle>	 was gerrit meant to be in read-only mode during this maintenance? Or was it meant to catch up afterward.
[16:47:28] <hashar>	 mutante: arnaudb: sobanski: thcipriani: so looks like the Gerrit switch over caused repos to rollback in time
[16:48:09] <thcipriani>	 hashar: blarg.
[16:48:14] <Krinkle>	 both git repos and gerrit db are back in time, i.e. comments and votes also missing
[16:49:22] <thcipriani>	 this has happened before. The last time it happened we didn't have the --delete flag for rsync which caused git to look at loose refs rather than packed refs
[16:49:36] <jinxer-wm>	 FIRING: Emergency syslog message: Alert for device lsw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:49:46] <hashar>	 that is why I was asking why we did an rsync of the git repos :)
[16:49:48] <hashar>	 but then
[16:49:49] <thcipriani>	 Krinkle: do you have an example for investigation?
[16:50:01] <hashar>	 if a change is merged on the primary, it should be replicated to the secondary
[16:50:06] <hashar>	 err to the replica
[16:50:10] <Krinkle>	 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138508
[16:50:19] <thcipriani>	 thanks
[16:50:20] <Krinkle>	 this change was rebased by me, and then +2'ed/merged by train bot
[16:50:32] <Krinkle>	 irc backscroll has the receipts of these events
[16:50:51] <Krinkle>	 apparently part of the database says it is merged i.e. the relation chain 
[16:51:10] <Krinkle>	 but Gitiles and the change page are back in time
[16:51:49] <hashar>	 we also have some kind of audit trail in /var/log/zuul/zuul.log.* on contint1002.wikimedia.org
[16:52:17] <thcipriani>	 confirmed
[16:52:25] <thcipriani>	 well wait
[16:52:33] <hashar>	 there is no gate-and-submit there
[16:52:50] <hashar>	 zuul.log.2025-04-24:2025-04-24 20:51:37,503 INFO zuul.IndependentPipelineManager: Reporting item <QueueItem 0x7f5eb805ec50 for <Change 0x7f5eb9e90f90 1138508,6> in test-prio>, actions: [<GerritReporter connection: gerrit://gerrit>]
[16:53:07] <hashar>	 which matches the last comment on Gerrit
[16:53:24] <taavi>	 github mirrors have also stopped updating
[16:53:24] <hashar>	 somewhere above I see: Change <Change 0x7f5eb9e90f90 1138508,6> depends on changes [<Change 0x7f5eae494950 1138922,1>, <Change 0x7f5e53405090 1138921,1>]
[16:54:04] <hashar>	 and that is about it
[16:54:17] <hashar>	 so I don't think that change ever got a +2 or a merge. At least according to CI logs
[16:54:36] <jinxer-wm>	 RESOLVED: Emergency syslog message: Device lsw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message   - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[16:54:39] <thcipriani>	 ok, now confirmed: git for-each-ref shows fad64230b40174d5f90f2095e1eb0f8561c96421 commit refs/changes/08/1138508/meta same as refs/changes/08/1138508/meta file on disk that is from 2025-04-24
[16:54:43] <wikibugs>	 (CR) BCornwall: [C:+1] wmnet: revert active aphlict host to eqiad [dns] - https://gerrit.wikimedia.org/r/1140218 (https://phabricator.wikimedia.org/T392128) (owner: AOkoth)
[16:55:10] <thcipriani>	 meanwhile grep 1138508 packed-refs -> 90167f46357593f19e0a5ad8fea8469b0a66a018 refs/changes/08/1138508/meta 
[16:55:15] <thcipriani>	 and that's a change from today
[16:56:09] <thcipriani>	 this is the same as: https://phabricator.wikimedia.org/T236114
[16:56:45] <hashar>	 :-(
[16:56:51] * hashar has PTSD
[16:56:54] <hashar>	 :B
[16:57:25] <thcipriani>	 the problem with that one is it went on for a day before we found the issue, this has been going on an hour. If we stop everything now we could lose an hour of work.
[16:57:39] <thcipriani>	 or at least we'd have less to correct if we can correct it easily
[16:57:49] <thcipriani>	 ^ arnaudb mutante 
[16:57:55] <logmsgbot>	 !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1210 (T392806)', diff saved to https://phabricator.wikimedia.org/P75729 and previous config saved to /var/cache/conftool/dbconfig/20250430-165754-fceratto.json
[16:58:05] <hashar>	 so we stop Gerrit to prevent further divergence?
[16:58:15] <logmsgbot>	 !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1245.eqiad.wmnet with reason: Maintenance
[16:59:26] <sobanski>	 Catching up on the scroll back
[16:59:45] <hashar>	 the thing I don't get is Timo claims 1138508 got merged but I don't see those events in the Zuul logs
[16:59:55] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise and row/rack awareness T391392 T390100 - bking@cumin2002 - T390100
[17:00:01] <stashbot>	 T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392
[17:00:01] <stashbot>	 T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1700)
[17:00:18] <Krinkle>	 (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery)
[17:00:19] <Krinkle>	 (Merged) jenkins-bot: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery)
[17:00:24] <Krinkle>	 !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1138922|missing.php: Redesign to match current error pages (T113114)]], [[gerrit:1138508|missing.php: Check for auth.wikimedia.org domain on missing.php (T391994)]]
[17:00:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:00:29] <stashbot>	 T113114: Make all wiki-facing error pages consistent - https://phabricator.wikimedia.org/T113114
[17:00:30] <hashar>	 but I do see the backports in https://sal.toolforge.org/production?p=0&q=1138508&d=
[17:00:30] <stashbot>	 T391994: Auth.wikimedia.org displays wrong message about host header - https://phabricator.wikimedia.org/T391994
[17:00:33] <hashar>	 ah ok 
[17:00:44] * hashar digs Zuul logs more
[17:01:06] <wikibugs>	 (PS1) AOkoth: vrts: add junk queue count and remove mobile queue [puppet] - https://gerrit.wikimedia.org/r/1140207
[17:01:17] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10781355 (ArthurPSmith) Hmm, it seems to have resolved now. Maybe I'll try another o...
[17:01:18] <hashar>	  1138508,7> in gate-and-submit>, actions: [<GerritReporter connection: gerrit://gerrit>]
[17:01:22] <arnaudb>	 coming back
[17:01:34] <thcipriani>	 if you do: sudo -u gerrit2 git log 90167f46357593f19e0a5ad8fea8469b0a66a018
[17:01:39] <hashar>	 because I was grepping zuul.log.2025-04* and not zuul.log
[17:01:49] <thcipriani>	 in /srv/gerrit/git/operations/mediawiki-config.git on gerrit1003
[17:01:53] <hashar>	 Krinkle: data loss confirmed, thank you :)
[17:02:03] <thcipriani>	 you can see "Change has been successfully rebased and submitted"
[17:02:06] <thcipriani>	 for that change
[17:02:17] <hashar>	 thcipriani: do we shut down Gerrit right now?
[17:02:24] <thcipriani>	 arnaudb: can you confirm if the rsync we ran had the --delete flag?
[17:02:45] <thcipriani>	 if not, yes, we should shut down gerrit
[17:03:50] <sobanski>	 I'd say let's shut down anyway and we can dig into it afterwards
[17:04:00] <thcipriani>	 ^ sounds good, let's do it
[17:04:19] <hashar>	 and disable Puppet
[17:04:34] <arnaudb>	 yes thcipriani 
[17:04:38] <arnaudb>	 confirmed
[17:04:53] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[17:05:11] <arnaudb>	 both instances are shutting down
[17:05:24] <arnaudb>	 we'll investigate from here, let's jump back on the call we were in if you want to
[17:05:29] <thcipriani>	 sounds good
[17:05:31] <hashar>	 cause Puppet will bring the systemd unit back up
[17:05:39] <hashar>	 there is some ensure => running 
[17:05:48] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[17:09:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: git_pull_charts.service on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:10:42] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job gerrit-metrics in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:13:19] <thcipriani>	 !log gerrit incident following switchover https://phabricator.wikimedia.org/T393034
[17:13:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:14:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:14:42] <hashar>	 We have shut down Gerrit to prevent further issues
[17:15:42] <jinxer-wm>	 FIRING: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:16:31] <hashar>	 Krinkle: funnily I have a browser tab that shows the parent change  and it shows the change you mentioned as merged
[17:16:36] <hashar>	 so I did not even have to look in the logs
[17:19:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:21:00] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/blunderbuss: apply
[17:22:21] <logmsgbot>	 !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/blunderbuss: apply
[17:22:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[17:24:38] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:27:51] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise and row/rack awareness T391392 T390100 - bking@cumin2002 - T390100
[17:27:57] <stashbot>	 T391392: Use profile::netbox::host instead of regex.yaml for Cirrussearch rack/row awareness - https://phabricator.wikimedia.org/T391392
[17:27:57] <stashbot>	 T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[17:31:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.network.provision for device lsw1-f1-codfw.mgmt.codfw.wmnet
[17:32:00] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[17:32:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise pkg as I forgot last time T390100 - bking@cumin2002 - T390100
[17:33:01] <stashbot>	 T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[17:34:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:34:34] <sukhe>	 on-calls are standing by if we can help.
[17:34:36] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:39:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:53] <arnaudb>	 thanks sukhe investigation is still in progress to figure out the root cause
[17:43:02] <arnaudb>	 feel free to highlight me directly for live update 
[17:43:06] <wikibugs>	 SRE, [Archived]Wikidata Dev Team, Prod-Kubernetes, Traffic, and 5 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10781527 (ArthurPSmith) Nope - new property frozen also as soon as I added an exampl...
[17:46:15] <wikibugs>	 SRE-OnFire, Cassandra, MediaWiki-Platform-Team (Radar), Sustainability (Incident Followup): sessionstore workload observability - https://phabricator.wikimedia.org/T392182#10781572 (Tgr) MediaWiki version: {T393038}
[17:49:15] <wikibugs>	 ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042 (RobH) NEW
[17:49:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:49:36] <wikibugs>	 ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10781610 (RobH)
[17:50:50] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10781616 (wiki_willy) Hi @MatthewVernon - I still have some CapEx underrun, so we could bump up the refresh...
[17:53:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:54:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:56:12] <taavi>	 where can one follow along the gerrit incident? the usual IRC channels I'd guessed are all relatively silent
[17:57:00] <arnaudb>	 mping you the meet room
[17:57:26] <arnaudb>	 https://docs.google.com/document/d/1kh6vYGLdGIEpN-EsUaXb6u82gNW5TvBkoI_yCPjB6_8/edit?tab=t.0
[17:57:35] <arnaudb>	 here is the doc
[17:57:48] <arnaudb>	 we're still investigating around the root cause
[18:00:04] <jouncebot>	 hashar and dduvall: Deploy window MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T1800)
[18:00:19] <hashar>	 the train is blocked 
[18:00:27] <hashar>	 due to Gerrit being frozen
[18:00:35] <logmsgbot>	 !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host vrts1003.eqiad.wmnet
[18:00:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new madvise pkg as I forgot last time T390100 - bking@cumin2002 - T390100
[18:00:45] <stashbot>	 T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[18:02:40] <wikibugs>	 ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044 (RobH) NEW
[18:03:01] <wikibugs>	 ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config I 1P Test Host - https://phabricator.wikimedia.org/T393044#10781683 (RobH)
[18:07:22] <logmsgbot>	 !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1003.eqiad.wmnet
[18:14:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:15:38] <arnaudb>	 Small update: we're still narrowing down what happened to make sure service interruption won't occur again after it is considered fixed.
[18:16:39] <wikibugs>	 ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045 (RobH) NEW
[18:17:00] <wikibugs>	 ops-codfw, DC-Ops: Q4:rack/setup/install Dell Config K 1P Test Host - https://phabricator.wikimedia.org/T393045#10781718 (RobH)
[18:18:19] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar: Requesting access to analytics-privatedata-users for madalina - https://phabricator.wikimedia.org/T392893#10781722 (Madalina) @tappof I had an access is denied error before. Everything seems ok now, thank you!
[18:19:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:31:49] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10781750 (RobH)
[18:33:34] <AwesomeAasim>	 What happened to Gerrit?
[18:33:44] <arnaudb>	 there is an incident in progress
[18:34:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:34:40] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:34:51] <jinxer-wm>	 FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:39:02] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops, Patch-For-Review: ms-be1060 crashed, then went into an exception in the uEFI pre-boot environment - https://phabricator.wikimedia.org/T392796#10781758 (RobH) Please note we have two open procurement requests for this host.  Please do NOT discuss pric...
[18:39:51] <jinxer-wm>	 RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures
[18:48:48] <jinxer-wm>	 FIRING: PuppetFailure: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[18:49:43] <sukhe>	 ^ a bunch of stuff is failing because of the Gerrit thing
[18:49:52] <sukhe>	 https://puppetboard.wikimedia.org/nodes?status=failed
[18:50:39] <sukhe>	 so nothing to worry about as such, it's expected
[18:50:56] <denisse>	 Yes, I've silenced alerts for the o11y hosts.
[18:51:01] <sukhe>	 thanks!
[18:51:07] <denisse>	 sukhe: Do you think I should extend the silence for all hosts?
[18:51:18] <denisse>	 Puppet is going to fail on many hosts that can't communicate with Gerrit.
[18:51:31] <sukhe>	 denisse: I would say no I think, since they are not paging plus we miss some other related alerts in case we don't remove the silence
[18:51:46] <sukhe>	 if something pages (I doubt anything does?) we can do it
[18:52:00] <denisse>	 SGTM, yes, the alert is not paging.
[18:52:52] <denisse>	 I only silenced the Puppet Failure alerts for the o11y hosts.
[18:53:15] <sukhe>	 yeah
[18:53:48] <jinxer-wm>	 FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:39] <wikibugs>	 ops-magru, DC-Ops, Observability-Metrics, Patch-For-Review, SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10781822 (wiki_willy) Thanks @tappof, that sounds good!   >>! In T387231#10780327, @tappof wrote: > @wiki_willy, please take...
[19:06:40] <logmsgbot>	 pt1979@cumin2002 provision (PID 4091458) is awaiting input
[19:14:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:16:58] <icinga-wm>	 PROBLEM - gerrit process on gerrit2002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[19:18:31] <jinxer-wm>	 FIRING: [4x] ProbeDown: Service gerrit2002:29418 has failed probes (tcp_gerrit_ssh_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:19:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:19:44] <logmsgbot>	 !log arnaudb@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on gerrit[1003,2002-2003].wikimedia.org with reason: Debugging
[19:22:13] <wikibugs>	 SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10781850 (ABran-WMF)
[19:28:11] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1015.eqiad.wmnet with OS bullseye
[19:28:22] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781871 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye
[19:28:43] <jinxer-wm>	 FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:31:00] <wikibugs>	 SRE-swift-storage, Commons: Multiple files returns "File not found: /v1/AUTH_mw/wikipedia-commons-local-public" error instead of showing correct file - https://phabricator.wikimedia.org/T393049#10781886 (Pppery)
[19:43:42] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1015.eqiad.wmnet with reason: host reimage
[19:47:07] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1015.eqiad.wmnet with reason: host reimage
[19:51:26] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host ms-fe1016.eqiad.wmnet with OS bullseye
[19:51:36] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781919 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host ms-fe1016.eqiad.wmnet with OS bullseye
[19:52:06] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781922 (VRiley-WMF)
[19:57:15] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.network.provision (exit_code=0) for device lsw1-f1-codfw.mgmt.codfw.wmnet
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T2000).
[20:00:04] <jouncebot>	 _Gerges: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:56] <_Gerges>	 Here
[20:03:14] <taavi>	 i believe the current gerrit outage means that the window is cancelled
[20:03:56] <hashar>	 yes sorry we can not deploy currently
[20:04:20] <hashar>	 we had some data issue with our Gerrit instance and operations/mediawiki-config has been hit
[20:04:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:04:27] <hashar>	 _Gerges: ^
[20:04:54] <hashar>	 _Gerges: it is better to schedule later. I am not sure whether many people will be around tomorrow though due to May 1st
[20:05:26] <_Gerges>	 OK 
[20:07:19] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe1016.eqiad.wmnet with reason: host reimage
[20:09:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:10:20] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:10:54] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe1016.eqiad.wmnet with reason: host reimage
[20:11:04] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:11:04] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1015.eqiad.wmnet with OS bullseye
[20:11:11] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781949 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-fe1015.eqiad.wmnet with OS bullseye completed: - ms-fe1015 (**WARN**...
[20:14:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:16:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:18:43] <jinxer-wm>	 FIRING: [3x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld
[20:19:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:24:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:28:47] <logmsgbot>	 !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:28:48] <jinxer-wm>	 FIRING: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:29:25] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:30:39] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[20:30:42] <jinxer-wm>	 RESOLVED: [4x] JobUnavailable: Reduced availability for job gerrit in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:31:52] <logmsgbot>	 vriley@cumin1002 reimage (PID 1240354) is awaiting input
[20:33:08] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - vriley@cumin1002"
[20:33:09] <logmsgbot>	 !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe1016.eqiad.wmnet with OS bullseye
[20:33:15] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781993 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host ms-fe1016.eqiad.wmnet with OS bullseye completed: - ms-fe1016 (**PASS**...
[20:33:48] <jinxer-wm>	 RESOLVED: [2x] PuppetFailure: Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:34:25] <jinxer-wm>	 RESOLVED: [4x] SystemdUnitFailed: helm-chartctl-package-all.service on chartmuseum1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:36:17] <wikibugs>	 ops-eqiad, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe101[56] - https://phabricator.wikimedia.org/T388886#10781997 (VRiley-WMF) Open→Resolved This is complete.
[20:43:19] <wikibugs>	 (PS1) Bvibber: Fix localization for validation errors checking tabular data [extensions/Chart] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140228 (https://phabricator.wikimedia.org/T389126)
[20:43:45] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in codfw - https://phabricator.wikimedia.org/T393053 (RobH) NEW
[20:44:17] <wikibugs>	 (PS1) Bvibber: Check for content validity before extracting license [extensions/JsonConfig] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140229 (https://phabricator.wikimedia.org/T389125)
[20:45:37] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in codfw - https://phabricator.wikimedia.org/T393053#10782038 (RobH) a:akosiaris Alex,  We didn't get racking details on the ordering task T392715, so we need to get them from you before the hosts arrive.  Please populate the task de...
[20:45:40] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in codfw - https://phabricator.wikimedia.org/T393053#10782042 (RobH)
[20:47:38] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/JsonConfig] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140229 (https://phabricator.wikimedia.org/T389125) (owner: Bvibber)
[20:47:55] <wikibugs>	 (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, May 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/Chart] (wmf/1.44.0-wmf.27) - https://gerrit.wikimedia.org/r/1140228 (https://phabricator.wikimedia.org/T389126) (owner: Bvibber)
[20:52:29] <wikibugs>	 ops-eqiad, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-eqiad - https://phabricator.wikimedia.org/T393053#10782066 (RobH)
[20:53:15] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054 (RobH) NEW
[20:53:19] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054#10782084 (RobH)
[20:53:57] <wikibugs>	 ops-codfw, DC-Ops, serviceops: Q4:rack/setup/install (4) aux-k8 in ops-codfw - https://phabricator.wikimedia.org/T393054#10782085 (RobH) a:akosiaris Alex,  We didn't get racking details on the ordering task T392714, so we need to get them from you before the hosts arrive.  Please populate the tas...
[21:00:05] <jouncebot>	 Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T2100)
[21:11:10] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new knn plugin - bking@cumin2002 - T390100
[21:11:12] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new knn plugin - bking@cumin2002 - T390100
[21:11:15] <stashbot>	 T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[21:12:20] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100
[21:14:39] <wikibugs>	 (PS7) Hashar: missing.php: Check for auth.wikimedia.org domain on missing.php [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery)
[21:15:12] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100
[21:15:13] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster relforge: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100
[21:16:15] <wikibugs>	 (PS1) Hashar: Review access change [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140241
[21:16:54] <wikibugs>	 (PS2) Hashar: Allow force push to reconstruct repo [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140241 (https://phabricator.wikimedia.org/T393034)
[21:17:04] <wikibugs>	 (CR) Hashar: [V:+2 C:+2] Allow force push to reconstruct repo [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140241 (https://phabricator.wikimedia.org/T393034) (owner: Hashar)
[21:18:51] <wikibugs>	 (PS1) Hashar: Revert "Allow force push to reconstruct repo" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140242 (https://phabricator.wikimedia.org/T393034)
[21:18:52] <wikibugs>	 (CR) Pppery: "(Noting for the record: this change was approved and deployed by Krinkle using scap backport about 5 hours ago, however the data about tha" [mediawiki-config] - https://gerrit.wikimedia.org/r/1138508 (https://phabricator.wikimedia.org/T391994) (owner: Pppery)
[21:19:01] <wikibugs>	 (CR) Hashar: [V:+2 C:+2] Revert "Allow force push to reconstruct repo" [mediawiki-config] (refs/meta/config) - https://gerrit.wikimedia.org/r/1140242 (https://phabricator.wikimedia.org/T393034) (owner: Hashar)
[21:22:08] <wikibugs>	 (CR) Hashar: "Due to a split brain between Gerrit instances (T393034) this commit was merged against a wrong version of the branch but has never been de" [mediawiki-config] - https://gerrit.wikimedia.org/r/1140206 (https://phabricator.wikimedia.org/T386222) (owner: TrainBranchBot)
[21:22:47] <jinxer-wm>	 FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate wikifeeds.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[21:23:27] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10782143 (RobH) a:RobH→cmooney @cmooney :    >        "Created by: mmariscalmata The following has been completed: >  > Retrieve package #1...
[21:27:07] <wikibugs>	 ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10782147 (cmooney) Thanks @RobH.  It looks good so far, this is the graph we need to keep an eye on:  https://grafana.wikimedia.org/goto/SVEEkIbHR...
[21:27:28] <hashar>	 !log Deployment server: reset /srv/mediawiki-staging to 7a3327588 / https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1138508 # T393034
[21:27:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:27:33] <stashbot>	 T393034: Investigate out of date refs following gerrit switchover - https://phabricator.wikimedia.org/T393034
[21:28:06] <hashar>	 the other deployment server might need a sync
[21:29:07] <wikibugs>	 (PS1) Hashar: (DO NOT SUBMIT) test CI [mediawiki-config] - https://gerrit.wikimedia.org/r/1140247
[21:30:55] <hashar>	 so zuul-merger seems happy hopefully
[21:31:59] <wikibugs>	 (Abandoned) Hashar: (DO NOT SUBMIT) test CI [mediawiki-config] - https://gerrit.wikimedia.org/r/1140247 (owner: Hashar)
[21:36:23] <wikibugs>	 (PS1) Dzahn: gerrit: remove gerrit2002 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250
[21:37:03] <wikibugs>	 (CR) Thcipriani: [C:+1] gerrit: remove gerrit2002 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (owner: Dzahn)
[21:37:50] <wikibugs>	 (PS2) Dzahn: gerrit: remove gerrit2002 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (https://phabricator.wikimedia.org/T236114)
[21:38:55] <wikibugs>	 (PS3) Dzahn: gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (https://phabricator.wikimedia.org/T236114)
[21:39:11] <wikibugs>	 (CR) Dzahn: [C:+2] gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts [puppet] - https://gerrit.wikimedia.org/r/1140250 (https://phabricator.wikimedia.org/T236114) (owner: Dzahn)
[21:39:17] <wikibugs>	 (CR) Cwhite: [C:+2] statsd: remove ferm rule for statsd port 8125 [puppet] - https://gerrit.wikimedia.org/r/1135076 (https://phabricator.wikimedia.org/T228380) (owner: Cwhite)
[21:40:05] <mutante>	 cwhite: we have a merge conflict
[21:40:09] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply new wmf-opensearch-search-plugins version 1.3.20-4~bullseye - bking@cumin2002 - T390100
[21:40:15] <stashbot>	 T390100: Build and deploy updated opensearch plugins deb - https://phabricator.wikimedia.org/T390100
[21:40:27] <cwhite>	 uh oh
[21:40:52] <mutante>	 can you merge both?
[21:40:57] <mutante>	 or just yours? either is fine
[21:41:06] <cwhite>	 mine just completed
[21:41:15] <mutante>	 cool, I see it. thanks
[21:41:24] <cwhite>	 <3
[21:46:57] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1015 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[21:47:43] <jinxer-wm>	 FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[21:48:59] <icinga-wm>	 RECOVERY - gerrit process on gerrit2002 is OK: PROCS OK: 1 process with regex args ^/usr/lib/jvm/java-17-openjdk-amd64/bin/java .*-jar /var/lib/gerrit2/review_site/bin/gerrit.war daemon -d /var/lib/gerrit2/review_site https://wikitech.wikimedia.org/wiki/Gerrit
[21:51:58] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:52:08] <jinxer-wm>	 FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:52:19] <wikibugs>	 (CR) Umherirrender: "recheck after gerrit failover" [software/spicerack] - https://gerrit.wikimedia.org/r/1140208 (https://phabricator.wikimedia.org/T392848) (owner: Elukey)
[21:52:24] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:52:30] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:52:35] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:52:46] <jinxer-wm>	 FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:53:02] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:53:07] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:53:31] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service gerrit2002:29418 has failed probes (tcp_gerrit_ssh_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gerrit2002:29418 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:53:42] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:57:03] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:08] <jinxer-wm>	 RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2025:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:19] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:30] <jinxer-wm>	 RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:35] <jinxer-wm>	 FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:46] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:57:57] <jinxer-wm>	 RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:58:07] <jinxer-wm>	 FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[21:58:17] <jinxer-wm>	 RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[22:00:04] <jouncebot>	 Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250430T2200)
[22:01:58] <jinxer-wm>	 RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:02:09] <jinxer-wm>	 RESOLVED: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[22:18:46] <wikibugs>	 (CR) BCornwall: [C:+2] Temporarily remove lucaswerkmeister-wmde SSH key [puppet] - https://gerrit.wikimedia.org/r/1140219 (owner: Lucas Werkmeister (WMDE))
[22:18:49] <wikibugs>	 (PS1) Dzahn: Revert "gerrit: remove gerrit2002 and gerrit2003 from ssh_allowed_hosts" [puppet] - https://gerrit.wikimedia.org/r/1140251
[22:35:27] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: prometheus-postfix-exporter.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:04:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to <Superset> for <SCampos-WMF> - https://phabricator.wikimedia.org/T393066 (SCampos-WMF) NEW
[23:16:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:43] <jinxer-wm>	 FIRING: [148x] SystemdUnitFailed: curator_actions_apifeatureusage_codfw.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:41:05] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1140260
[23:41:05] <wikibugs>	 (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1140260 (owner: TrainBranchBot)
[23:51:49] <wikibugs>	 (CR) Scott French: [C:+1] mw::maintenance: migrate mostlinked job to kubernetes [puppet] - https://gerrit.wikimedia.org/r/1140214 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)
[23:52:28] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1140260 (owner: TrainBranchBot)
[23:53:14] <wikibugs>	 (CR) Scott French: [C:+1] mw::maintenance: migrate all remaining general updatequerypages jobs [puppet] - https://gerrit.wikimedia.org/r/1140216 (https://phabricator.wikimedia.org/T388534) (owner: Hnowlan)