[00:08:11] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1179291 [00:08:11] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1179291 (owner: 10TrainBranchBot) [00:31:18] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1179291 (owner: 10TrainBranchBot) [00:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [00:58:37] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:00:42] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [01:12:28] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 11m 45s) [01:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:01:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [02:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has 
failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:10:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [02:19:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [02:37:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [02:55:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [03:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [03:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - 
https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [03:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:19:59] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-c2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402043#11092720 (10phaultfinder) [04:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [04:59:33] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:05:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179141 (https://phabricator.wikimedia.org/T385482) (owner: 10Wangombe) [05:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:20:54] (03CR) 10Giuseppe Lavagetto: 
[C:03+2] varnish: also convert abuse networks to use x-provenance [puppet] - 10https://gerrit.wikimedia.org/r/1175990 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [05:26:26] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092761 (10Joe) This task seems to coalesce different issues; having said that, we've had quite a few abusers lately that used old versions of Chrome as their user agent, so we had t... [05:31:19] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092766 (10Joe) I should also add - if this has happened in the last week, that might be connected to T400119 - if users have any browser extension that modifies their user-agent, fo... [05:33:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179239 (https://phabricator.wikimedia.org/T402047) (owner: 10Dreamrimmer) [05:41:04] (03CR) 10Giuseppe Lavagetto: Remove blocked-nets from varnish (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [05:41:08] (03CR) 10Giuseppe Lavagetto: [C:03+2] Remove blocked-nets from varnish [puppet] - 10https://gerrit.wikimedia.org/r/1175991 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [05:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:59:12] 10ops-eqiad, 10Cloud-Services, 06DC-Ops: Outsdanding diff on 
cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T402157 (10ayounsi) 03NEW p:05Triage→03High The #Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/83... [06:00:08] 10ops-eqiad, 06cloud-services-team, 10Data-Services, 06DC-Ops: Outsdanding diff on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T402157#11092797 (10ayounsi) [06:00:55] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092799 (10Josve05a) I just noticed a pattern and wanted to surface it. For context: in the VRT (the volunteer helpdesk) we normally only see one such “Wikipedia inaccessible” email... [06:01:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [06:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:14:33] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:19:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire 
[06:20:12] (03CR) 10Jgiannelos: [C:03+1] mobileapps: Change max_body_size to 2mb from the 100kb default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179235 (https://phabricator.wikimedia.org/T398838) (owner: 10Arlolra) [06:22:47] 10ops-eqiad, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Outsdanding diff on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T402157#11092819 (10taavi) [06:37:13] (03PS2) 10Majavah: P:wmcs: novaproxy: Add missing IPv6 listen statements [puppet] - 10https://gerrit.wikimedia.org/r/1179254 [06:37:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:40:13] (03CR) 10Majavah: [C:03+2] P:wmcs: novaproxy: Add missing IPv6 listen statements [puppet] - 10https://gerrit.wikimedia.org/r/1179254 (owner: 10Majavah) [06:47:24] 10ops-codfw, 10ops-eqiad, 06SRE, 06DC-Ops: Fix PXE miss-configurations - https://phabricator.wikimedia.org/T396717#11092831 (10ayounsi) ` ssh cumin1003.eqiad.wmnet cumin1003:~$ sudo spicerack-shell --live ` `name=first case: >>> spicerack.redfish('ml-serve1008').get_primary_mac() Management Password: spi... 
[06:47:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:48:31] (03CR) 10Ayounsi: [C:03+1] Update Cumin aliases to handle the transition to routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179111 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [06:53:37] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:57:06] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092835 (10Joe) >>! In T402142#11092799, @Josve05a wrote: > I just noticed a pattern and wanted to surface it. For context: in the VRT (the volunteer helpdesk) we normally only see o... [07:00:05] Amir1, Urbanecm, and awight: UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T0700). Please do the needful. [07:00:05] kart_ and DreamRimmer: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:16] Here. I'll start with my patch. 
[07:00:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179141 (https://phabricator.wikimedia.org/T385482) (owner: 10Wangombe) [07:01:45] (03Merged) 10jenkins-bot: Make MT limit 80% on Welch Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179141 (https://phabricator.wikimedia.org/T385482) (owner: 10Wangombe) [07:02:14] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1179141|Make MT limit 80% on Welch Wikipedia (T385482)]] [07:02:18] T385482: Make MT limit 80% in Welsh Wikipedia - https://phabricator.wikimedia.org/T385482 [07:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [07:05:15] (03PS5) 10Giuseppe Lavagetto: varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) [07:06:23] (03PS2) 10Huei Tan: MinT:Add stream configuration and registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) [07:07:38] (03PS3) 10Huei Tan: MinT:Add stream configuration and registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) [07:08:10] (03PS4) 10Huei Tan: MinT: Add stream configuration and registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) [07:09:39] kart_: If you have some free time, could you please deploy my patch? [07:11:37] (03CR) 10Tiziano Fogli: [C:03+1] centrallog: Remove unused debug logging config [puppet] - 10https://gerrit.wikimedia.org/r/1179228 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [07:12:29] scap seems slow. [07:12:42] DreamRimmer: sure. 
I'll ping once my patch is done. [07:12:58] thanks [07:15:28] (03CR) 10Giuseppe Lavagetto: [C:03+2] varnish: stop loading netmaps [puppet] - 10https://gerrit.wikimedia.org/r/1175992 (https://phabricator.wikimedia.org/T396621) (owner: 10Giuseppe Lavagetto) [07:18:55] "Building container images" 18m passed [07:19:44] OK. moves on.. [07:24:21] !log kartik@deploy1003 kartik, wangombe: Backport for [[gerrit:1179141|Make MT limit 80% on Welch Wikipedia (T385482)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:24:26] T385482: Make MT limit 80% in Welsh Wikipedia - https://phabricator.wikimedia.org/T385482 [07:26:02] !log kartik@deploy1003 kartik, wangombe: Continuing with sync [07:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [07:38:14] (03CR) 10Fabfur: [C:03+1] ua_policy: add python-aiohttp to the libraries [puppet] - 10https://gerrit.wikimedia.org/r/1179245 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [07:39:32] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179141|Make MT limit 80% on Welch Wikipedia (T385482)]] (duration: 37m 18s) [07:39:37] T385482: Make MT limit 80% in Welsh Wikipedia - https://phabricator.wikimedia.org/T385482 [07:40:18] DreamRimmer: let's go with your patch. 
[07:40:28] (03CR) 10Fabfur: [C:03+1] ua_policy: phase 2 [puppet] - 10https://gerrit.wikimedia.org/r/1179246 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [07:40:32] yep [07:41:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179239 (https://phabricator.wikimedia.org/T402047) (owner: 10Dreamrimmer) [07:42:04] (03Merged) 10jenkins-bot: Disable NewUserMessage extension on hiwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179239 (https://phabricator.wikimedia.org/T402047) (owner: 10Dreamrimmer) [07:42:20] Emperor: ms-fe1009 has been struggling for the whole weekend, refusing connections even locally, could you take a look please? [07:42:20] !log kartik@deploy1003 Started scap sync-world: Backport for [[gerrit:1179239|Disable NewUserMessage extension on hiwiki (T402047)]] [07:42:25] T402047: Disabling NewUserMessage extension on hiwiki - https://phabricator.wikimedia.org/T402047 [07:43:41] !log T401633: creating archive index on tlwikisource, zghwiktionary, rkiwiki, minwikibooks and madwikisource [07:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:43:45] T401633: UpdateSearchIndexConfig.php fails with "Named cluster (dnsdisc) is not configured for maintenance operations" - https://phabricator.wikimedia.org/T401633 [07:43:56] (03PS1) 10Giuseppe Lavagetto: requestctl: do not protect against removal of abuse ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/1179637 [07:44:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:46:23] !log kartik@deploy1003 kartik, dreamrimmer: Backport for [[gerrit:1179239|Disable 
NewUserMessage extension on hiwiki (T402047)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:46:35] checking [07:47:08] DreamRimmer: possible to test the patch on mwdebug? I guess it is difficult or not possible. [07:47:34] looks good https://hi.wikipedia.org/w/api.php?action=query&format=json&meta=siteinfo&formatversion=2&siprop=extensions [07:48:50] Nice [07:49:20] !log kartik@deploy1003 kartik, dreamrimmer: Continuing with sync [07:50:08] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11092939 (10A_smart_kitten) >* Detailed error message the users are receiving - it is included at the bottom of the error page. Interestingly (IMO), from the screenshots shared in thi... [07:56:41] !log kartik@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179239|Disable NewUserMessage extension on hiwiki (T402047)]] (duration: 14m 21s) [07:56:46] T402047: Disabling NewUserMessage extension on hiwiki - https://phabricator.wikimedia.org/T402047 [07:56:48] DreamRimmer: Done! 
[07:57:26] kart_: Thanks for your time :) [07:58:43] (03CR) 10Giuseppe Lavagetto: [C:03+2] requestctl: do not protect against removal of abuse ipblocks [puppet] - 10https://gerrit.wikimedia.org/r/1179637 (owner: 10Giuseppe Lavagetto) [08:02:20] (03CR) 10Filippo Giunchedi: "I have changed teams, therefore removing myself from reviewers" [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [08:04:14] (03CR) 10Filippo Giunchedi: [C:03+1] P:toolforge::proxy: Collect network error reports [puppet] - 10https://gerrit.wikimedia.org/r/1178489 (https://phabricator.wikimedia.org/T400994) (owner: 10Majavah) [08:13:10] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:13:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:23:10] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:33:46] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1111 to an-backup-datanode1041 [08:34:06] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [08:34:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas 
anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:36:16] (03CR) 10Giuseppe Lavagetto: [C:03+2] ua_policy: add python-aiohttp to the libraries [puppet] - 10https://gerrit.wikimedia.org/r/1179245 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [08:36:39] (03CR) 10Giuseppe Lavagetto: [C:03+2] ua_policy: phase 2 [puppet] - 10https://gerrit.wikimedia.org/r/1179246 (https://phabricator.wikimedia.org/T400119) (owner: 10Giuseppe Lavagetto) [08:37:33] (03PS1) 10Btullis: Add another 10 TB to the legacy dumps pvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179639 (https://phabricator.wikimedia.org/T352650) [08:40:02] (03CR) 10Stevemunene: [C:03+1] "looks good, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179639 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [08:40:08] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1111 to an-backup-datanode1041 - btullis@cumin1003" [08:40:30] (03CR) 10Btullis: [C:03+2] Add another 10 TB to the legacy dumps pvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179639 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [08:42:21] (03Merged) 10jenkins-bot: Add another 10 TB to the legacy dumps pvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179639 (https://phabricator.wikimedia.org/T352650) (owner: 10Btullis) [08:42:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1111 to an-backup-datanode1041 - btullis@cumin1003" [08:42:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) 
[08:42:54] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1041 on all recursors [08:42:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1041 on all recursors [08:42:58] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1041 [08:44:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1041 [08:44:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1111 to an-backup-datanode1041 [08:44:53] !log hashar@deploy1003 Started deploy [integration/docroot@af6fb25]: dev: Simplify router.php a bit [08:45:02] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/services/mediawiki-dumps-legacy: apply [08:45:06] !log hashar@deploy1003 Finished deploy [integration/docroot@af6fb25]: dev: Simplify router.php a bit (duration: 00m 13s) [08:45:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/mediawiki-dumps-legacy: apply [08:45:42] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1110 to an-backup-datanode1040 [08:46:02] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [08:52:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1110 to an-backup-datanode1040 - btullis@cumin1003" [08:54:08] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1179642 (https://phabricator.wikimedia.org/T402171) [08:56:02] btullis@cumin1003 rename (PID 2548339) is awaiting input [09:03:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1110 to an-backup-datanode1040 - btullis@cumin1003" 
[09:03:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:03:33] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1040 on all recursors [09:03:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1040 on all recursors [09:03:37] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1040 [09:05:09] (03CR) 10DCausse: [C:03+1] [eventstreams] Bump version 0.18.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178550 (https://phabricator.wikimedia.org/T390140) (owner: 10TChin) [09:05:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1040 [09:05:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1110 to an-backup-datanode1040 [09:12:31] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1109 to an-backup-datanode1039 [09:12:52] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:15:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1046.eqiad.wmnet with OS bookworm [09:18:34] btullis@cumin1003 rename (PID 2551518) is awaiting input [09:19:11] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-backup-datanode1046.eqiad.wmnet with OS bookworm [09:19:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1046.eqiad.wmnet with OS bookworm [09:20:23] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1109 to an-backup-datanode1039 - btullis@cumin1003" [09:21:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1109 to 
an-backup-datanode1039 - btullis@cumin1003" [09:21:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:21:26] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1039 on all recursors [09:21:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1039 on all recursors [09:21:31] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1039 [09:24:23] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1039 [09:25:02] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1109 to an-backup-datanode1039 [09:28:06] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1108 to an-backup-datanode1038 [09:28:27] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:30:29] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1107 to an-backup-namenode1037 [09:33:03] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1108 to an-backup-datanode1038 - btullis@cumin1003" [09:33:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1108 to an-backup-datanode1038 - btullis@cumin1003" [09:33:22] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:33:22] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1038 on all recursors [09:33:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1038 on all recursors [09:33:26] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1038 [09:33:49] !log btullis@cumin1003 START 
- Cookbook sre.dns.netbox [09:34:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1038 [09:34:44] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11093380 (10XXBlackburnXx) >>! In T402142#11092766, @Joe wrote: > I should also add - if this has happened in the last week, that might be connected to T400119 @Joe This seems to be... [09:34:46] (03CR) 10Muehlenhoff: [C:03+2] Update Cumin aliases to handle the transition to routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179111 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:35:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1108 to an-backup-datanode1038 [09:36:43] (03CR) 10Muehlenhoff: [C:03+2] Add ganeti7004 to the routed Ganeti cluster in magru [puppet] - 10https://gerrit.wikimedia.org/r/1178887 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [09:38:07] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1106 to an-backup-datanode1036 [09:38:20] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1107 to an-backup-namenode1037 - btullis@cumin1003" [09:38:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1107 to an-backup-namenode1037 - btullis@cumin1003" [09:38:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:38:25] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1037 on all recursors [09:38:26] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:38:28] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
an-backup-namenode1037 on all recursors [09:38:29] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1037 [09:40:07] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1037 [09:40:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1107 to an-backup-namenode1037 [09:42:37] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1105 to an-backup-namenode1035 [09:42:50] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1106 to an-backup-datanode1036 - btullis@cumin1003" [09:43:10] FIRING: GanetiBGPDown: BGP session down between ganeti7004 and asw1-b4-magru - group - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=asw1-b4-magru:9804&var-bgp_group=&var-bgp_neighbor=ganeti7004 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [09:43:33] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:43:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1106 to an-backup-datanode1036 - btullis@cumin1003" [09:43:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:43:43] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1036 on all recursors [09:43:46] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1036 on all recursors [09:43:46] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1036 [09:45:24] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 
2:00:00 on an-backup-datanode1046.eqiad.wmnet with reason: host reimage [09:47:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1036 [09:47:42] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1106 to an-backup-datanode1036 [09:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:48:10] FIRING: [3x] GanetiBGPDown: BGP session down between ganeti7004 and asw1-b4-magru - group - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [09:48:51] Question: Is Mediawiki/1.43.0 possibly the restbase instance? Is 128.138.70.66 us? Or is this someone just pretending to be mediawiki? 
https://turnilo.wikimedia.org/#webrequest_sampled_live/4/N4IgbglgzgrghgGwgLzgFwgewHYgFwgDmAThACYgA0408SqGOAygKZobaFT7YwILUMAWxbIcLfCACiaAMYB6AKoAVAMJUQAMwgI0LYtzwBtUGgCeABwkFhE6sRabJAfWe2NtgAr6sZQyZAyGGJ0LFwCTwAmABENKD0LfABaAEZBS2sQBHQWeJAAXwBdfMpTDMl40k4NBycCYIhnC3QACw04WUZwkFkcNDgIbG5qMEQY [09:48:51] XPwjEHkydDh5WQh+rvkQQupsTDR8TUQoFiLqKAskNH8yqwq0KsINMggRIbDJGH3iZzhCFmxto8xibZ4UC1SQtJbDEDmS4EN4QCbUe4OTrPAhkXKyb73arUKykTAUAgFahIIRLfApAAMFJKF0ylUGtwRD2+UBRIAgiT+APwwMcoPBHnKMJ88MCECRXUkaKgGOwWMZIFxvkkRKyDzJeEp1PWIFhE2MvM0+m+GP5Zw0mn+QnQPMhQsCjjgfF+4DGmVVUMyIjgsAcBR1FkG2BYZGizKeOH8AaDIaY/0BIDB5vyQA= [09:49:29] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1046.eqiad.wmnet with reason: host reimage [09:49:29] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1105 to an-backup-namenode1035 - btullis@cumin1003" [09:49:34] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1104 to an-backup-datanode1034 [09:49:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1105 to an-backup-namenode1035 - btullis@cumin1003" [09:49:50] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:49:50] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1035 on all recursors [09:49:53] And/or how do I find out all our IPs? I know 208. is usually us and 172.16. toolforge... 
[09:49:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-namenode1035 on all recursors [09:49:54] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1035 [09:49:54] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [09:50:28] Mvolz: 128.138.70.66 is university of colorado [09:50:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:50:34] (from whois) [09:51:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1035 [09:51:08] Mediawiki calls from inside the infra (such as restbase or other microservices) would not show up in turnilo as it's edge traffic [09:51:15] Toolforge would [09:51:44] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1105 to an-backup-namenode1035 [09:51:59] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7004.magru.wmnet to cluster magru03 and group B [09:52:07] ah okay. thank you! [09:52:10] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti7004.magru.wmnet to cluster magru03 and group B [09:52:16] Any ideas what this traffic actually is? [09:52:23] Mvolz: Our IP ranges https://wikitech.wikimedia.org/wiki/IP_and_AS_allocations [09:53:01] ok [09:53:39] I see you answered university of colorado... does the user-agent look familiar? Is that something we send elsewhere? [09:53:52] For traffic from the extension the request is made from the browser [09:54:05] so it's not a user-agent I would expect from extension traffic. 
[09:55:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:55:40] btullis@cumin1003 rename (PID 2556169) is awaiting input [09:57:03] Mvolz: Seems to be calls to /api/rest_v1/data/citation/zotero [09:57:39] Possibly they (mis?)configured their mediawiki instance to hit our zotero api endpoints? [09:58:49] that's actually a request format. - the zotero endpoint isn't publicly accessible. But yeah good call it's not a format we use [09:58:51] bot! [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1000) [10:00:29] (but the biggest offender seems to be coming from an AWS endpoint using a generic node library.) 
[10:01:17] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti7004.magru.wmnet to cluster magru03 and group B [10:01:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [10:04:36] jmm@cumin2002 addnode (PID 2432065) is awaiting input [10:05:19] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1103 to an-backup-namenode1033 [10:05:35] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1104 to an-backup-datanode1034 - btullis@cumin1003" [10:05:39] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:06:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti7004.magru.wmnet to cluster magru03 and group B [10:06:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1046.eqiad.wmnet with OS bookworm [10:06:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1104 to an-backup-datanode1034 - btullis@cumin1003" [10:06:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:06:24] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1034 on all recursors [10:06:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1034 on all recursors [10:06:29] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1034 [10:07:12] (03PS1) 10Cyndywikime: [Growth] enwiki: Deploy "Add a link" to 100% of users 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) [10:07:25] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1244.eqiad.wmnet with reason: Maintenance [10:08:37] 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. - https://phabricator.wikimedia.org/T345627#11093465 (10Mvolz) [10:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:09:34] btullis@cumin1003 rename (PID 2556169) is awaiting input [10:11:00] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2179.codfw.wmnet with reason: Maintenance [10:11:22] btullis@cumin1003 rename (PID 2560636) is awaiting input [10:12:29] 10SRE-SLO, 10Citoid, 10VisualEditor, 10Editing-team (Kanban Board): Seperate SLO for requests made from Citoid Extension, possible wmf deployed extension only, vs bots etc. 
- https://phabricator.wikimedia.org/T345627#11093472 (10Mvolz) [10:13:42] (03CR) 10Hnowlan: [C:03+2] rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [10:15:38] (03Merged) 10jenkins-bot: rest-gateway: use simplified list of rest.php APIs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1177966 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [10:15:42] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1103 to an-backup-namenode1033 - btullis@cumin1003" [10:15:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1034 [10:16:01] (03PS10) 10Giuseppe Lavagetto: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) [10:16:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1104 to an-backup-datanode1034 [10:17:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1103 to an-backup-namenode1033 - btullis@cumin1003" [10:17:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:17:11] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-namenode1033 on all recursors [10:17:14] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-namenode1033 on all recursors [10:17:15] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-namenode1033 [10:18:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-namenode1033 
[10:18:49] !log btullis@cumin1003 START - Cookbook sre.hosts.rename from an-worker1102 to an-backup-datanode1032 [10:19:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [10:19:09] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [10:19:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1103 to an-backup-namenode1033 [10:20:51] PROBLEM - Host ncredir7003 is DOWN: PING CRITICAL - Packet loss = 100% [10:22:27] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [10:22:34] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [10:24:16] 06SRE, 10SRE-Access-Requests: Superset / LDAP access - https://phabricator.wikimedia.org/T402022#11093510 (10Vgutierrez) p:05Triage→03Medium [10:24:53] btullis@cumin1003 rename (PID 2561583) is awaiting input [10:26:55] moritzm: ncredir7003 being down is related to your ganeti work in magru? 
[10:27:22] it looks like it, host hast been depooled [10:27:24] *has [10:28:03] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T402171 [10:28:07] T402171: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T402171 [10:28:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Set db2240 with weight 0 T402171', diff saved to https://phabricator.wikimedia.org/P81426 and previous config saved to /var/cache/conftool/dbconfig/20250818-102826-fceratto.json [10:28:38] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1102 to an-backup-datanode1032 - btullis@cumin1003" [10:29:00] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1045.eqiad.wmnet with OS bookworm [10:29:13] vgutierrez: yes, it's depooled, should be up again in a bit and then I'll repool it [10:29:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Remove db2240 from API/vslow/dump T402171', diff saved to https://phabricator.wikimedia.org/P81427 and previous config saved to /var/cache/conftool/dbconfig/20250818-102921-fceratto.json [10:30:28] moritzm: thx :D [10:31:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming an-worker1102 to an-backup-datanode1032 - btullis@cumin1003" [10:31:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:31:24] !log btullis@cumin1003 START - Cookbook sre.dns.wipe-cache an-backup-datanode1032 on all recursors [10:31:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-backup-datanode1032 on all recursors [10:31:27] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1032 [10:32:44] RESOLVED: [2x] 
RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [10:33:37] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru02_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:34:36] btullis@cumin1003 rename (PID 2561583) is awaiting input [10:36:22] !log ladsgroup@cumin1002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-eqiad [10:37:51] (03CR) 10Federico Ceratto: [C:03+2] mariadb: Promote db2240 to s4 master [puppet] - 10https://gerrit.wikimedia.org/r/1179642 (https://phabricator.wikimedia.org/T402171) (owner: 10Gerrit maintenance bot) [10:39:21] !log Starting s4 codfw failover from db2179 to db2240 - T402171 [10:39:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:26] T402171: Switchover s4 master (db2179 -> db2240) - https://phabricator.wikimedia.org/T402171 [10:39:34] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Outsdanding diff on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T402157#11093541 (10VRiley-WMF) Yes, we were testing some of the ports because we were troubleshooting some of these issues to find out wht was going on with some of... 
[10:40:18] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-eqiad [10:40:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1032 [10:41:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from an-worker1102 to an-backup-datanode1032 [10:41:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Promote db2240 to s4 primary T402171', diff saved to https://phabricator.wikimedia.org/P81428 and previous config saved to /var/cache/conftool/dbconfig/20250818-104158-fceratto.json [10:42:35] !log dropped _echo_target_page_new in aawiki x1 (T399302) [10:42:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:42:38] T399302: Catalog x1 tables - https://phabricator.wikimedia.org/T399302 [10:44:24] (03PS3) 10Vgutierrez: thanos: Add recording rules for xlab SLOs [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) [10:44:45] (03CR) 10Vgutierrez: thanos: Add recording rules for xlab SLOs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1178891 (https://phabricator.wikimedia.org/T398869) (owner: 10Vgutierrez) [10:46:17] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1044.eqiad.wmnet with OS bookworm [10:46:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'db2179: configure API group', diff saved to https://phabricator.wikimedia.org/P81429 and previous config saved to /var/cache/conftool/dbconfig/20250818-104617-fceratto.json [10:48:42] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Outsdanding diff on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T402157#11093573 (10ayounsi) We can do anything with that port. The most important part is that Netbox reflects what's going on exactly in the DC. Can you make sure... 
[10:51:22] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1066.eqiad.wmnet [10:54:48] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1045.eqiad.wmnet with reason: host reimage [10:57:47] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Outsdanding diff on cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T402157#11093614 (10VRiley-WMF) Okay, my apologies. I thought this was one of the cables that I was connected to one of the trouble servers. Currently, port 42 on D5... [10:57:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1045.eqiad.wmnet with reason: host reimage [10:59:11] !log installing libxml2 security updates [10:59:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:47] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [11:02:40] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11093631 (10VRiley-WMF) Understood, okay. I will create the other units to mark off in this ticket as we did receive 9 units. 
[11:03:07] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [11:03:08] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts an-worker1066.eqiad.wmnet [11:03:59] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11093633 (10VRiley-WMF) [11:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [11:04:39] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q1:rack/setup/install es1049-es1057 - https://phabricator.wikimedia.org/T400198#11093635 (10VRiley-WMF) [11:09:53] (03PS7) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:11:20] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [11:11:46] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1044.eqiad.wmnet with reason: host reimage [11:14:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1045.eqiad.wmnet with OS bookworm [11:15:26] (03CR) 10Cyndywikime: "Thus patch is ready for review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179648 (https://phabricator.wikimedia.org/T395524) (owner: 10Cyndywikime) [11:15:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1044.eqiad.wmnet with reason: host reimage [11:15:49] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-codfw [11:15:59] (03PS2) 10Tchanders: Document that IP reveal permissions can't just be reassigned [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1154308 (https://phabricator.wikimedia.org/T396217) [11:17:05] btullis@cumin1003 netbox (PID 2569777) is awaiting input [11:17:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-codfw [11:17:51] (03PS1) 10Stevemunene: Update firewall rules to add dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1179652 (https://phabricator.wikimedia.org/T397298) [11:17:58] (03PS8) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:19:05] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1066 to an-backup-datanode1002 - btullis@cumin1003" [11:19:09] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1043.eqiad.wmnet with OS bookworm [11:22:10] btullis@cumin1003 netbox (PID 2569777) is awaiting input [11:22:38] (03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1172273 (owner: 10PipelineBot) [11:22:53] (03PS2) 10Stevemunene: dse-k8s: setup the dse-k8s-codfw helmfile.d structure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1178827 (https://phabricator.wikimedia.org/T397297) [11:22:53] (03PS1) 10Stevemunene: dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) [11:22:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1066 to an-backup-datanode1002 - btullis@cumin1003" [11:22:59] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:23:54] !log btullis@cumin1003 START - 
Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1002 [11:25:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1002 [11:25:39] (03CR) 10CI reject: [V:04-1] dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:26:43] (03PS9) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:26:47] !log jmm@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies rolling restart_daemons on A:thanos-fe-eqiad [11:27:19] (03CR) 10Vgutierrez: C:ip_reputation_vendors::datacenter_vendors: Known datacenters (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [11:27:57] (03PS1) 10Mszwarc: Enable IP Reveal on Special:AbuseLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179655 [11:28:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-thanos-proxies (exit_code=0) rolling restart_daemons on A:thanos-fe-eqiad [11:28:49] (03PS10) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:29:33] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [11:29:40] jouncebot: nowandnext [11:29:41] No deployments scheduled for the 
next 1 hour(s) and 30 minute(s) [11:29:41] In 1 hour(s) and 30 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1300) [11:30:21] (03PS11) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:31:01] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-backup-datanode1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:32:26] (03PS12) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:32:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1044.eqiad.wmnet with OS bookworm [11:33:52] (03PS1) 10Ladsgroup: Introduce rights for checking constraints [extensions/WikibaseQualityConstraints] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179657 (https://phabricator.wikimedia.org/T401789) [11:34:36] (03CR) 10Ladsgroup: [C:03+2] Introduce rights for checking constraints [extensions/WikibaseQualityConstraints] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179657 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:34:53] (03PS13) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:35:17] btullis@cumin1003 provision (PID 2574465) is awaiting input [11:37:28] RECOVERY - Host ncredir7003 is UP: PING OK - Packet loss = 0%, RTA = 111.08 ms [11:39:39] (03PS3) 10Ladsgroup: Check permission to check constraints [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) [11:39:39] (03CR) 10Btullis: [C:03+1] Update 
firewall rules to add dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1179652 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:40:36] (03CR) 10Ladsgroup: [C:03+2] Check permission to check constraints [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:41:26] (03CR) 10CI reject: [V:04-1] Check permission to check constraints [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:42:38] (03CR) 10Btullis: dse-k8s: Add helmfile configuration for dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:43:10] RESOLVED: [2x] GanetiBGPDown: BGP session down between ganeti7004 and asw1-b4-magru - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown [11:43:36] (03PS4) 10Ladsgroup: Check permission to check constraints [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) [11:43:40] (03CR) 10Ladsgroup: [C:03+2] Check permission to check constraints [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:44:31] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1043.eqiad.wmnet with reason: host reimage [11:46:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/WikibaseQualityConstraints] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179657 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:46:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved 
by ladsgroup@deploy1003 using scap backport" [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [11:47:47] (03PS1) 10Muehlenhoff: Create repository components for Bird version with support for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179660 (https://phabricator.wikimedia.org/T362392) [11:47:47] !log jforrester@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [11:48:08] !log jforrester@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [11:48:38] (03PS14) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:48:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1043.eqiad.wmnet with reason: host reimage [11:49:00] (03Merged) 10jenkins-bot: Introduce rights for checking constraints [extensions/WikibaseQualityConstraints] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179657 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:50:45] (03CR) 10Muehlenhoff: "Not sold on the component name, happy to change to something else." 
[puppet] - 10https://gerrit.wikimedia.org/r/1179660 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [11:50:45] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1042.eqiad.wmnet with OS bookworm [11:50:55] (03PS1) 10KartikMistry: Update Recommendation API to 2025-07-25-064834-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179661 (https://phabricator.wikimedia.org/T399117) [11:50:56] (03PS2) 10Stevemunene: dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) [11:52:00] (03CR) 10Jforrester: [metawiki] Set site name to 'Meta-Wiki', not just 'Meta' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) (owner: 10Jforrester) [11:52:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-backup-datanode1002.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [11:52:36] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) (owner: 10Jforrester) [11:53:23] (03CR) 10CI reject: [V:04-1] dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [11:53:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1002.eqiad.wmnet with OS bookworm [11:54:47] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1067.eqiad.wmnet [11:55:00] (03PS15) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 
10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:55:58] (03CR) 10Tchanders: [C:03+1] Enable IP Reveal on Special:AbuseLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179655 (owner: 10Mszwarc) [11:56:00] (03Merged) 10jenkins-bot: Check permission to check constraints [extensions/WikibaseMediaInfo] (wmf/1.45.0-wmf.14) - 10https://gerrit.wikimedia.org/r/1179658 (https://phabricator.wikimedia.org/T401789) (owner: 10Ladsgroup) [11:56:16] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1179657|Introduce rights for checking constraints (T401789)]], [[gerrit:1179658|Check permission to check constraints (T401789)]] [11:56:19] T401789: Limit Special:ConstraintReport to logged-in users - https://phabricator.wikimedia.org/T401789 [11:57:49] (03PS16) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [11:58:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179655 (owner: 10Mszwarc) [11:58:51] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6613/co" [puppet] - 10https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: 10Slyngshede) [12:00:39] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:03:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.017s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:04:11] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:04:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1067.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [12:04:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:04:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker1067.eqiad.wmnet [12:06:10] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1043.eqiad.wmnet with OS bookworm [12:07:42] (03PS1) 10Hnowlan: rest-gateway: allow route definition to reuse clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179663 (https://phabricator.wikimedia.org/T400132) [12:08:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.968s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:08:45] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.25s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - 
https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:08:50] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [12:09:20] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1041.eqiad.wmnet with OS bookworm [12:14:35] btullis@cumin1003 netbox (PID 2580192) is awaiting input [12:14:58] (03CR) 10Stevemunene: [C:03+2] Update firewall rules to add dse-k8s-codfw [puppet] - 10https://gerrit.wikimedia.org/r/1179652 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [12:15:49] !log fceratto@cumin1002 START - Cookbook sre.mysql.upgrade for db2179.codfw.wmnet [12:15:59] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool db2179 - Upgrading db2179.codfw.wmnet [12:16:18] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2179 - Upgrading db2179.codfw.wmnet [12:16:52] (03CR) 10Abijeet Patro: [C:03+1] MinT: Add stream configuration and registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: 10Huei Tan) [12:16:55] (03CR) 10Clément Goubert: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179663 (https://phabricator.wikimedia.org/T400132) (owner: 10Hnowlan) [12:17:41] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1042.eqiad.wmnet with reason: host reimage [12:18:58] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1067 to an-backup-datanode1003 - btullis@cumin1003" [12:19:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1067 to 
an-backup-datanode1003 - btullis@cumin1003" [12:19:03] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:19:16] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1002.eqiad.wmnet with reason: host reimage [12:19:35] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1179657|Introduce rights for checking constraints (T401789)]], [[gerrit:1179658|Check permission to check constraints (T401789)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:19:39] T401789: Limit Special:ConstraintReport to logged-in users - https://phabricator.wikimedia.org/T401789 [12:19:45] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1003 [12:20:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1003 [12:21:15] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-backup-datanode1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:21:16] Amir1: works on WikimediaDebug AFAICT [12:21:38] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [12:21:48] thanks. 
I was double checking [12:22:44] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179669 [12:23:12] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2179.codfw.wmnet [12:23:14] !log sudo puppet cert clean push-notifications.discovery.wmnet - T402183 [12:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:17] T402183: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://phabricator.wikimedia.org/T402183 [12:23:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1042.eqiad.wmnet with reason: host reimage [12:23:49] hm, I don’t see wbcheckconstraints requests coming out of commons though? I might be doing something wrong [12:24:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-backup-datanode1003.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [12:25:49] (03CR) 10Ayounsi: [C:03+1] "that's fine, iirc it's only temporary until bird 2.18 is released ?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1179660 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [12:26:22] ok, it works on another file 🤷 [12:26:33] I get constraint checks on https://commons.wikimedia.org/wiki/File:PNG_Test.png but not on https://commons.wikimedia.org/wiki/File:CSD_Berlin_2019_-_Lucas_Werkmeister_-_24_-_Bi,_Pan,_Ace_Flags.jpg [12:27:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1002.eqiad.wmnet with reason: host reimage [12:27:55] (03PS2) 10Stevemunene: dse-k8s: add dse-k8s-codfw hosts to LVS [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) [12:28:19] (03PS1) 10Filippo Giunchedi: openstack: acquire cfssl certs for libvirt [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) [12:28:31] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.234s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:28:47] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate push-notifications.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:30:04] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6614/co" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad 
mw-parsoid releases routed via main (k8s) 1.415s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:31:44] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6615/co" [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:32:54] (03PS1) 10Clément Goubert: mw-parsoid: Scale down mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179682 [12:33:33] (03CR) 10Filippo Giunchedi: [V:03+1] "The diff is larger than I thought due to indentation changes. Tested in Pontoon to make sure nothing was obviously wrong." [puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:33:33] (03PS3) 10Anzx: eswiki, commons, wikidatawiki: IP cap lift for wikipedia workshop on 2025-August-23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179155 (https://phabricator.wikimedia.org/T401745) [12:34:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179155 (https://phabricator.wikimedia.org/T401745) (owner: 10Anzx) [12:34:23] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179657|Introduce rights for checking constraints (T401789)]], [[gerrit:1179658|Check permission to check constraints (T401789)]] (duration: 38m 07s) [12:34:28] T401789: Limit Special:ConstraintReport to logged-in users - https://phabricator.wikimedia.org/T401789 
[12:35:19] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1041.eqiad.wmnet with reason: host reimage [12:36:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178990 (https://phabricator.wikimedia.org/T399455) (owner: 10Zabe) [12:36:56] !log installing apache2 security updates [12:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:05] (03Merged) 10jenkins-bot: Reduce default recentchanges query time on large wikis to 1 day [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1178990 (https://phabricator.wikimedia.org/T399455) (owner: 10Zabe) [12:37:23] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1178990|Reduce default recentchanges query time on large wikis to 1 day (T399455)]] [12:37:27] T399455: Change default recentchanges query time on large wikis - https://phabricator.wikimedia.org/T399455 [12:37:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [12:38:55] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool db2179* gradually with 4 steps - Upgrade MariaDB [12:38:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1041.eqiad.wmnet with reason: host reimage [12:40:15] FIRING: MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus/ops&viewPanel=19 - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:40:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1042.eqiad.wmnet with OS bookworm [12:41:09] !log ladsgroup@deploy1003 zabe, ladsgroup: Backport for [[gerrit:1178990|Reduce default recentchanges query time on large wikis to 1 day (T399455)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:42:41] (03CR) 10Majavah: [C:03+2] P:toolforge::proxy: Collect network error reports [puppet] - 10https://gerrit.wikimedia.org/r/1178489 (https://phabricator.wikimedia.org/T400994) (owner: 10Majavah) [12:42:58] !log ladsgroup@deploy1003 zabe, ladsgroup: Continuing with sync [12:45:02] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:45:12] (03PS5) 10Krinkle: mediawiki: Fix non-redirecting download.mediawiki.org domain [puppet] - 10https://gerrit.wikimedia.org/r/1174872 (https://phabricator.wikimedia.org/T50517) [12:45:15] FIRING: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:45:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.527s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:45:38] (03CR) 10Muehlenhoff: "Well kinda, 2.18 won't appear in Debian stable releases for another two years, but if we update the version of 
bird2 for all uses, we also" [puppet] - 10https://gerrit.wikimedia.org/r/1179660 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [12:45:39] (03CR) 10Muehlenhoff: [C:03+2] Create repository components for Bird version with support for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179660 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [12:48:07] btullis@cumin1003 reimage (PID 2577409) is awaiting input [12:49:59] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1178990|Reduce default recentchanges query time on large wikis to 1 day (T399455)]] (duration: 12m 36s) [12:50:03] T399455: Change default recentchanges query time on large wikis - https://phabricator.wikimedia.org/T399455 [12:50:15] RESOLVED: [2x] MediaWikiMemcachedHighErrorRate: MediaWiki memcached error rate is elevated globally - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiMemcachedHighErrorRate [12:52:06] (03CR) 10Filippo Giunchedi: [V:03+1] "I'm also not 100% sure about the hiera location, please let me know!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: 10Filippo Giunchedi) [12:55:56] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1041.eqiad.wmnet with OS bookworm [12:56:04] (03PS1) 10Jdrewniak: Catalog Extension:ReadingLists tables [puppet] - 10https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) [12:58:16] (03CR) 10CI reject: [V:04-1] Catalog Extension:ReadingLists tables [puppet] - 10https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) (owner: 10Jdrewniak) [12:58:17] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [12:58:18] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1002.eqiad.wmnet with OS bookworm [12:58:49] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1040.eqiad.wmnet with OS bookworm [12:59:10] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1039.eqiad.wmnet with OS bookworm [12:59:44] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1003.eqiad.wmnet with OS bookworm [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1300). [13:00:05] _Gerges, EggRoll97, James_F, Msz2001, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] * James_F waves. 
[13:00:17] o/ [13:00:24] o/ [13:00:25] o/ [13:00:31] I’m in a meeting but could deploy soon probably [13:00:37] Ditto. [13:01:29] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1068.eqiad.wmnet [13:02:50] OK, I'm free. [13:03:03] Let's do them all at once. [13:03:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) (owner: 10Jforrester) [13:03:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179655 (owner: 10Mszwarc) [13:03:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179155 (https://phabricator.wikimedia.org/T401745) (owner: 10Anzx) [13:03:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179242 (https://phabricator.wikimedia.org/T401070) (owner: 10GergesShamon) [13:03:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179264 (https://phabricator.wikimedia.org/T401350) (owner: 10EggRoll97) [13:04:20] (03CR) 10Btullis: dse-k8s: Add helmfile configuration for dse-k8s-codfw (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [13:04:34] (03Merged) 10jenkins-bot: [metawiki] Set site name to 'Meta-Wiki', not just 'Meta' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170339 (https://phabricator.wikimedia.org/T399843) (owner: 10Jforrester) [13:04:36] (03Merged) 10jenkins-bot: Enable IP Reveal on Special:AbuseLog [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179655 (owner: 10Mszwarc) [13:04:38] 
(03Merged) 10jenkins-bot: eswiki, commons, wikidatawiki: IP cap lift for wikipedia workshop on 2025-August-23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179155 (https://phabricator.wikimedia.org/T401745) (owner: 10Anzx) [13:04:42] (03Merged) 10jenkins-bot: [zhwikisource] Set noindex,nofollow for namespaces User and User Talk [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179242 (https://phabricator.wikimedia.org/T401070) (owner: 10GergesShamon) [13:04:44] (03Merged) 10jenkins-bot: Add Oath log to bureaucrats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179264 (https://phabricator.wikimedia.org/T401350) (owner: 10EggRoll97) [13:04:49] (03CR) 10Ladsgroup: "the CI is failing because the tables are added out of order of sources, it should be added after property_suggester tables." [puppet] - 10https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) (owner: 10Jdrewniak) [13:04:58] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1170339|[metawiki] Set site name to 'Meta-Wiki', not just 'Meta' (T399843)]], [[gerrit:1179655|Enable IP Reveal on Special:AbuseLog]], [[gerrit:1179155|eswiki, commons, wikidatawiki: IP cap lift for wikipedia workshop on 2025-August-23 (T401745)]], [[gerrit:1179242|[zhwikisource] Set noindex,nofollow for namespaces User and User Talk (T401070)]], [[ [13:04:58] gerrit:1179264|Add Oath log to bureaucrats (T401350)]] [13:05:06] T399843: Set sitename for metawiki to 'Meta-Wiki', not just 'Meta' - https://phabricator.wikimedia.org/T399843 [13:05:06] T401745: Lift IP cap for 161.132.238.4 on 2025-08-23 - https://phabricator.wikimedia.org/T401745 [13:05:07] T401070: noindex all pages in User namespace in Chinese Wikisource - https://phabricator.wikimedia.org/T401070 [13:05:07] T401350: Bureaucrats should be able to access Special:Log/oath - https://phabricator.wikimedia.org/T401350 [13:06:23] btullis@cumin1003 decommission (PID 2591810) is awaiting input [13:06:49] !log 
jforrester@deploy1003 mszwarc, eggroll97, gergesshamon, anzx, jforrester: Backport for [[gerrit:1170339|[metawiki] Set site name to 'Meta-Wiki', not just 'Meta' (T399843)]], [[gerrit:1179655|Enable IP Reveal on Special:AbuseLog]], [[gerrit:1179155|eswiki, commons, wikidatawiki: IP cap lift for wikipedia workshop on 2025-August-23 (T401745)]], [[gerrit:1179242|[zhwikisource] Set noindex,nofollow for namespaces User an [13:06:49] d User Talk (T401070)]], [[gerrit:1179264|Add Oath log to bureaucrats (T401350)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:07:00] Okie-dokie, let's check these work as expected. [13:07:11] Msz2001, anzx: Can you check yours? [13:07:20] Mine works fine [13:07:26] !log (cont.) User Talk (T401070)]], [[gerrit:1179264|Add Oath log to bureaucrats (T401350)]] synced to the testservers [13:07:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:09] James_F: nothing to check on throttle, good to sync [13:08:25] Excellent, I think we're good to go on the other three. 
[13:08:28] !log jforrester@deploy1003 mszwarc, eggroll97, gergesshamon, anzx, jforrester: Continuing with sync [13:10:34] (03Abandoned) 10Jforrester: Clean up wmgWikibaseSiteGroup list, alpha-sort and de-dupe [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1170565 (owner: 10Jforrester) [13:11:54] (03CR) 10Ssingh: [C:03+1] "You will need to add to assign an IP for k8s-ingress-dse.svc.codfw.wmnet in netbox (see https://wikitech.wikimedia.org/wiki/LVS#DNS_change" [puppet] - 10https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [13:13:26] <_Gerges> Here [13:13:33] <_Gerges> Here [13:13:46] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1170339|[metawiki] Set site name to 'Meta-Wiki', not just 'Meta' (T399843)]], [[gerrit:1179655|Enable IP Reveal on Special:AbuseLog]], [[gerrit:1179155|eswiki, commons, wikidatawiki: IP cap lift for wikipedia workshop on 2025-August-23 (T401745)]], [[gerrit:1179242|[zhwikisource] Set noindex,nofollow for namespaces User and User Talk (T401070)]], [ [13:13:47] [gerrit:1179264|Add Oath log to bureaucrats (T401350)]] (duration: 08m 48s) [13:13:48] OK, all done. Deploy window complete. [13:13:53] T399843: Set sitename for metawiki to 'Meta-Wiki', not just 'Meta' - https://phabricator.wikimedia.org/T399843 [13:13:54] T401745: Lift IP cap for 161.132.238.4 on 2025-08-23 - https://phabricator.wikimedia.org/T401745 [13:13:54] T401070: noindex all pages in User namespace in Chinese Wikisource - https://phabricator.wikimedia.org/T401070 [13:13:55] T401350: Bureaucrats should be able to access Special:Log/oath - https://phabricator.wikimedia.org/T401350 [13:14:00] _Gerges: We deployed your patch in your absence, don't worry. 
[13:14:06] !log imported bird2 2.17.1+branch.mq.bgp.multilisten.c47b08a1524c-cznic.1 into component/bird-routed-ganeti for Bookworm T362392 [13:14:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:14:10] T362392: Routed Ganeti: Add support for VM BGP - https://phabricator.wikimedia.org/T362392 [13:14:37] <_Gerges> Thank you all [13:15:12] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:19:26] (03PS2) 10Jdrewniak: Catalog Extension:ReadingLists tables [puppet] - 10https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) [13:20:14] (03CR) 10CDanis: [C:03+1] P:toolforge::proxy: Collect network error reports [puppet] - 10https://gerrit.wikimedia.org/r/1178489 (https://phabricator.wikimedia.org/T400994) (owner: 10Majavah) [13:20:31] !log fceratto@cumin1002 DONE (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 8:00:00 on db2219.codfw.wmnet with reason: Maintenance [13:20:43] (03CR) 10Jgiannelos: [C:03+1] mw-parsoid: Scale down mw-parsoid [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179682 (owner: 10Clément Goubert) [13:20:57] btullis@cumin1003 decommission (PID 2591810) is awaiting input [13:22:08] (03PS1) 10Ssingh: wikimedia.ee: remove ncredir parking [dns] - 10https://gerrit.wikimedia.org/r/1179687 [13:24:15] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1040.eqiad.wmnet with reason: host reimage [13:24:28] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1068.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [13:24:34] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2179* gradually with 4 steps - Upgrade MariaDB [13:25:41] (03CR) 10Vgutierrez: [C:03+1] wikimedia.ee: remove ncredir parking [dns] - 10https://gerrit.wikimedia.org/r/1179687 (owner: 10Ssingh) [13:25:47] !log 
btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1003.eqiad.wmnet with reason: host reimage [13:26:19] (03CR) 10Ssingh: [C:03+2] wikimedia.ee: remove ncredir parking [dns] - 10https://gerrit.wikimedia.org/r/1179687 (owner: 10Ssingh) [13:26:55] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1039.eqiad.wmnet with reason: host reimage [13:27:00] !log sukhe@dns1004 START - running authdns-update [13:27:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1068.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [13:27:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:27:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker1068.eqiad.wmnet [13:28:01] !log sukhe@dns1004 END - running authdns-update [13:29:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1040.eqiad.wmnet with reason: host reimage [13:29:28] (03PS1) 10Ssingh: hiera: ncmonitor: add wikimedia.ee to ignored_domains [puppet] - 10https://gerrit.wikimedia.org/r/1179688 [13:30:10] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6616/co" [puppet] - 10https://gerrit.wikimedia.org/r/1179688 (owner: 10Ssingh) [13:30:18] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:32:09] (03CR) 10Vgutierrez: [C:03+1] "looks good to me but let's wait for Brett" [puppet] - 10https://gerrit.wikimedia.org/r/1179688 (owner: 10Ssingh) [13:32:12] (03PS1) 10Muehlenhoff: Add a parameter to the Bird class to install the component enabled for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179689 
(https://phabricator.wikimedia.org/T362392) [13:32:19] (03PS3) 10Btullis: dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: 10Stevemunene) [13:32:39] (03CR) 10Hashar: "> Php::Extension[apc]: has no parameter named 'shm_size'" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [13:32:43] (03CR) 10CI reject: [V:04-1] Add a parameter to the Bird class to install the component enabled for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: 10Muehlenhoff) [13:32:46] (03PS12) 10Hashar: phabricator: increase APCu shared memory segment size [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [13:33:00] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1039.eqiad.wmnet with reason: host reimage [13:34:53] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:35:10] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: 10Dzahn) [13:35:25] FIRING: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:46] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [13:36:54] (03PS2) 10Muehlenhoff: Add a parameter to the Bird class to install the Bird enabled for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) [13:37:03] (03PS17) 10Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - 
https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [13:37:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1003.eqiad.wmnet with reason: host reimage [13:37:28] (CR) CI reject: [V:-1] Add a parameter to the Bird class to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [13:38:00] (PS3) Muehlenhoff: Add a parameter to the Bird class to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) [13:38:49] (CR) Hashar: "The Puppet compile infra has an issue of some sort :-\" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [13:39:51] btullis: ^ cookbook needs your input to unblock DNS changes. thanks :) [13:40:25] RESOLVED: SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:41:22] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [13:42:16] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1068 to an-backup-datanode1004 - btullis@cumin1003" [13:42:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1068 to an-backup-datanode1004 - btullis@cumin1003" [13:42:20] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:44:14] 
(PS18) Slyngshede: C:ip_reputation_vendors::datacenter_vendors: Known datacenters [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) [13:44:25] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-codfw [13:44:53] (CR) Hnowlan: [C:+1] mw-parsoid: Scale down mw-parsoid [deployment-charts] - https://gerrit.wikimedia.org/r/1179682 (owner: Clément Goubert) [13:45:00] (CR) Slyngshede: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6618/console" [puppet] - https://gerrit.wikimedia.org/r/1178866 (https://phabricator.wikimedia.org/T398161) (owner: Slyngshede) [13:45:02] (CR) Btullis: [C:+1] dse-k8s: Add helmfile configuration for dse-k8s-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/1179654 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [13:45:31] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1004 [13:45:40] (CR) Ssingh: Add a parameter to the Bird class to install the Bird enabled for routed Ganeti (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [13:46:01] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1040.eqiad.wmnet with OS bookworm [13:46:57] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1004 [13:47:30] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1038.eqiad.wmnet with OS bookworm [13:47:43] (PS1) Hashar: Add missing passwords::mysql::phabricator::phd_user [labs/private] - https://gerrit.wikimedia.org/r/1179693 [13:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad 
RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [13:47:58] (CR) Hashar: [V:+2 C:+2] Add missing passwords::mysql::phabricator::phd_user [labs/private] - https://gerrit.wikimedia.org/r/1179693 (owner: Hashar) [13:49:03] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [13:49:10] jouncebot: nowandnext [13:49:11] For the next 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1300) [13:49:11] In 0 hour(s) and 40 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1430) [13:49:32] (CR) Hnowlan: [C:+2] rest-gateway: allow route definition to reuse clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1179663 (https://phabricator.wikimedia.org/T400132) (owner: Hnowlan) [13:49:36] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-backup-datanode1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [13:49:39] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1039.eqiad.wmnet with OS bookworm [13:51:50] (Merged) jenkins-bot: rest-gateway: allow route definition to reuse clusters [deployment-charts] - https://gerrit.wikimedia.org/r/1179663 (https://phabricator.wikimedia.org/T400132) (owner: Hnowlan) [13:52:44] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:54:19] (PS1) Dbrant: mariadb: Document WikimediaEditorTasks 
tables. [puppet] - https://gerrit.wikimedia.org/r/1179695 (https://phabricator.wikimedia.org/T399302) [13:55:13] (CR) Hashar: "https://puppet-compiler.wmflabs.org/output/1175916/4716/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [13:55:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [13:55:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1003.eqiad.wmnet with OS bookworm [13:56:05] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:56:15] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:56:18] (PS13) Hashar: phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [13:56:55] btullis@cumin1003 provision (PID 2601664) is awaiting input [13:57:27] (CR) Filippo Giunchedi: [V:+1] "Also note that live migration testing will happen in codfw1dev post-merge, eqiad is not affected" [puppet] - https://gerrit.wikimedia.org/r/1179678 (https://phabricator.wikimedia.org/T355145) (owner: Filippo Giunchedi) [13:58:01] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [14:00:37] (CR) Herron: [C:+1] prometheus::alert::rule: use title to deduplicate resources [puppet] - https://gerrit.wikimedia.org/r/1178883 (https://phabricator.wikimedia.org/T381665) (owner: Tiziano Fogli) [14:01:27] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-codfw [14:01:30] !log btullis@cumin1003 
START - Cookbook sre.dns.netbox [14:01:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [14:02:20] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:02:32] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [14:03:11] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-backup-datanode1004.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:03:38] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1004.eqiad.wmnet with OS bookworm [14:03:50] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:sessionstore: Upgrading to Java 11.0.28 - eevans@cumin1002 [14:05:08] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-backup-namenode1037 to an-backup-datanode1037 - btullis@cumin1003" [14:05:25] (PS1) Hnowlan: rest-gateway: correct naming of rest.php cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1179696 (https://phabricator.wikimedia.org/T400132) [14:05:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-backup-namenode1037 to an-backup-datanode1037 - btullis@cumin1003" [14:05:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:06:15] (CR) Herron: [C:+1] nrpe wrapper: define Prometheus alerts via Puppet resources [puppet] - https://gerrit.wikimedia.org/r/1174729 (https://phabricator.wikimedia.org/T395446) (owner: 
Tiziano Fogli) [14:06:30] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1037 [14:06:37] (PS4) Muehlenhoff: Bird: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) [14:06:49] (CR) Muehlenhoff: Bird: Add a parameter to install the Bird enabled for routed Ganeti (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [14:07:08] (CR) Herron: [C:+1] centrallog: Remove unused debug logging config [puppet] - https://gerrit.wikimedia.org/r/1179228 (https://phabricator.wikimedia.org/T383309) (owner: Andrea Denisse) [14:07:33] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1037 [14:08:14] (CR) Hashar: "https://puppet-compiler.wmflabs.org/output/1175916/4717/phab1004.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [14:08:21] (CR) Hashar: [C:+1] phabricator: increase APCu shared memory segment size [puppet] - https://gerrit.wikimedia.org/r/1175916 (https://phabricator.wikimedia.org/T401157) (owner: Dzahn) [14:08:40] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [14:09:05] (CR) Btullis: [C:+1] cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [14:09:07] !log herron@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling restart_daemons on A:kafka-logging-eqiad [14:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:10:25] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-backup-datanode1037.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:11:44] (CR) TChin: [C:+2] [eventstreams] Bump version 0.18.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1178550 (https://phabricator.wikimedia.org/T390140) (owner: TChin) [14:13:04] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1038.eqiad.wmnet with reason: host reimage [14:13:04] (PS1) KartikMistry: Content Translation: Remove unused configuration parameter [mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) [14:13:42] (Merged) jenkins-bot: [eventstreams] Bump version 0.18.0 [deployment-charts] - https://gerrit.wikimedia.org/r/1178550 (https://phabricator.wikimedia.org/T390140) (owner: TChin) [14:14:32] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts an-worker1069.eqiad.wmnet [14:15:02] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, August 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) (owner: KartikMistry) [14:15:22] (CR) Ssingh: [C:+1] Bird: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [14:15:51] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/1179698 (https://phabricator.wikimedia.org/T400671) (owner: KartikMistry) [14:18:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1038.eqiad.wmnet with reason: host reimage [14:21:00] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [14:24:47] (CR) Bking: [C:+2] cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1178613 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [14:26:01] (PS1) Ebernhardson: flink chart: Add a comment label [deployment-charts] - https://gerrit.wikimedia.org/r/1179701 [14:26:20] btullis@cumin1003 provision (PID 2605066) is awaiting input [14:26:46] btullis@cumin1003 decommission (PID 2605282) is awaiting input [14:26:46] (PS2) Ebernhardson: flink chart: Add a comment label [deployment-charts] - https://gerrit.wikimedia.org/r/1179701 [14:27:00] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1004.eqiad.wmnet with reason: host reimage [14:28:26] (CR) Hnowlan: [C:+2] rest-gateway: correct naming of rest.php cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1179696 (https://phabricator.wikimedia.org/T400132) (owner: Hnowlan) [14:28:29] !log herron@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling restart_daemons on A:kafka-logging-eqiad [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1430) [14:30:08] (Merged) jenkins-bot: rest-gateway: correct naming of rest.php cluster [deployment-charts] - https://gerrit.wikimedia.org/r/1179696 (https://phabricator.wikimedia.org/T400132) (owner: Hnowlan) [14:32:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1004.eqiad.wmnet with reason: host reimage [14:32:44] 
RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:33:21] (PS1) Bking: Revert "cirrussearch: Fix logstash/log4j config" [puppet] - https://gerrit.wikimedia.org/r/1179703 [14:33:31] (CR) Bking: [V:+2 C:+2] Revert "cirrussearch: Fix logstash/log4j config" [puppet] - https://gerrit.wikimedia.org/r/1179703 (owner: Bking) [14:34:19] (CR) Muehlenhoff: [C:+2] Bird: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179689 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [14:35:31] (PS3) Jdrewniak: Catalog Extension:ReadingLists tables [puppet] - https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) [14:35:40] (CR) Ladsgroup: [C:+2] Catalog Extension:ReadingLists tables [puppet] - https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) (owner: Jdrewniak) [14:35:41] (CR) Ladsgroup: [V:+2 C:+2] Catalog Extension:ReadingLists tables [puppet] - https://gerrit.wikimedia.org/r/1179683 (https://phabricator.wikimedia.org/T399302) (owner: Jdrewniak) [14:35:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1038.eqiad.wmnet with OS bookworm [14:36:54] btullis@cumin1003 provision (PID 2605066) is awaiting input [14:37:15] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:37:24] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:37:49] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [14:38:02] !log hnowlan@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/rest-gateway: apply [14:39:09] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [14:39:17] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [14:39:26] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:sessionstore: Upgrading to Java 11.0.28 - eevans@cumin1002 [14:39:45] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [14:40:22] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [14:41:47] (PS3) Ebernhardson: flink chart: Add a comment label [deployment-charts] - https://gerrit.wikimedia.org/r/1179701 [14:42:26] (PS1) Muehlenhoff: ganeti-routed: Enable bird component for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179706 (https://phabricator.wikimedia.org/T362392) [14:43:13] (CR) Santiago Faci: MinT: Add stream configuration and registration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan) [14:43:24] (CR) Vgutierrez: varnish: refactor inclusion of requestctl rules (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: Giuseppe Lavagetto) [14:43:24] (CR) Clément Goubert: [C:+2] mw-parsoid: Scale down mw-parsoid [deployment-charts] - https://gerrit.wikimedia.org/r/1179682 (owner: Clément Goubert) [14:44:06] (PS1) Bking: cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1179708 (https://phabricator.wikimedia.org/T395571) [14:44:53] (Merged) jenkins-bot: mw-parsoid: Scale down mw-parsoid [deployment-charts] - https://gerrit.wikimedia.org/r/1179682 (owner: Clément Goubert) [14:45:18] !log cgoubert@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [14:45:29] !log 
cgoubert@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [14:45:34] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [14:45:36] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179706 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [14:45:40] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [14:46:03] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [14:46:11] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, August 19 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan) [14:46:14] (PS1) Andrew Bogott: eqiad1 cloudceph: upgrade one osd and one mon node to ceph 'quincy' [puppet] - https://gerrit.wikimedia.org/r/1179709 (https://phabricator.wikimedia.org/T402190) [14:46:42] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179708 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [14:47:12] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [14:47:20] (PS1) Scott French: deployment_server: switch mw-debug/next to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1177420 (https://phabricator.wikimedia.org/T401254) [14:48:20] (CR) Clément Goubert: [C:+1] k8s-ops: add disk space check overrides [alerts] - https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite) [14:48:44] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [14:49:23] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply 
[14:49:37] (PS1) Ayounsi: Add all Nokia switches to Rancid [puppet] - https://gerrit.wikimedia.org/r/1179711 [14:50:30] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [14:50:38] (PS1) Muehlenhoff: Also update tracked email address [puppet] - https://gerrit.wikimedia.org/r/1179712 (https://phabricator.wikimedia.org/T401882) [14:51:41] (CR) Bking: [C:+2] cirrussearch: Fix logstash/log4j config [puppet] - https://gerrit.wikimedia.org/r/1179708 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [14:51:53] btullis@cumin1003 reimage (PID 2604575) is awaiting input [14:52:06] (CR) Bking: [C:+2] "self-merging, as this change was reviewed in I587235fb9937958115bb3637ca7a29028e801211 ." [puppet] - https://gerrit.wikimedia.org/r/1179708 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [14:52:33] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2151.codfw.wmnet with reason: Maintenance [14:52:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T399249)', diff saved to https://phabricator.wikimedia.org/P81439 and previous config saved to /var/cache/conftool/dbconfig/20250818-145240-fceratto.json [14:52:45] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [14:53:08] (CR) Clément Goubert: [C:+1] deployment_server: switch mw-debug/next to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1177420 (https://phabricator.wikimedia.org/T401254) (owner: Scott French) [14:53:19] (PS1) Ladsgroup: common.yaml: Remove private tables that have been cataloged [puppet] - https://gerrit.wikimedia.org/r/1179714 (https://phabricator.wikimedia.org/T399302) [14:53:23] (PS1) Clément Goubert: mw-debug: switch php.version to 8.3 [deployment-charts] - https://gerrit.wikimedia.org/r/1177421 (https://phabricator.wikimedia.org/T401254) (owner: Scott French) [14:53:32] 
(CR) Alexandros Kosiaris: [C:+1] deployment_server: switch mw-debug/next to PHP 8.3 [puppet] - https://gerrit.wikimedia.org/r/1177420 (https://phabricator.wikimedia.org/T401254) (owner: Scott French) [14:53:35] (CR) CI reject: [V:-1] common.yaml: Remove private tables that have been cataloged [puppet] - https://gerrit.wikimedia.org/r/1179714 (https://phabricator.wikimedia.org/T399302) (owner: Ladsgroup) [14:54:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399249)', diff saved to https://phabricator.wikimedia.org/P81440 and previous config saved to /var/cache/conftool/dbconfig/20250818-145450-fceratto.json [14:54:51] (CR) Scott French: [C:+1] k8s-ops: add disk space check overrides [alerts] - https://gerrit.wikimedia.org/r/1179177 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite) [14:55:59] (PS2) Ladsgroup: common.yaml: Remove private tables that have been cataloged [puppet] - https://gerrit.wikimedia.org/r/1179714 (https://phabricator.wikimedia.org/T399302) [14:59:27] (CR) Andrew Bogott: [C:+2] eqiad1 cloudceph: upgrade one osd and one mon node to ceph 'quincy' [puppet] - https://gerrit.wikimedia.org/r/1179709 (https://phabricator.wikimedia.org/T402190) (owner: Andrew Bogott) [14:59:56] !log tchin@deploy1003 helmfile [staging] START helmfile.d/services/eventstreams: apply [15:00:07] !log tchin@deploy1003 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [15:00:21] !log tchin@deploy1003 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [15:01:09] !log tchin@deploy1003 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [15:01:58] (CR) Herron: [C:+1] logstash: remove udp in error alerts [alerts] - https://gerrit.wikimedia.org/r/1179221 (owner: Cwhite) [15:03:03] (PS1) Bking: Revert "cirrussearch: Fix logstash/log4j config" [puppet] - https://gerrit.wikimedia.org/r/1179718 [15:03:15] (CR) Ayounsi: [C:+1] 
ganeti-routed: Enable bird component for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179706 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [15:03:22] (CR) Herron: [C:+1] DiskSpace: add DiskSpace critical alert [alerts] - https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite) [15:03:37] FIRING: [3x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/3 (Core: lsw1-e4-codfw:ethernet-1/55 {#130117100037}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [15:04:19] (CR) Herron: [C:+1] resources: Exclude docker|containerd|kubelet mounts from alerts [alerts] - https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite) [15:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [15:05:01] !log tchin@deploy1003 helmfile [codfw] START helmfile.d/services/eventstreams: apply [15:05:27] jouncebot: nowandnext [15:05:28] No deployments scheduled for the next 0 hour(s) and 24 minute(s) [15:05:28] In 0 hour(s) and 24 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1530) [15:05:37] (Abandoned) Bking: Revert "cirrussearch: Fix logstash/log4j config" [puppet] - https://gerrit.wikimedia.org/r/1179718 (owner: Bking) [15:05:46] !log tchin@deploy1003 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [15:06:03] (PS1) Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta [puppet] - 
https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) [15:06:15] (PS1) Bking: opensearch: Fix file task [puppet] - https://gerrit.wikimedia.org/r/1179720 (https://phabricator.wikimedia.org/T395571) [15:06:51] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179720 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [15:07:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:07:47] (PS9) Arnaudb: nftables: throttle debugging [puppet] - https://gerrit.wikimedia.org/r/1178880 (https://phabricator.wikimedia.org/T400971) [15:07:47] (CR) Arnaudb: "I've been testing the gerrit thresholds to avoid degrading UX upon merge." [puppet] - https://gerrit.wikimedia.org/r/1178880 (https://phabricator.wikimedia.org/T400971) (owner: Arnaudb) [15:08:37] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81441 and previous config saved to /var/cache/conftool/dbconfig/20250818-150958-fceratto.json [15:12:42] (CR) Bking: [C:+2] opensearch: Fix file task [puppet] - https://gerrit.wikimedia.org/r/1179720 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [15:12:45] FIRING: [2x] WidespreadPuppetFailure: Puppet has failed in codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:13:14] (CR) Bking: [C:+2] "self-merging, review took place in I587235fb9937958115bb3637ca7a29028e801211" [puppet] - https://gerrit.wikimedia.org/r/1179720 (https://phabricator.wikimedia.org/T395571) (owner: Bking) [15:13:25] SRE, SRE-Access-Requests: Requesting access to analytics-wmde-users and analytics-privatedata-users for dang - https://phabricator.wikimedia.org/T402191#11094628 (conny-kawohl_WMDE) I approve this request by @dang (I am the Engineering Manager at Wikibase Suite and am helping out in Sowmya's absence) [15:17:46] !log mszabo Deployed security patch for T400892 [15:21:05] (CR) Lucas Werkmeister (WMDE): [C:+1] mediawiki: Remove unused wikidata.org vhost and fix redirect in beta (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: Krinkle) [15:21:58] (PS1) Muehlenhoff: bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) [15:23:37] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:27] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply logging config change - bking@cumin1002 - T395571 [15:24:31] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [15:24:31] (PS1) Stevemunene: dns: Define a DNS A record for the dse-k8s-codfw ingress [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) [15:25:06] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P81442 and previous config saved to /var/cache/conftool/dbconfig/20250818-152505-fceratto.json [15:26:16] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [15:26:46] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1179724 (https://phabricator.wikimedia.org/T128546) [15:27:18] (CR) Stevemunene: "Marked as active on netbox with the accompanying change on dns here If4d0c7fe4a92ffd4f6b7a0e1ad703b77c40322e9" [puppet] - https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [15:29:24] (CR) Btullis: [C:-1] "You have only defined the PTR record here, not the A record as specified in the commit message." [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [15:30:05] jan_drewniak: gettimeofday() says it's time for Wikimedia Portals Update. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1530) [15:30:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-backup-datanode1037.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:30:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [15:30:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1004.eqiad.wmnet with OS bookworm [15:31:24] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster relforge: apply logging config change - bking@cumin1002 - T395571 [15:31:26] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1037.eqiad.wmnet with OS bookworm [15:31:27] (PS1) Daimona Eaytoy: Fix type declaration for nonexistent event cache [extensions/CampaignEvents] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179725 (https://phabricator.wikimedia.org/T401952) [15:31:28] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [15:31:42] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1036.eqiad.wmnet with OS bookworm [15:32:10] (PS2) Stevemunene: dns: Define a DNS PTR record for the dse-k8s-codfw ingress [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) [15:32:39] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply logging config change - bking@cumin1002 - T395571 [15:32:45] RESOLVED: [2x] WidespreadPuppetFailure: Puppet has failed in 
codfw - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:32:50] Hi folks, I made a backport for T401952 (UBN). Would it be possible to get it deployed? (I'm not a deployer) [15:32:51] T401952: TypeError: MediaWiki\Extension\CampaignEvents\Event\Store\EventStore::MediaWiki\Extension\CampaignEvents\Event\Store\{closure}(): Argument #1 ($oldValue) must be of type MediaWiki\Extension\CampaignEvents\Event\ExistingEventReg - https://phabricator.wikimedia.org/T401952 [15:32:54] T397072 [15:32:55] T397072: Investigate significant changeprop backlogs after PCS migration - https://phabricator.wikimedia.org/T397072 [15:33:03] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1069.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [15:33:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: an-worker1069.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [15:33:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:33:09] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-worker1069.eqiad.wmnet [15:34:02] Daimona: Hi, I have a backport scheduled right now, but I can take a look after I'm done. [15:34:14] That'd be amazing, thank you! 
[15:36:01] (CR) Jdrewniak: [C:+2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1179724 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [15:36:10] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [15:37:07] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/1179724 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [15:40:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T399249)', diff saved to https://phabricator.wikimedia.org/P81443 and previous config saved to /var/cache/conftool/dbconfig/20250818-154012-fceratto.json [15:40:17] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2158.codfw.wmnet with reason: Maintenance [15:40:17] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [15:40:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T399249)', diff saved to https://phabricator.wikimedia.org/P81444 and previous config saved to /var/cache/conftool/dbconfig/20250818-154024-fceratto.json [15:41:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399249)', diff saved to https://phabricator.wikimedia.org/P81445 and previous config saved to /var/cache/conftool/dbconfig/20250818-154134-fceratto.json [15:41:58] (CR) Stevemunene: "Updated to reflect this." 
[dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [15:41:58] btullis@cumin1003 netbox (PID 2617697) is awaiting input [15:44:47] (CR) Arlolra: [C:+2] mobileapps: Change max_body_size to 2mb from the 100kb default [deployment-charts] - https://gerrit.wikimedia.org/r/1179235 (https://phabricator.wikimedia.org/T398838) (owner: Arlolra) [15:46:36] (Merged) jenkins-bot: mobileapps: Change max_body_size to 2mb from the 100kb default [deployment-charts] - https://gerrit.wikimedia.org/r/1179235 (https://phabricator.wikimedia.org/T398838) (owner: Arlolra) [15:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [15:51:41] (PS4) Ebernhardson: flink chart: Add a comment label [deployment-charts] - https://gerrit.wikimedia.org/r/1179701 [15:53:32] (CR) DCausse: [C:+1] flink chart: Add a comment label [deployment-charts] - https://gerrit.wikimedia.org/r/1179701 (owner: Ebernhardson) [15:54:24] !log jdrewniak@deploy1003 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1179724| Bumping portals to master (T128546)]] (duration: 07m 34s) [15:54:28] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:54:42] (PS1) FNegri: aptrepo: import wikireplicas-utils from apt-staging [puppet] - https://gerrit.wikimedia.org/r/1179728 (https://phabricator.wikimedia.org/T395266) [15:55:44] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1069 to an-backup-datanode1005 - btullis@cumin1003" [15:55:53] !log 
btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-worker1069 to an-backup-datanode1005 - btullis@cumin1003" [15:55:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:56:11] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1005 [15:56:15] !log jdrewniak@deploy1003 Synchronized portals: Wikimedia Portals Update: [[gerrit:1179724| Bumping portals to master (T128546)]] (duration: 01m 49s) [15:56:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81446 and previous config saved to /var/cache/conftool/dbconfig/20250818-155642-fceratto.json [15:57:11] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1037.eqiad.wmnet with reason: host reimage [15:57:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1005 [15:57:44] (PS2) FNegri: aptrepo: import wikireplicas-utils from apt-staging [puppet] - https://gerrit.wikimedia.org/r/1179728 (https://phabricator.wikimedia.org/T395266) [15:58:47] (PS1) RLazarus: Revert "shellbox-constraints: Bump replicas from 10 to 20 for traffic increase" [deployment-charts] - https://gerrit.wikimedia.org/r/1179730 [15:59:00] (CR) Btullis: "I think that you need both the A and the PTR records, don't you?" [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [15:59:05] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1036.eqiad.wmnet with reason: host reimage [16:00:06] hey Daimona: I'm ready to backport your patch [16:00:48] Thank you! I'm around but in a call, and not great at multitasking. 
Just ping me if I seem to have disappeared :) [16:00:49] (PS2) RLazarus: Revert "shellbox-constraints: Bump replicas from 10 to 20 for traffic increase" [deployment-charts] - https://gerrit.wikimedia.org/r/1179730 [16:01:31] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (1 nodes at a time) for ElasticSearch cluster cloudelastic: apply logging config change - bking@cumin1002 - T395571 [16:01:36] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [16:01:54] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-backup-datanode1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:02:46] Daimona: just to verify, this is the patch right? https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CampaignEvents/+/1179725 [16:03:12] Yup [16:03:54] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1037.eqiad.wmnet with reason: host reimage [16:03:54] (CR) TrainBranchBot: [C:+2] "Approved by jdrewniak@deploy1003 using scap backport" [extensions/CampaignEvents] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179725 (https://phabricator.wikimedia.org/T401952) (owner: Daimona Eaytoy) [16:04:03] (CR) Huei Tan: MinT: Add stream configuration and registration (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1179120 (https://phabricator.wikimedia.org/T397600) (owner: Huei Tan) [16:05:31] (CR) Scott French: [C:+1] Revert "shellbox-constraints: Bump replicas from 10 to 20 for traffic increase" [deployment-charts] - https://gerrit.wikimedia.org/r/1179730 (owner: RLazarus) [16:05:45] (Merged) jenkins-bot: Fix type declaration for nonexistent event cache [extensions/CampaignEvents] (wmf/1.45.0-wmf.14) - https://gerrit.wikimedia.org/r/1179725 
(https://phabricator.wikimedia.org/T401952) (owner: Daimona Eaytoy) [16:05:56] (PS1) Andrew Bogott: cloudcephmon1004 back to Pacific [puppet] - https://gerrit.wikimedia.org/r/1179731 (https://phabricator.wikimedia.org/T402190) [16:06:04] !log jdrewniak@deploy1003 Started scap sync-world: Backport for [[gerrit:1179725|Fix type declaration for nonexistent event cache (T401952)]] [16:06:09] T401952: TypeError: MediaWiki\Extension\CampaignEvents\Event\Store\EventStore::MediaWiki\Extension\CampaignEvents\Event\Store\{closure}(): Argument #1 ($oldValue) must be of type MediaWiki\Extension\CampaignEvents\Event\ExistingEventReg - https://phabricator.wikimedia.org/T401952 [16:06:34] (CR) Andrew Bogott: [C:+2] cloudcephmon1004 back to Pacific [puppet] - https://gerrit.wikimedia.org/r/1179731 (https://phabricator.wikimedia.org/T402190) (owner: Andrew Bogott) [16:07:55] !log jdrewniak@deploy1003 daimona, jdrewniak: Backport for [[gerrit:1179725|Fix type declaration for nonexistent event cache (T401952)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [16:08:30] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1036.eqiad.wmnet with reason: host reimage [16:10:36] daimona: any way to test on mwdeb? or should I go straight to sync? [16:11:01] I'm not sure how to test, I believe you can proceed. I'll keep an eye on logstash. Thanks again! 
[16:11:08] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-backup-datanode1005.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:11:41] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1005.eqiad.wmnet with OS bookworm [16:11:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P81447 and previous config saved to /var/cache/conftool/dbconfig/20250818-161149-fceratto.json [16:11:50] !log jdrewniak@deploy1003 daimona, jdrewniak: Continuing with sync [16:13:35] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts analytics1070.eqiad.wmnet [16:15:34] (CR) Ssingh: [C:+1] bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [16:17:18] ops-eqiad, SRE, DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11094894 (Jclark-ctr) ` Hi team. This is Esteban Morales from the EX/QFX advanced team. This ticket was assigned to me. The alarms in question have been addressed in an existing PR.... 
[16:17:18] !log jdrewniak@deploy1003 Finished scap sync-world: Backport for [[gerrit:1179725|Fix type declaration for nonexistent event cache (T401952)]] (duration: 11m 13s) [16:17:22] T401952: TypeError: MediaWiki\Extension\CampaignEvents\Event\Store\EventStore::MediaWiki\Extension\CampaignEvents\Event\Store\{closure}(): Argument #1 ($oldValue) must be of type MediaWiki\Extension\CampaignEvents\Event\ExistingEventReg - https://phabricator.wikimedia.org/T401952 [16:18:09] (PS2) Muehlenhoff: bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) [16:18:19] (CR) Muehlenhoff: bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [16:18:45] FIRING: Emergency syslog message: Alert for device ssw1-f1-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:20:08] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:20:39] (CR) Ssingh: [C:+1] "Looks good, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: Muehlenhoff) [16:20:42] (CR) Slyngshede: [C:+1] Also update tracked email address [puppet] - https://gerrit.wikimedia.org/r/1179712 (https://phabricator.wikimedia.org/T401882) (owner: Muehlenhoff) [16:20:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1037.eqiad.wmnet with OS bookworm [16:21:48] !log bking@cumin1002 START - Cookbook sre.opensearch.roll-restart-reboot rolling restart_daemons on A:datahubsearch [16:22:32] !log btullis@cumin1003 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:22:33] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts analytics1070.eqiad.wmnet [16:25:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1036.eqiad.wmnet with OS bookworm [16:25:54] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [16:26:53] jan_drewniak: still working or all done? I'll sneak out https://gerrit.wikimedia.org/r/1179730 if the floor is clear [16:26:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T399249)', diff saved to https://phabricator.wikimedia.org/P81448 and previous config saved to /var/cache/conftool/dbconfig/20250818-162656-fceratto.json [16:27:01] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [16:27:13] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance [16:27:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T399249)', diff saved to https://phabricator.wikimedia.org/P81449 and previous config saved to /var/cache/conftool/dbconfig/20250818-162720-fceratto.json [16:27:34] Daimona, rzl: yup all done [16:27:38] thanks! 
[16:27:51] (CR) RLazarus: [C:+2] Revert "shellbox-constraints: Bump replicas from 10 to 20 for traffic increase" [deployment-charts] - https://gerrit.wikimedia.org/r/1179730 (owner: RLazarus) [16:28:45] RESOLVED: Emergency syslog message: Device ssw1-f1-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [16:29:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399249)', diff saved to https://phabricator.wikimedia.org/P81450 and previous config saved to /var/cache/conftool/dbconfig/20250818-162930-fceratto.json [16:29:50] !log bking@cumin1002 END (PASS) - Cookbook sre.opensearch.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:datahubsearch [16:29:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-backup-namenode1035 to an-backup-datanode1035 - btullis@cumin1003" [16:29:56] (Merged) jenkins-bot: Revert "shellbox-constraints: Bump replicas from 10 to 20 for traffic increase" [deployment-charts] - https://gerrit.wikimedia.org/r/1179730 (owner: RLazarus) [16:30:19] I'd like to deploy mobileapps, should I wait until rzl is done? [16:30:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-backup-namenode1035 to an-backup-datanode1035 - btullis@cumin1003" [16:30:31] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:30:38] Thanks! I just checked logstash and can confirm that error rate went to 0. 
[16:31:05] arlolra: almost certainly no conflict but let me finish up first if you don't mind -- won't be more than a few minutes [16:31:22] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1006 [16:31:47] rzl: no rush [16:31:54] !log rzl@deploy1003 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [16:32:10] !log rzl@deploy1003 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [16:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [16:32:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1006 [16:33:39] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1035 [16:33:45] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1005.eqiad.wmnet with reason: host reimage [16:34:14] !log rzl@deploy1003 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [16:34:18] !log rzl@deploy1003 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [16:34:51] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1035 [16:35:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1034.eqiad.wmnet with OS bookworm [16:35:32] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1035.eqiad.wmnet with OS bookworm [16:36:02] !log btullis@cumin1003 START - Cookbook sre.hosts.provision for host an-backup-datanode1006.mgmt.eqiad.wmnet with chassis set policy 
FORCE_RESTART and with Dell SCP reboot policy FORCED [16:36:09] arlolra: all yours, thanks! [16:37:18] Thanks [16:38:44] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [16:39:32] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1005.eqiad.wmnet with reason: host reimage [16:40:09] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply logging config change - bking@cumin1002 - T395571 [16:40:13] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [16:41:31] btullis@cumin1003 provision (PID 2628297) is awaiting input [16:42:55] !log arlolra@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [16:42:57] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply logging config change - bking@cumin1002 - T395571 [16:44:08] !log arlolra@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [16:44:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81451 and previous config saved to /var/cache/conftool/dbconfig/20250818-164437-fceratto.json [16:47:36] btullis@cumin1003 provision (PID 2628297) is awaiting input [16:49:55] !log bking@cumin1002 conftool action : set/weight=10; selector: name=cirrussearch2091. 
[16:50:04] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply logging config change - bking@cumin1002 - T395571 [16:50:08] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [16:51:13] !log arlolra@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [16:52:04] !log arlolra@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [16:55:10] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [16:56:48] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [16:57:32] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [16:58:15] btullis@cumin1003 reimage (PID 2621586) is awaiting input [16:59:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P81452 and previous config saved to /var/cache/conftool/dbconfig/20250818-165945-fceratto.json [17:00:04] swfrench-wmf: How many deployers does it take to do MediaWiki infrastructure (UTC late) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1700). [17:00:05] ryankemper: May I have your attention please! Wikidata Query Service weekly deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T1700) [17:00:09] o/ [17:00:40] !log arlolra@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:01:12] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1034.eqiad.wmnet with reason: host reimage [17:01:23] dancy: if you're around and available to deploy scap 4.202.0, that would be swell [17:01:38] OK [17:01:57] I've merged https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/203, so that's safe to proceed [17:02:05] !log arlolra@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:02:11] All done [17:02:18] !log dancy@deploy1003 Installing scap version "4.202.0" for 2 host(s) [17:03:00] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1035.eqiad.wmnet with reason: host reimage [17:04:07] !log dancy@deploy1003 Installation of scap version "4.202.0" completed for 2 hosts [17:04:22] swfrench-wmf: done [17:04:55] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1034.eqiad.wmnet with reason: host reimage [17:05:01] dancy: awesome. thank you! 
I'll get this going momentarily [17:05:23] !log swfrench@deploy1003 Started scap sync-world: Non-deploy scap run to verify image build and dependent helmfile values - T401721 [17:05:28] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [17:08:27] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1035.eqiad.wmnet with reason: host reimage [17:10:43] (PS1) Andrew Bogott: Magnum: configure helm chart repo with hiera [puppet] - https://gerrit.wikimedia.org/r/1179736 [17:11:10] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1179736 (owner: Andrew Bogott) [17:14:21] (CR) Andrew Bogott: [C:+2] Magnum: configure helm chart repo with hiera [puppet] - https://gerrit.wikimedia.org/r/1179736 (owner: Andrew Bogott) [17:14:53] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T399249)', diff saved to https://phabricator.wikimedia.org/P81453 and previous config saved to /var/cache/conftool/dbconfig/20250818-171452-fceratto.json [17:14:57] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [17:15:08] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2180.codfw.wmnet with reason: Maintenance [17:15:15] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2180 (T399249)', diff saved to https://phabricator.wikimedia.org/P81454 and previous config saved to /var/cache/conftool/dbconfig/20250818-171515-fceratto.json [17:15:39] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[17:17:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399249)', diff saved to https://phabricator.wikimedia.org/P81455 and previous config saved to /var/cache/conftool/dbconfig/20250818-171725-fceratto.json [17:17:42] RECOVERY - Disk space on an-druid1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1001&var-datasource=eqiad+prometheus/ops [17:19:34] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-backup-datanode1006.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [17:19:36] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [17:19:37] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1005.eqiad.wmnet with OS bookworm [17:21:05] (PS1) David Caro: harbor: add docker-cli [puppet] - https://gerrit.wikimedia.org/r/1179737 [17:21:05] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1006.eqiad.wmnet with OS bookworm [17:21:38] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1034.eqiad.wmnet with OS bookworm [17:23:50] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1032.eqiad.wmnet with OS bookworm [17:24:13] !log swfrench@deploy1003 Stopping before sync operations [17:24:22] (CR) Raymond Ndibe: [C:+1] harbor: add docker-cli [puppet] - https://gerrit.wikimedia.org/r/1179737 (owner: David Caro) [17:24:34] (CR) David Caro: [C:+2] harbor: add docker-cli [puppet] - https://gerrit.wikimedia.org/r/1179737 (owner: David Caro) [17:25:36] !log stevemunene@cumin1003 START - Cookbook 
sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [17:25:47] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1035.eqiad.wmnet with OS bookworm [17:27:32] RECOVERY - Disk space on an-druid1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1002&var-datasource=eqiad+prometheus/ops [17:30:28] !log swfrench@deploy1003 Started scap sync-world: Deploy new images after verifying dependent helmfile values - T401721 [17:30:32] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [17:32:33] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81456 and previous config saved to /var/cache/conftool/dbconfig/20250818-173232-fceratto.json [17:34:19] (CR) Ssingh: [C:+1] "Looks good and so does the linked DNS change." [puppet] - https://gerrit.wikimedia.org/r/1178834 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [17:35:12] (CR) Ssingh: "An A record is certainly required. Once that is fixed, please merge this and run authdns-update to roll it out. Please ping me if I can he" [dns] - https://gerrit.wikimedia.org/r/1179723 (https://phabricator.wikimedia.org/T397298) (owner: Stevemunene) [17:38:54] (PS1) Herron: pyrra: logstash-requests add version [puppet] - https://gerrit.wikimedia.org/r/1179739 [17:43:17] (CR) Herron: [C:+2] pyrra: logstash-requests add version [puppet] - https://gerrit.wikimedia.org/r/1179739 (owner: Herron) [17:44:34] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1006.eqiad.wmnet with reason: host reimage [17:47:37] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" 
[alerts] - https://gerrit.wikimedia.org/r/1178983 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite) [17:47:40] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P81457 and previous config saved to /var/cache/conftool/dbconfig/20250818-174740-fceratto.json [17:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:48:43] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1006.eqiad.wmnet with reason: host reimage [17:49:22] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1032.eqiad.wmnet with reason: host reimage [17:52:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1032.eqiad.wmnet with reason: host reimage [17:55:02] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" [alerts] - https://gerrit.wikimedia.org/r/1178979 (https://phabricator.wikimedia.org/T332764) (owner: Cwhite) [17:55:43] (CR) Andrea Denisse: [C:+1] "LGTM, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1179221 (owner: 10Cwhite) [17:56:04] (03CR) 10Andrea Denisse: [C:03+2] centrallog: Remove unused debug logging config [puppet] - 10https://gerrit.wikimedia.org/r/1179228 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [17:58:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.022s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:01:27] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:01:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [18:02:21] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts analytics1071.eqiad.wmnet [18:02:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T399249)', diff saved to https://phabricator.wikimedia.org/P81458 and previous config saved to /var/cache/conftool/dbconfig/20250818-180247-fceratto.json [18:02:52] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2193.codfw.wmnet with reason: Maintenance [18:02:52] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:03:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2193 (T399249)', diff saved to https://phabricator.wikimedia.org/P81459 and previous config saved to /var/cache/conftool/dbconfig/20250818-180259-fceratto.json [18:03:15] RESOLVED: 
MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.395s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:03:52] !log swfrench@deploy1003 Finished scap sync-world: Deploy new images after verifying dependent helmfile values - T401721 (duration: 36m 38s) [18:03:56] T401721: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721 [18:04:53] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [18:05:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399249)', diff saved to https://phabricator.wikimedia.org/P81460 and previous config saved to /var/cache/conftool/dbconfig/20250818-180509-fceratto.json [18:06:03] btullis@cumin1003 decommission (PID 2646879) is awaiting input [18:06:57] after a bit of a delay, I'm done with the infra window [18:07:12] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [18:07:13] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1006.eqiad.wmnet with OS bookworm [18:07:19] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed an-backup-namenode1033 to an-backup-datanode1033 - btullis@cumin1003" [18:07:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: Renamed an-backup-namenode1033 to an-backup-datanode1033 - btullis@cumin1003" [18:07:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:07:36] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1033 [18:07:55] !log btullis@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-backup-datanode1033 [18:08:58] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1032.eqiad.wmnet with OS bookworm [18:09:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1033.eqiad.wmnet with OS bookworm [18:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:04] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:13:37] FIRING: SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:48] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:14:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics1071.eqiad.wmnet [18:15:02] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [18:15:43] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[18:17:59] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:20:09] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [18:20:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81461 and previous config saved to /var/cache/conftool/dbconfig/20250818-182017-fceratto.json [18:23:02] (03PS1) 10Andrew Bogott: magnum codfw1dev: get initial charts from codfw1dev object storage [puppet] - 10https://gerrit.wikimedia.org/r/1179745 [18:24:33] RESOLVED: SystemdUnitFailed: push_cross_cluster_settings_9400.service on cirrussearch2092:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:25:52] btullis@cumin1003 netbox (PID 2648411) is awaiting input [18:26:09] (03PS3) 10CDanis: [WIP] haproxy: silent-drop as early as possible [puppet] - 10https://gerrit.wikimedia.org/r/1176302 [18:27:59] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on cirrussearch2092 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:28:41] (03CR) 10Andrew Bogott: [C:03+2] magnum codfw1dev: get initial charts from codfw1dev object storage [puppet] - 10https://gerrit.wikimedia.org/r/1179745 (owner: 10Andrew Bogott) [18:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:33:05] (03CR) 10Ejegg: 
[C:03+1] Remove $wgCentralNoticeESITestString [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1173360 (https://phabricator.wikimedia.org/T400472) (owner: 10R4356thwiki) [18:33:12] !log bking@cumin1002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_codfw: apply logging config change - bking@cumin1002 - T395571 [18:33:17] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [18:33:48] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [18:35:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P81462 and previous config saved to /var/cache/conftool/dbconfig/20250818-183524-fceratto.json [18:35:41] !log stevemunene@cumin1003 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[18:41:44] jouncebot: nowandnext [18:41:44] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [18:41:45] In 1 hour(s) and 18 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2000) [18:48:04] !log swfrench@deploy1003 Started scap sync-world: Test deploy to investigate spurious full builds [18:50:05] !log swfrench@deploy1003 Finished scap sync-world: Test deploy to investigate spurious full builds (duration: 02m 19s) [18:50:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.238s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:50:23] RECOVERY - Disk space on an-druid1004 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-druid1004&var-datasource=eqiad+prometheus/ops [18:50:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T399249)', diff saved to https://phabricator.wikimedia.org/P81464 and previous config saved to /var/cache/conftool/dbconfig/20250818-185031-fceratto.json [18:50:36] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [18:50:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2197.codfw.wmnet with reason: Maintenance [18:50:59] (03PS1) 10CDanis: haproxy: maxconn for varnish threads limit [puppet] - 10https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) [18:51:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2214.codfw.wmnet with reason: 
Maintenance [18:51:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2214 (T399249)', diff saved to https://phabricator.wikimedia.org/P81465 and previous config saved to /var/cache/conftool/dbconfig/20250818-185111-fceratto.json [18:51:29] (03PS1) 10Bernard Wang: Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 [18:52:17] (03CR) 10CI reject: [V:04-1] Update vector search config with new wgVectorTypeahead [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179750 (owner: 10Bernard Wang) [18:53:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T399249)', diff saved to https://phabricator.wikimedia.org/P81466 and previous config saved to /var/cache/conftool/dbconfig/20250818-185322-fceratto.json [18:56:46] !log stevemunene@cumin1003 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [18:57:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:02:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:02:47] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed analytics1071 to an-backup-datanode1007 - btullis@cumin1003" [19:02:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera 
(exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renamed analytics1071 to an-backup-datanode1007 - btullis@cumin1003" [19:02:53] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:02:56] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1033 [19:04:01] !log btullis@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host an-backup-datanode1007 [19:04:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1033 [19:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [19:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [19:05:19] !log btullis@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-backup-datanode1007 [19:06:52] (03CR) 10Ssingh: "Looking pretty good. Just one last comment and I think we can merge it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [19:07:36] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1007.eqiad.wmnet with OS bookworm [19:08:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81467 and previous config saved to /var/cache/conftool/dbconfig/20250818-190830-fceratto.json [19:10:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.206s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:13:42] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-backup-datanode1033.eqiad.wmnet with OS bookworm [19:14:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host an-backup-datanode1033.eqiad.wmnet with OS bookworm [19:14:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.169s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:15:00] !log bking@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=search,name=eqiad [19:15:09] (03CR) 10Ayounsi: [C:03+1] bird::anycast: Add a parameter to install the Bird enabled for routed Ganeti [puppet] - 10https://gerrit.wikimedia.org/r/1179722 (https://phabricator.wikimedia.org/T362392) (owner: 
10Muehlenhoff) [19:16:10] !log btullis@cumin1003 START - Cookbook sre.hosts.decommission for hosts analytics[1072-1077].eqiad.wmnet [19:19:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.169s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:19:18] (03CR) 10Ssingh: haproxy: maxconn for varnish threads limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) (owner: 10CDanis) [19:19:52] btullis@cumin1003 decommission (PID 2654313) is awaiting input [19:22:23] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 55 hosts with reason: T395571 [19:22:27] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [19:23:03] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11095634 (10Jclark-ctr) 05Open→03Resolved @ayounsi @Papaul @cmooney I am closing this ticket since Juniper has advised that the OS must be upgraded to one of the versions listed below in or... 
[19:23:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214', diff saved to https://phabricator.wikimedia.org/P81469 and previous config saved to /var/cache/conftool/dbconfig/20250818-192337-fceratto.json [19:24:15] !log bking@cumin1002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply logging config change - bking@cumin1002 - T395571 [19:29:53] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402099#11095655 (10phaultfinder) [19:30:10] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on an-backup-datanode1007.eqiad.wmnet with reason: host reimage [19:33:21] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-backup-datanode1007.eqiad.wmnet with reason: host reimage [19:35:11] 10ops-eqiad, 06SRE, 06DC-Ops: ssw1-f1-eqiad: Fan Spinning Upgraded - https://phabricator.wikimedia.org/T400783#11095671 (10Papaul) The switch is running ` Junos: 22.2R3.15 ` and the recommended version as for 2025-02-08 is ` 23.4R2-Sx [19:36:39] (03CR) 10CDanis: haproxy: maxconn for varnish threads limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) (owner: 10CDanis) [19:38:24] (03PS1) 10Scott French: Add support for structured provenance patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1179751 (https://phabricator.wikimedia.org/T401430) [19:38:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2214 (T399249)', diff saved to https://phabricator.wikimedia.org/P81470 and previous config saved to /var/cache/conftool/dbconfig/20250818-193844-fceratto.json [19:38:50] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [19:39:00] !log fceratto@cumin1002 DONE 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2217.codfw.wmnet with reason: Maintenance [19:39:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2217 (T399249)', diff saved to https://phabricator.wikimedia.org/P81471 and previous config saved to /var/cache/conftool/dbconfig/20250818-193907-fceratto.json [19:39:12] btullis@cumin1003 decommission (PID 2654313) is awaiting input [19:40:07] (03CR) 10Scott French: [V:03+2 C:03+2] Add support for structured provenance patterns [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1179751 (https://phabricator.wikimedia.org/T401430) (owner: 10Scott French) [19:40:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T399249)', diff saved to https://phabricator.wikimedia.org/P81472 and previous config saved to /var/cache/conftool/dbconfig/20250818-194017-fceratto.json [19:42:39] !log btullis@cumin1003 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-backup-datanode1032.eqiad.wmnet [19:42:41] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts an-backup-datanode1032.eqiad.wmnet [19:43:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:44:52] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-test-coord1002.eqiad.wmnet with reason: supermicro [19:45:05] !log btullis@cumin1003 START - Cookbook sre.dns.netbox [19:45:11] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11095772 (10Josve05a) Yet another report Ticket#2025081810008602: * Mozilla/5.0 
(Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36 * E... [19:45:48] !log swfrench@cumin2002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Deploy support for structured provenance patterns - swfrench@cumin2002 - T401430" [19:45:51] !log swfrench@cumin2002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy support for structured provenance patterns - swfrench@cumin2002 - T401430 [19:45:52] T401430: Introduce structured provenance patterns - https://phabricator.wikimedia.org/T401430 [19:46:38] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Deploy support for structured provenance patterns - swfrench@cumin2002 - T401430 [19:46:40] !log swfrench@cumin2002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Deploy support for structured provenance patterns - swfrench@cumin2002 - T401430" [19:48:09] (03CR) 10Ebernhardson: [C:03+2] flink chart: Add a comment label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179701 (owner: 10Ebernhardson) [19:48:41] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [19:50:21] (03Merged) 10jenkins-bot: flink chart: Add a comment label [deployment-charts] - 10https://gerrit.wikimedia.org/r/1179701 (owner: 10Ebernhardson) [19:50:51] btullis@cumin1003 decommission (PID 2654313) is awaiting input [19:51:16] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11095786 (10Josve05a) [19:51:45] btullis@cumin1003 reimage (PID 2653913) is awaiting input [19:54:56] 10ops-codfw, 06DC-Ops: Alert for device 
ps1-c4-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402221 (10phaultfinder) 03NEW [19:54:58] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402099#11095828 (10phaultfinder) [19:55:25] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P81473 and previous config saved to /var/cache/conftool/dbconfig/20250818-195524-fceratto.json [19:59:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [19:59:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2000). [20:00:05] No Gerrit patches in the queue for this window AFAICS. [20:03:34] _joe_: re T402142: I'm not sure I can gather better data than what I've now done. It depends on the technical knowledge (both mine and) of the people emailing VRT and their willingness to cooperate. If there's something specific you'd like me to ask them (like trying a special debug link), please let me know. 
I don't really know what else to collect myself, and I don't want to overwhelm them with overly technical requests that [20:03:34] might make them stop replying. Hope you'll figure it out! [20:03:35] T402142: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142 [20:04:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr3-eqsin and Hurricane Electric (2001:de8:4::6939:1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [20:04:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:xe-0/1/3 (Peering: Equinix (Wikimedia-SG1-IX-00 Singapore, MAC filter) {#1016}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-eqsin:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:37:58] <_joe_> Josve05a: To give you an example, see https://en.wikipedia.org/api/rest_v1/page/data-parsoid (which we've deprecated) [20:37:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399249)', diff saved to https://phabricator.wikimedia.org/P81477 and previous config saved to /var/cache/conftool/dbconfig/20250818-202712-fceratto.json [20:37:58] _joe_: In the last comment I did ask ("Is there an error code at the bottom of that screen? If you right click and open "Inspect" are there any errors in the "Console"?"); their only reply was "This is the error code: Failed to load resource: the server responded with a status of 503 ()" [20:37:58] I've seen such varnish codes many times myself for certain issues, but my question is: are they always there?
[20:37:58] (03PS2) 10CDanis: haproxy: maxconn for varnish threads limit [puppet] - 10https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) [20:37:58] given that no one seems to have sent a full screenshot, only the cropped "message" [20:37:59] <_joe_> ok yhsnkd Josve05a - 503 typically means a problem in the application [20:37:59] <_joe_> *thanks [20:37:59] * Josve05a started Googling if yhsnkd was a new technical term.... [20:37:59] <_joe_> so my guess was right - these are errors most likely sent by MediaWiki [20:37:59] (03PS11) 10CDanis: varnish: refactor inclusion of requestctl rules [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [20:37:59] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1175841 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [20:37:59] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6623/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [20:38:00] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:38:00] (03CR) 10CDobbins: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6624/console" [puppet] - 10https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: 10CDobbins) [20:38:00] 10ops-codfw, 06SRE, 06DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T402099#11096030 (10phaultfinder) [20:38:00] (03PS8) 10Krinkle: varnish: Allow customising "contact noc@" error [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [20:38:00] (03CR) 10Krinkle: "Rebased to resolve conflict with https://gerrit.wikimedia.org/r/c/operations/puppet/+/1175990. https://phabricator.wikimedia.org/T402156" [puppet] - 10https://gerrit.wikimedia.org/r/1143602 (https://phabricator.wikimedia.org/T393404) (owner: 10Majavah) [20:38:00] (03CR) 10CDanis: haproxy: allow having multiple requestctl scopes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1179247 (https://phabricator.wikimedia.org/T399058) (owner: 10Giuseppe Lavagetto) [20:40:39] (03PS1) 10Majavah: hieradata: Add new codfw1dev bastion [puppet] - 10https://gerrit.wikimedia.org/r/1179756 [20:41:45] (03CR) 10Ssingh: [C:03+1] "Thanks, I am fine with either but I was mostly curious. 12k is fine to start with too and we can tune as required. Thanks for the patch!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1179749 (https://phabricator.wikimedia.org/T401695) (owner: 10CDanis) [20:42:16] (03CR) 10Krinkle: "I first had to resolve a conflict at https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143602 (11096035), but this is now live on beta" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [20:42:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81478 and previous config saved to /var/cache/conftool/dbconfig/20250818-204219-fceratto.json [20:42:53] (03PS2) 10Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) [20:42:55] (03CR) 10Krinkle: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: 10Krinkle) [20:44:26] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-backup-datanode1033.eqiad.wmnet with OS bookworm [20:44:38] (03CR) 10Ssingh: "For setting it in Wikidough, you will have to pass use_new_pdns_cfg in modules/profile/manifests/wikidough.pp, line 56 or about. 
Then run " [puppet] - https://gerrit.wikimedia.org/r/1169156 (https://phabricator.wikimedia.org/T381608) (owner: CDobbins) [20:44:50] (CR) Majavah: [C:+2] hieradata: Add new codfw1dev bastion [puppet] - https://gerrit.wikimedia.org/r/1179756 (owner: Majavah) [20:46:55] (CR) Krinkle: mediawiki: Remove unused wikidata.org vhost and fix redirect in beta (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1179719 (https://phabricator.wikimedia.org/T401592) (owner: Krinkle) [20:53:31] !log dancy@deploy1003 Installing scap version "4.203.0" for 169 host(s) [20:57:25] !log dancy@deploy1003 Installation of scap version "4.203.0" completed for 169 hosts [20:57:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224', diff saved to https://phabricator.wikimedia.org/P81479 and previous config saved to /var/cache/conftool/dbconfig/20250818-205726-fceratto.json [21:00:05] Reedy, sbassett, Maryum, and manfredi: May I have your attention please! Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2100) [21:00:10] ops-codfw, SRE, DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402099#11096081 (phaultfinder) [21:01:37] Looks like nothing really went out during the backport window? So we should be good to do some security deployments? [21:06:42] SRE, MediaWiki-extensions-QuickInstantCommons, MediaWiki-File-management, MediaWiki-Platform-Team, and 2 others: Make InstantCommons and other uses of ForeignApiRepo use WMF policy-compliant user agents - https://phabricator.wikimedia.org/T400881#11096139 (Tgr) Updated T400881#11072676 and the pa... 
[21:12:33] !log dancy@deploy1003 Started scap sync-world: testing [21:12:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2224 (T399249)', diff saved to https://phabricator.wikimedia.org/P81480 and previous config saved to /var/cache/conftool/dbconfig/20250818-211234-fceratto.json [21:12:39] T399249: Add cl_timestamp_id index to categorylinks table - https://phabricator.wikimedia.org/T399249 [21:13:28] …and scap is having some dockerd problems for now rn, so sec deployment is currently postponed [21:15:04] ops-codfw, SRE, DC-Ops: Alert for device ps1-b2-codfw.mgmt.codfw.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T402099#11096169 (phaultfinder) [21:17:21] !log dancy@deploy1003 sync-world aborted: testing (duration: 04m 48s) [21:18:57] !log dancy@deploy1003 dancy: testing synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:19:26] !log dancy@deploy1003 Sync cancelled. [21:23:51] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1107 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:51] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1109 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1089 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1125 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1094 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1101 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1069 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1080 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:23:59] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1083 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:00] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1090 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:00] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1072 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1121 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:01] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1103 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:02] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1108 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:02] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1070 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1079 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:03] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1078 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:04] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1096 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:04] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1091 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1077 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1085 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1071 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1086 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:09] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1084 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1095 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1113 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1111 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1117 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1088 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:11] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1123 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:13] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1116 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:13] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch https://search.svc.eqiad.wmnet:9243/_cluster/health error while fetching: HTTPSConnectionPool(host=search.svc.eqiad.wmnet, port=9243): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1097 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:19] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1093 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:21] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1068 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1112 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1110 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:23] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1114 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1099 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:27] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1124 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1115 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1092 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1122 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1073 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1082 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. 
(read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:39] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1087 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:24:41] PROBLEM - OpenSearch health check for shards on 9200 on cirrussearch1120 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Read timed out. (read timeout=4) https://wikitech.wikimedia.org/wiki/Search%23Administration [21:33:30] !log bking@cumin1002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: apply logging config change - bking@cumin1002 - T395571 [21:33:34] T395571: Verify/fix Logstash pipeline/log rotate for Search Platform-owned OpenSearch clusters - https://phabricator.wikimedia.org/T395571 [21:34:19] ^^ expected, will re-downtime [21:35:40] !log bking@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 55 hosts with reason: T395571 [21:42:07] !log zabe@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-experimental: apply [21:45:14] !log zabe@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-experimental: apply [21:45:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1107 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 59, [21:45:45] of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 7367, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:45] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1109 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 3, number_of_data_nodes: 3, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 59, [21:45:45] of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 7559, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1125 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:51] _of_in_flight_fetch: 7865, task_max_waiting_in_queue_millis: 304, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1080 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:51] _of_in_flight_fetch: 7865, task_max_waiting_in_queue_millis: 304, active_shards_percent_as_number: NaN 
https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1069 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:51] _of_in_flight_fetch: 5280, task_max_waiting_in_queue_millis: 303, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:51] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1089 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:52] _of_in_flight_fetch: 4290, task_max_waiting_in_queue_millis: 301, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:52] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1101 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:53] _of_in_flight_fetch: 7865, task_max_waiting_in_queue_millis: 304, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:53] RECOVERY - OpenSearch health check for 
shards on 9200 on cirrussearch1083 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:54] _of_in_flight_fetch: 7700, task_max_waiting_in_queue_millis: 306, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:45:54] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1094 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 0, active_shards: 0, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 2 [21:45:55] _of_in_flight_fetch: 4235, task_max_waiting_in_queue_millis: 301, active_shards_percent_as_number: NaN https://wikitech.wikimedia.org/wiki/Search%23Administration [21:46:45] FIRING: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [21:46:45] CirrusSearch consumer-search@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [21:46:53] FIRING: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[21:46:59] CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [21:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:48:19] !log dancy@deploy1003 Installing scap version "4.205.0" for 2 host(s) [21:50:06] !log dancy@deploy1003 Installation of scap version "4.205.0" completed for 2 hosts [21:51:45] RESOLVED: CirrusStreamingUpdaterSetWeightedTagsTooLow: ... [21:51:51] CirrusSearch consumer-search@eqiad is setting too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterSetWeightedTagsTooLow [21:52:02] RESOLVED: CirrusStreamingUpdaterClearWeightedTagsTooLow: ... 
[21:52:08] CirrusSearch consumer-search@eqiad is clearing too few weighted tags - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/fe251f4f-f6cf-4010-8d78-5f482255b16f/cirrussearch-update-pipeline-weighted-tags?orgId=1&var-tag_prefix=All&var-search_cluster_site=eqiad&var-search_cluster=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterClearWeightedTagsTooLow [21:53:05] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1003" [21:53:06] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-backup-datanode1007.eqiad.wmnet with OS bookworm [21:53:20] !log btullis@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics[1072-1077].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [21:53:24] !log btullis@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: analytics[1072-1077].eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1003" [21:53:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:53:25] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts analytics[1072-1077].eqiad.wmnet [21:57:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1072 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3892, relocating_shards: 0, initializing_shards: 24, unassigned_shards: 552, delayed_unassigned_shards: 0, 
number_of_pend [21:57:53] s: 2, number_of_in_flight_fetch: 110, task_max_waiting_in_queue_millis: 169, active_shards_percent_as_number: 87.10832587287378 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1121 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3892, relocating_shards: 0, initializing_shards: 24, unassigned_shards: 552, delayed_unassigned_shards: 0, number_of_pend [21:57:53] s: 2, number_of_in_flight_fetch: 55, task_max_waiting_in_queue_millis: 178, active_shards_percent_as_number: 87.10832587287378 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1108 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3892, relocating_shards: 0, initializing_shards: 24, unassigned_shards: 552, delayed_unassigned_shards: 0, number_of_pend [21:57:53] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 194, active_shards_percent_as_number: 87.10832587287378 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:53] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1090 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3922, relocating_shards: 0, initializing_shards: 36, unassigned_shards: 510, delayed_unassigned_shards: 0, number_of_pend [21:57:54] s: 6, 
number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 911, active_shards_percent_as_number: 87.7797672336616 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:54] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1079 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3922, relocating_shards: 0, initializing_shards: 36, unassigned_shards: 510, delayed_unassigned_shards: 0, number_of_pend [21:57:55] s: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 905, active_shards_percent_as_number: 87.7797672336616 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:55] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1070 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3922, relocating_shards: 0, initializing_shards: 36, unassigned_shards: 510, delayed_unassigned_shards: 0, number_of_pend [21:57:56] s: 6, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 906, active_shards_percent_as_number: 87.7797672336616 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:56] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1103 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3922, relocating_shards: 0, initializing_shards: 36, unassigned_shards: 510, delayed_unassigned_shards: 0, number_of_pend [21:57:57] s: 10, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 932, active_shards_percent_as_number: 87.7797672336616 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:57] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1078 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3985, relocating_shards: 0, initializing_shards: 26, unassigned_shards: 457, delayed_unassigned_shards: 0, number_of_pend [21:57:58] s: 1, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 89.18979409131602 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:58] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1096 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3985, relocating_shards: 0, initializing_shards: 26, unassigned_shards: 457, delayed_unassigned_shards: 0, number_of_pend [21:57:59] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 10, active_shards_percent_as_number: 89.18979409131602 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:57:59] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1091 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 3985, relocating_shards: 0, initializing_shards: 26, unassigned_shards: 457, delayed_unassigned_shards: 0, number_of_pend [21:58:00] s: 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 39, 
active_shards_percent_as_number: 89.18979409131602 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:05] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1088 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4150, relocating_shards: 0, initializing_shards: 19, unassigned_shards: 299, delayed_unassigned_shards: 0, number_of_pend [21:58:05] s: 13, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 298, active_shards_percent_as_number: 92.88272157564906 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:05] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1086 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4150, relocating_shards: 0, initializing_shards: 19, unassigned_shards: 299, delayed_unassigned_shards: 0, number_of_pend [21:58:05] s: 13, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 297, active_shards_percent_as_number: 92.88272157564906 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:05] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1085 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4150, relocating_shards: 0, initializing_shards: 19, unassigned_shards: 299, delayed_unassigned_shards: 0, number_of_pend [21:58:05] s: 14, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 311, active_shards_percent_as_number: 
92.88272157564906 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:05] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1077 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4150, relocating_shards: 0, initializing_shards: 19, unassigned_shards: 299, delayed_unassigned_shards: 0, number_of_pend [21:58:22] !log bking@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=search,name=eqiad [21:58:34] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1092 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4372, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 89, delayed_unassigned_shards: 0, number_of_pendin [21:58:34] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 15, active_shards_percent_as_number: 97.85138764547897 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:34] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1073 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4372, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 89, delayed_unassigned_shards: 0, number_of_pendin [21:58:34] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 14, active_shards_percent_as_number: 97.85138764547897 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:34] RECOVERY - OpenSearch health check for shards on 9200 on 
cirrussearch1087 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4372, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 89, delayed_unassigned_shards: 0, number_of_pendin [21:58:34] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 9, active_shards_percent_as_number: 97.85138764547897 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:34] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1122 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4372, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 89, delayed_unassigned_shards: 0, number_of_pendin [21:58:34] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 22, active_shards_percent_as_number: 97.85138764547897 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:58:34] RECOVERY - OpenSearch health check for shards on 9200 on cirrussearch1082 is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: yellow, timed_out: False, number_of_nodes: 55, number_of_data_nodes: 55, discovered_master: True, active_primary_shards: 1524, active_shards: 4372, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 89, delayed_unassigned_shards: 0, number_of_pendin [21:58:34] 2, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 33, active_shards_percent_as_number: 97.85138764547897 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:00:58] jouncebot: nowandnext [22:00:59] For the next 0 hour(s) and 59 minute(s): Weekly Security deployment window 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2100) [22:00:59] In 0 hour(s) and 59 minute(s): Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2300) [22:01:36] FIRING: [2x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [22:01:54] !log Removed primary mitigation for T400697 [22:01:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:32] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11096267 (10sbassett) I've recently removed one of our private security mitigations that may have been causing some of these unintended consequences (the incident had been resolved an... 
[22:09:32] FIRING: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:12:20] (03CR) 10Daimona Eaytoy: [C:03+1] "(Seen)" [puppet] - 10https://gerrit.wikimedia.org/r/1179712 (https://phabricator.wikimedia.org/T401882) (owner: 10Muehlenhoff) [22:14:18] RESOLVED: [2x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1011:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:16:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:18:33] preparing to run scap for the security deploy [22:25:00] jouncebot: now [22:25:00] For the next 0 hour(s) and 34 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2100) [22:25:43] Ah, right. I need to drop two rows from prod to fix T402239 (UBN). Could y'all please let me know if/when I can do that? [22:25:44] T402239: RuntimeException: Event 1836 should have only one address. 
- https://phabricator.wikimedia.org/T402239 [22:26:36] I'll be running scap for probably the next 15 mins but if that doesn't interfere with you, go ahead [22:26:43] currently running scap right now [22:26:57] daimona I'll post when I'm finished [22:27:33] (03PS2) 10Dbrant: mariadb: Document WikimediaEditorTasks tables. [puppet] - 10https://gerrit.wikimedia.org/r/1179695 (https://phabricator.wikimedia.org/T399302) [22:27:35] (03CR) 10Ladsgroup: [C:03+2] mariadb: Document WikimediaEditorTasks tables. [puppet] - 10https://gerrit.wikimedia.org/r/1179695 (https://phabricator.wikimedia.org/T399302) (owner: 10Dbrant) [22:27:37] (03CR) 10Ladsgroup: [V:03+2 C:03+2] mariadb: Document WikimediaEditorTasks tables. [puppet] - 10https://gerrit.wikimedia.org/r/1179695 (https://phabricator.wikimedia.org/T399302) (owner: 10Dbrant) [22:27:52] (03PS3) 10Ladsgroup: common.yaml: Remove private tables that have been cataloged [puppet] - 10https://gerrit.wikimedia.org/r/1179714 (https://phabricator.wikimedia.org/T399302) [22:27:58] (03CR) 10Ladsgroup: [V:03+2 C:03+2] common.yaml: Remove private tables that have been cataloged [puppet] - 10https://gerrit.wikimedia.org/r/1179714 (https://phabricator.wikimedia.org/T399302) (owner: 10Ladsgroup) [22:28:00] Thank you! I will wait, and look into the root cause in the meantime. 
[22:30:50] first scap finished [22:31:16] !log Deployed security fix for T402075 [22:31:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:29] about to run the second (and final) scap command [22:32:31] running second scap [22:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [22:36:34] scap finished [22:36:44] !log Deployed security patches for several extensions [22:36:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:49] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1071.eqiad.wmnet with reason: vacuum overlarge container dbs [22:37:01] daimona I'm done, it's all yours [22:37:08] Thank you! [22:40:03] !log Manually dropping DB rows in wikishared causing fatals # T402239#11096385 [22:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:40:08] T402239: RuntimeException: Event 1836 should have only one address. - https://phabricator.wikimedia.org/T402239 [22:43:46] (done) [22:46:03] maryum: May I do a deployment? 
[23:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250818T2300) [23:02:34] (03CR) 10Zabe: [C:03+2] Stop writing to cl_to and cl_collation on a few large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179252 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [23:03:10] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1071.eqiad.wmnet with reason: vacuum overlarge container dbs [23:03:26] (03Merged) 10jenkins-bot: Stop writing to cl_to and cl_collation on a few large wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1179252 (https://phabricator.wikimedia.org/T399579) (owner: 10Zabe) [23:03:50] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] [23:03:54] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [23:04:33] FIRING: OsmSynchronisationLag: Maps - OSM synchronization lag - codfw - https://wikitech.wikimedia.org/wiki/Maps/Runbook - https://grafana.wikimedia.org/d/000000305/maps-performances - https://alerts.wikimedia.org/?q=alertname%3DOsmSynchronisationLag [23:04:33] FIRING: [2x] SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-codfw:et-0/0/6 (Core: lsw1-f2-codfw:ethernet-1/55 {#130117100025}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [23:07:49] RECOVERY - Disk space on ms-be1071 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1071&var-datasource=eqiad+prometheus/ops [23:08:18] \o/ [23:08:36] (see -persistence for more info) [23:11:25] 
FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:16:58] I was wondering why scap is taking so long [23:17:01] 23:08:13 [mediawiki-publish-83] OSError: [Errno 28] No space left on device [23:17:38] !log zabe@deploy1003 sync-world aborted: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] (duration: 13m 48s) [23:17:42] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [23:18:09] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] [23:18:40] lol, I think that's why swift is running out of disk [23:18:44] sigh [23:19:03] the images are stored in swift [23:19:26] but probably not properly? the tankers have a lot of space [23:19:58] !log zabe@deploy1003 sync-world aborted: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] (duration: 01m 49s) [23:21:04] Amir1: are you currently trying to free up some space or should I revert? [23:21:29] I am, but there might be more places with this issue [23:22:46] only thing else that alerted was ms-be1069 but that was fine when I checked it [23:23:16] can you try again and tell me what error you see? 
[23:23:29] PROBLEM - Disk space on deploy1003 is CRITICAL: DISK CRITICAL - free space: /srv 7477 MB (2% inode=68%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [23:23:47] yup, also deploy1003 is running out of space, yay [23:24:40] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] [23:24:44] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [23:25:38] /srv/homedirs/mwmaint1002 is the reason [23:26:25] RESOLVED: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:26:27] created T401647 only a few days ago [23:26:28] T401647: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647 [23:26:45] was apparently not enough [23:26:51] !log zabe@deploy1003 sync-world aborted: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] (duration: 02m 11s) [23:28:12] and here the error I see: https://phabricator.wikimedia.org/F65781013 [23:28:34] I'll see if I can free up some space [23:28:37] dropping directories of a couple of people who are long gone [23:30:25] FIRING: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:30:30] 117G free on /srv now. [23:30:37] Now to investigate spiderpig-jobrunner. 
[23:31:56] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11096500 (10Ladsgroup) We ran into this again tonight. This seems to be the biggest problem: ` root@deploy1003:/srv/homedirs# du -hs mwmaint* 14G mwmaint1002 13G mwmaint2002 ` [23:32:21] spiderpig-jobrunner failed due to disk space. Restarted. [23:33:10] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11096508 (10Ladsgroup) Top offenders: ` 1293600 tstarling 1774084 oblivian 1781788 samtar 2098668 cparle ` [23:33:15] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11096509 (10Ladsgroup) 05Resolved→03Open [23:34:15] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] [23:34:19] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [23:35:14] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11096512 (10dancy) I freed a ton of space by running `scap clean-images`. Note that currently you must be a member of the `docker` group to successfully run this command. [23:35:25] RESOLVED: SystemdUnitFailed: spiderpig-jobrunner.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:38:19] thanks dancy ! 
[23:38:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179762 [23:38:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179762 (owner: 10TrainBranchBot) [23:40:12] 06SRE, 06Traffic: Intermittent access issues to English Wikipedia on desktop/laptop - https://phabricator.wikimedia.org/T402142#11096515 (10Josve05a) >>! In T402142#11092492, @Josve05a wrote: > Additional report from another user (Ticket #2025081710003225) - second person today: > > - Device: HP Pavilion x360... [23:42:21] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11096519 (10dancy) Noting that we're building 8.1 and 8.3 multiversion images now, and the single "next" version images too. And there have been a few separate adjustments made to the base images recently, so... [23:43:29] RECOVERY - Disk space on deploy1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=deploy1003&var-datasource=eqiad+prometheus/ops [23:45:51] 06SRE, 06serviceops: deploy1003 running out of disk space - https://phabricator.wikimedia.org/T401647#11096527 (10Scott_French) I now wonder if this has also been exacerbated by {T402212}, which would have resulted in unnecessary full rebuilds. This should be better as of today, thanks to the workarounds @danc... 
[23:47:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to eqiad RIPE Atlas anchor: failures over threshold for measurement 96503802 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [23:48:15] 23:45:12 [mediawiki-publish-83] Waiting 300 seconds for swift after full mediawiki image build (T390251) [23:48:16] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [23:48:40] oof painful. [23:49:14] 06SRE, 10LDAP-Access-Requests: Grant Access to gerritadmin for qchris (NDA refresh) - https://phabricator.wikimedia.org/T400847#11096540 (10KFrancis) Hi all, the NDA is complete. Thanks! [23:50:18] is it doing a full build due to the cleanups and/or the disk space issues? [23:50:41] `23:34:34 [mediawiki-publish-83] Image docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-08-18-222147-publish-83 is not suitable due to rsync transfer pct 38.40373310826106 (threshold is 25)` [23:51:00] Did l10n rebuild happen? [23:51:17] No [23:51:25] huh. 
[23:51:56] `23:34:34 [mediawiki-publish-81] Using docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-08-18-232455-publish-81 as the base image (incremental build)` [23:51:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:52:02] https://phabricator.wikimedia.org/P81482 [23:52:34] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1179762 (owner: 10TrainBranchBot) [23:55:25] !log zabe@deploy1003 zabe: Backport for [[gerrit:1179252|Stop writing to cl_to and cl_collation on a few large wikis (T399579)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [23:55:29] T399579: Stop writing to cl_to and cl_collation - https://phabricator.wikimedia.org/T399579 [23:57:28] the difference in the timestamps (`222147` and `232455`) is curious ... apparently the last time the -83 image successfully built was during the security window, but perhaps not every build that needed to happen in fact did [23:59:22] !log zabe@deploy1003 zabe: Continuing with sync