[00:04:51] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [00:08:23] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1163065 [00:08:23] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1163065 (owner: 10TrainBranchBot) [00:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [00:28:24] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1163065 (owner: 10TrainBranchBot) [00:38:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:42:11] (03CR) 10Ssingh: [C:03+1] "Looks good from Traffic's end at least and so do the steps mentioned." 
[dns] - 10https://gerrit.wikimedia.org/r/1163055 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [00:45:04] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover, 13Patch-For-Review: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10940520 (10Scott_French) [00:59:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:08:13] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.45.0-wmf.7 [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163072 (https://phabricator.wikimedia.org/T392177) [01:08:14] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/1.45.0-wmf.7 [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163072 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [01:20:37] (03Merged) 10jenkins-bot: Branch commit for wmf/1.45.0-wmf.7 [core] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163072 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [01:30:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [01:50:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T0200) [02:30:41] FIRING: 
SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T0300) [03:00:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [03:01:52] (03PS1) 10TrainBranchBot: testwikis to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163080 (https://phabricator.wikimedia.org/T392177) [03:01:53] (03CR) 10TrainBranchBot: [C:03+2] testwikis to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163080 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [03:02:44] (03Merged) 10jenkins-bot: testwikis to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163080 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [03:03:05] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.45.0-wmf.7 refs T392177 [03:03:11] T392177: 1.45.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T392177 [03:13:32] (03PS2) 10Stang: zhwiki: Remove autopatrol from patroller group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163081 (https://phabricator.wikimedia.org/T397676) [03:40:34] PROBLEM - Disk space on mwdebug1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=79%): /tmp 0 MB (0% inode=79%): /var/tmp 0 MB (0% inode=79%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1002&var-datasource=eqiad+prometheus/ops [03:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:47:04] PROBLEM - Disk space on mwdebug1001 is CRITICAL: DISK CRITICAL - free space: / 1039 MB (2% inode=79%): /tmp 1039 MB (2% inode=79%): /var/tmp 1039 MB (2% inode=79%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [03:49:46] !log mwpresync@deploy1003 Finished scap sync-world: testwikis to 1.45.0-wmf.7 refs T392177 (duration: 46m 40s) [03:49:52] T392177: 1.45.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T392177 [04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T0400) [04:03:12] !log mwpresync@deploy1003 Pruned MediaWiki: 1.45.0-wmf.3, 1.45.0-wmf.4 (duration: 03m 08s) [04:05:06] FIRING: [3x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [04:07:04] RECOVERY - Disk space on mwdebug1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1001&var-datasource=eqiad+prometheus/ops [04:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:20:34] RECOVERY - Disk space on mwdebug1002 is OK: 
DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mwdebug1002&var-datasource=eqiad+prometheus/ops [04:25:45] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [04:33:58] PROBLEM - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:38:30] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:59:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:30:55] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [05:33:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web-next_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:34:33] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T0600) [06:00:05] marostegui, Amir1, and federico3: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T0600). [06:26:17] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 24 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) (owner: 10Kosta Harlan) [06:30:41] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:37:23] (03PS7) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [06:42:56] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [06:53:49] (03CR) 10Volans: "Left couple of comments inline, LGTM otherwise. 
Couldn't run PCC as the host's facts are not yet present in PCC and the manual procedure a" [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [06:54:11] (03PS1) 10Muehlenhoff: Update account records for hghani [puppet] - 10https://gerrit.wikimedia.org/r/1163205 [06:54:22] (03CR) 10Tchanders: "We have approval from comms." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) (owner: 10Tchanders) [06:55:42] (03PS2) 10Tchanders: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) [06:58:41] (03PS8) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 [06:58:45] (03CR) 10Muehlenhoff: [C:03+2] Update account records for hghani [puppet] - 10https://gerrit.wikimedia.org/r/1163205 (owner: 10Muehlenhoff) [07:00:05] Amir1, Urbanecm, and awight: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T0700). [07:00:05] Tchanders and kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:09] \o [07:00:12] o/ [07:00:22] Shall I deploy this time? [07:00:23] long time no see :) [07:00:29] sure [07:00:32] :D [07:00:53] Ok. I'll do them separately in case of needing to roll back. Temp accounts first... 
[07:00:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:01:00] sounds good [07:01:22] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) (owner: 10Tchanders) [07:02:00] (03CR) 10Muehlenhoff: Role and default Hiera settings for debmonitor-dev (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [07:02:08] (03Merged) 10jenkins-bot: temp accounts: Enable temp account creation on further wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1155684 (https://phabricator.wikimedia.org/T396465) (owner: 10Tchanders) [07:02:30] !log disable puppet on cp7001 and depool to test new hiddenparma/varnish/haproxy syntax (T396621) [07:02:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:02:35] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [07:02:52] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1155684|temp accounts: Enable temp account creation on further wikis (T396465)]] [07:02:57] T396465: Temp Accounts: 24 June, 2025 deployment - https://phabricator.wikimedia.org/T396465 [07:04:20] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [07:06:36] !log tchanders@deploy1003 tchanders: Backport for [[gerrit:1155684|temp accounts: Enable temp account creation on further wikis (T396465)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [07:08:11] Taking a look at some RecentChanges pages and IP reveal tools... 
[07:09:22] (03CR) 10Jelto: [C:03+1] "I'm not sure if this solves the issue, the error message was ` Error creating: pods "mobileapps-staging-68c4cd89f5-5r7v9" is forbidden: fa" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162943 (owner: 10Jgiannelos) [07:09:25] (03PS1) 10Muehlenhoff: Record updated access dates [puppet] - 10https://gerrit.wikimedia.org/r/1163208 [07:10:23] (03CR) 10Ayounsi: "thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [07:10:38] (03PS5) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [07:11:20] (03CR) 10Volans: [C:03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [07:11:20] Continuing [07:11:28] !log tchanders@deploy1003 tchanders: Continuing with sync [07:11:38] (03CR) 10Muehlenhoff: [C:03+2] Role and default Hiera settings for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1162902 (owner: 10Muehlenhoff) [07:12:39] (03CR) 10Volans: Netbox: add primary_mac_address get/set (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [07:12:56] (03CR) 10Muehlenhoff: [C:03+2] Record updated access dates [puppet] - 10https://gerrit.wikimedia.org/r/1163208 (owner: 10Muehlenhoff) [07:13:00] (03PS1) 10Muehlenhoff: Assign debmonitor::server_dev role to debmonitor-dev2001 [puppet] - 10https://gerrit.wikimedia.org/r/1163210 [07:13:07] (03PS3) 10Arnaudb: gerrit: read-only plugin orchestration in failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) [07:13:07] (03CR) 10Arnaudb: "I've updated the wording, we decided to keep the "`watch ssh`" parsing as a TODO off band. 
I'll review all scripts to make sure there is n" [cookbooks] - 10https://gerrit.wikimedia.org/r/1159395 (https://phabricator.wikimedia.org/T395440) (owner: 10Arnaudb) [07:13:22] (03PS2) 10Federico Ceratto: wmf_root_client.pp: install wmfdb-admin on cumin [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) [07:13:47] (03CR) 10Federico Ceratto: "Updated using ensure_packages" [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) (owner: 10Federico Ceratto) [07:14:46] (03CR) 10Muehlenhoff: [C:03+2] Assign debmonitor::server_dev role to debmonitor-dev2001 [puppet] - 10https://gerrit.wikimedia.org/r/1163210 (owner: 10Muehlenhoff) [07:18:34] (03PS1) 10Muehlenhoff: Add dummy secrets for debmonitor_dev [labs/private] - 10https://gerrit.wikimedia.org/r/1163211 [07:18:55] !log switching off confd on cp7001 to perform tests on varnish/haproxy configuration files (T396621) [07:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:01] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [07:19:18] (03CR) 10Tchanders: [C:03+1] UserInfoCard: Enable by default for named users on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) (owner: 10Kosta Harlan) [07:20:16] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1155684|temp accounts: Enable temp account creation on further wikis (T396465)]] (duration: 17m 23s) [07:20:21] T396465: Temp Accounts: 24 June, 2025 deployment - https://phabricator.wikimedia.org/T396465 [07:21:04] Temp accounts done. 
kostajh: I'll start UserInfoCard [07:21:24] (03CR) 10Brouberol: [C:03+1] "Nicely done" [puppet] - 10https://gerrit.wikimedia.org/r/1163040 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [07:21:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) (owner: 10Kosta Harlan) [07:21:37] Tchanders: thanks! I can verify it when it's on mwdebug [07:22:14] (03Merged) 10jenkins-bot: UserInfoCard: Enable by default for named users on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162742 (https://phabricator.wikimedia.org/T397292) (owner: 10Kosta Harlan) [07:22:40] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1162742|UserInfoCard: Enable by default for named users on testwiki (T397292)]] [07:22:45] T397292: UserInfo: Enable on testwiki - https://phabricator.wikimedia.org/T397292 [07:24:14] (03PS1) 10Vgutierrez: hiera: Switch lvs3009 (upload) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1163212 (https://phabricator.wikimedia.org/T396561) [07:24:25] kostajh: OK [07:24:56] !log tchanders@deploy1003 kharlan, tchanders: Backport for [[gerrit:1162742|UserInfoCard: Enable by default for named users on testwiki (T397292)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:25:01] looking [07:25:48] (03PS1) 10Muehlenhoff: Add debmonitor-next.w.o to caches [puppet] - 10https://gerrit.wikimedia.org/r/1163218 [07:26:00] Tchanders: lgtm [07:26:44] kostajh: continuing [07:26:51] !log tchanders@deploy1003 kharlan, tchanders: Continuing with sync [07:28:24] (03PS1) 10Muehlenhoff: Add IDP config for debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1163257 [07:29:32] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163212 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [07:29:46] (03CR) 10MVernon: [C:03+1] "Hi," [dns] - 10https://gerrit.wikimedia.org/r/1163055 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [07:30:00] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) (owner: 10Federico Ceratto) [07:30:11] Tchanders: thanks for the deploys! [07:30:17] (03CR) 10Volans: Add IDP config for debmonitor-next (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163257 (owner: 10Muehlenhoff) [07:32:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:33:43] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1162742|UserInfoCard: Enable by default for named users on testwiki (T397292)]] (duration: 11m 03s) [07:33:48] T397292: UserInfo: Enable on testwiki - https://phabricator.wikimedia.org/T397292 [07:34:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:35:03] Deployments scheduled for the early 
window are finished [07:35:52] (03PS2) 10Muehlenhoff: Add IDP config for debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1163257 [07:36:07] (03CR) 10Muehlenhoff: Add IDP config for debmonitor-next (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163257 (owner: 10Muehlenhoff) [07:37:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [07:38:04] !log UTC morning deploys done [07:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:39:13] (03PS2) 10Urbanecm: [Growth] Disable the Surfacing Structured Tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) [07:39:23] (03PS3) 10Urbanecm: [Growth] Disable the Surfacing Structured Tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) [07:39:28] (03PS1) 10Jelto: remove kubectl-completion master group before adding alternatives [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163287 (https://phabricator.wikimedia.org/T387548) [07:39:36] (03PS2) 10Urbanecm: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) [07:41:03] (03PS1) 10Urbanecm: [Growth] Remove feature flags related to Surfacing Structured Tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) [07:41:49] (03PS1) 10Jeena Huneidi: Remove all references to patchdemo legacy [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) [07:41:53] (03PS2) 10Klausman: WIP: services/machinetranslation: adjust startup probe delays 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) [07:42:35] (03CR) 10CI reject: [V:04-1] Remove all references to patchdemo legacy [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) (owner: 10Jeena Huneidi) [07:43:59] (03PS2) 10Jeena Huneidi: Remove all references to patchdemo legacy [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) [07:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:45:54] (03CR) 10Elukey: [C:03+1] Add IDP config for debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1163257 (owner: 10Muehlenhoff) [07:46:09] (03CR) 10Elukey: [C:03+1] Add dummy secrets for debmonitor_dev [labs/private] - 10https://gerrit.wikimedia.org/r/1163211 (owner: 10Muehlenhoff) [07:46:55] (03CR) 10KartikMistry: [C:03+1] "You may want to remove WIP." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) (owner: 10Klausman) [07:48:37] (03CR) 10Elukey: "I think that we'd need to add the target config for ATS as well, so debmonitor-next points to the new VM that we created." 
[puppet] - 10https://gerrit.wikimedia.org/r/1163218 (owner: 10Muehlenhoff) [07:50:08] !log repooling cp7001 to test new varnish/haproxy reqctl rules syntax (T396621) [07:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:50:13] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [07:50:52] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [07:51:57] (03PS1) 10Jelto: remove kubectl-completion master group before adding alternatives [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163287 (https://phabricator.wikimedia.org/T387548) [07:51:57] (03CR) 10Jelto: [V:03+1] "I tested this locally. I installed `kubernetes-client123` `1.23.14-5` which defines the `kubectl-completion` as a master group. Then I ins" [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163287 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [07:52:45] (03PS1) 10Urbanecm: [Growth] testwiki: Enable the get-started-experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163292 (https://phabricator.wikimedia.org/T394958) [07:53:45] (03PS3) 10Urbanecm: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) [07:55:10] (03PS2) 10Urbanecm: [Growth] Remove feature flags related to Surfacing Structured Tasks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) [07:55:12] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1163257 (owner: 10Muehlenhoff) [07:55:12] (03PS5) 10Urbanecm: [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) [07:55:37] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1163218 
(owner: 10Muehlenhoff) [07:59:27] (03CR) 10Federico Ceratto: [C:03+2] wmf_root_client.pp: install wmfdb-admin on cumin [puppet] - 10https://gerrit.wikimedia.org/r/1160707 (https://phabricator.wikimedia.org/T393990) (owner: 10Federico Ceratto) [08:01:04] (03CR) 10Elukey: [C:03+1] "I don't get how this should fix the error message listed in https://phabricator.wikimedia.org/T392851#10928899, my understanding was that " [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [08:02:22] (03PS1) 10Brouberol: postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) [08:04:14] (03CR) 10CI reject: [V:04-1] postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [08:06:34] (03CR) 10Elukey: "Left a little change to make, I am wondering at this point if we should just test with S3 how it goes, before making these changes. 
They m" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) (owner: 10Klausman) [08:09:18] (03PS2) 10Brouberol: postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) [08:09:19] (03PS1) 10Brouberol: cloudnative-pg-cluster: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163294 (https://phabricator.wikimedia.org/T393998) [08:09:49] PROBLEM - debmonitor.wikimedia.org:7443 CDN on debmonitor-dev2001 is CRITICAL: HTTP CRITICAL - Invalid HTTP response received from host on port 7443: HTTP/1.1 200 OK https://wikitech.wikimedia.org/wiki/Debmonitor [08:10:59] expected, host in WIP, let me silence it [08:11:54] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [08:12:08] !log volans@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on debmonitor-dev2001.codfw.wmnet with reason: Setting up debmonitor-next [08:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:13:52] (03Abandoned) 10Federico Ceratto: mysql: Add PhabricatorTask utility [cookbooks] - 10https://gerrit.wikimedia.org/r/1154240 (https://phabricator.wikimedia.org/T395427) (owner: 10Federico Ceratto) [08:14:08] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [08:16:45] (03PS1) 10Krinkle: beta: Prepare vhost and CSP settings for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) [08:19:45] (03PS2) 10Krinkle: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) [08:20:40] FIRING: [2x] TransitBGPDown: Transit BGP session down 
between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:22:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [08:24:51] FIRING: [7x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:25:40] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-codfw and Lumen (2001:1900:2100::4b41) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [08:27:44] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [08:31:05] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [08:32:10] (03Abandoned) 10Kosta Harlan: ProofreadPage: Remove pagequality permission override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1080621 (https://phabricator.wikimedia.org/T326940) (owner: 10Kosta Harlan) [08:32:26] (03PS3) 10Krinkle: beta: Prepare vhost, CSP and UrlShortener for beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163295 (https://phabricator.wikimedia.org/T289318) [08:35:21] (03PS6) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [08:35:22] (03CR) 10Hashar: [C:03+1] "Looks good to me, I am letting @ebomani@wikimedia.org to review the code since she did wrote the code to support both patch demos system 👍" [software/gerrit] (deploy/wmf/stable-3.10) - 
10https://gerrit.wikimedia.org/r/1163289 (https://phabricator.wikimedia.org/T391866) (owner: 10Jeena Huneidi) [08:35:44] (03PS1) 10Giuseppe Lavagetto: cache: matching known clients precedes cloud matching [puppet] - 10https://gerrit.wikimedia.org/r/1163297 [08:37:21] (03CR) 10Hashar: "@ltoscano@wikimedia.org can you +2 this one please? That is causing CI/tests to fail currently. Thanks!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156651 (owner: 10Hashar) [08:37:35] (03CR) 10Vgutierrez: [C:03+1] "logic looks good, actual syntax checks left to the linters :)" [puppet] - 10https://gerrit.wikimedia.org/r/1163297 (owner: 10Giuseppe Lavagetto) [08:37:52] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add dummy secrets for debmonitor_dev [labs/private] - 10https://gerrit.wikimedia.org/r/1163211 (owner: 10Muehlenhoff) [08:39:25] (03PS1) 10Muehlenhoff: Disable notifications for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1163298 [08:40:00] (03PS1) 10Krinkle: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318) [08:40:07] (03CR) 10Volans: [C:03+1] "LGTM, thx" [puppet] - 10https://gerrit.wikimedia.org/r/1163298 (owner: 10Muehlenhoff) [08:40:17] (03CR) 10Muehlenhoff: [C:03+2] Disable notifications for debmonitor-dev [puppet] - 10https://gerrit.wikimedia.org/r/1163298 (owner: 10Muehlenhoff) [08:40:24] (03CR) 10Elukey: [C:03+2] tox: pin mypy<1.16.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156651 (owner: 10Hashar) [08:40:31] (03CR) 10JMeybohm: [C:04-1] "I don't think this is the right approach. 
If kubernetes-client123 is installed and kubernetes-131 is upgraded after, the kubectl-complete " [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163287 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [08:42:21] (03CR) 10Ayounsi: Netbox: add primary_mac_address get/set (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [08:42:46] (03Merged) 10jenkins-bot: tox: pin mypy<1.16.0 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156651 (owner: 10Hashar) [08:44:29] (03PS3) 10Brouberol: postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) [08:44:59] (03PS2) 10Krinkle: multiversion: Add support for www.wikidata.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163299 (https://phabricator.wikimedia.org/T289318) [08:45:30] (03CR) 10Volans: Netbox: add primary_mac_address get/set (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [08:48:17] (03PS1) 10Vgutierrez: cache: Provide duration metrics on wmfuniq experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1163301 (https://phabricator.wikimedia.org/T395001) [08:49:00] (03PS2) 10Vgutierrez: cache: Provide duration metrics on wmfuniq experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1163301 (https://phabricator.wikimedia.org/T395001) [08:51:42] (03CR) 10Slyngshede: [C:03+2] admin: hashar: sync up shell aliases [puppet] - 10https://gerrit.wikimedia.org/r/1140648 (owner: 10Hashar) [08:51:55] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:52:11] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:52:25] !log akosiaris@cumin1003 START - Cookbook 
sre.hosts.provision for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:53:09] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [08:54:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:56:14] (03PS18) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [08:57:07] 10SRE-tools, 10Spicerack: Increase the default batchsize of puppet.run() - https://phabricator.wikimedia.org/T397687 (10JMeybohm) 03NEW [08:57:08] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render [08:57:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render [08:57:20] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Increase the default batch size of puppet.run() - https://phabricator.wikimedia.org/T397687#10941151 (10JMeybohm) [08:58:11] (03PS3) 10Klausman: WIP: services/machinetranslation: adjust startup probe delays [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) [08:58:49] (03CR) 10Hnowlan: [C:03+2] changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [08:59:23] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render [08:59:31] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
global-data-captcha-render [08:59:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:49] (03CR) 10Klausman: "I am fine with going either way. I am not sure what Kartik's available time is. He already filed change 1163291, so we might as well go wi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) (owner: 10Klausman) [09:01:59] (03CR) 10Muehlenhoff: "Ack, but this needs some other DNS change first. I'll make that patch as a followup." [puppet] - 10https://gerrit.wikimedia.org/r/1163218 (owner: 10Muehlenhoff) [09:02:06] (03CR) 10Muehlenhoff: [C:03+2] Add debmonitor-next.w.o to caches [puppet] - 10https://gerrit.wikimedia.org/r/1163218 (owner: 10Muehlenhoff) [09:02:09] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1006.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:02:26] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1007.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:02:29] (03CR) 10Giuseppe Lavagetto: [C:03+2] cache: matching known clients precedes cloud matching [puppet] - 10https://gerrit.wikimedia.org/r/1163297 (owner: 10Giuseppe Lavagetto) [09:02:39] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1008.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:02:42] (03PS7) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [09:02:45] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [09:03:23] !log 
akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host aux-k8s-worker1009.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [09:03:37] (03PS1) 10Muehlenhoff: Add CNAME for debmonitor-next [dns] - 10https://gerrit.wikimedia.org/r/1163305 [09:04:46] (03CR) 10Ayounsi: Netbox: add primary_mac_address get/set (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [09:08:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:09:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-0/1/1:1 (Transport: cr2-eqiad:xe-3/2/2 (Lumen, 442550293) {#12253_12334-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:10:36] (03CR) 10Elukey: WIP: services/machinetranslation: adjust startup probe delays (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162985 (https://phabricator.wikimedia.org/T386889) (owner: 10Klausman) [09:11:26] (03PS19) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [09:12:01] (03PS20) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [09:12:05] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render [09:12:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render [09:12:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:12:46] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render [09:12:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render [09:13:45] FIRING: [3x] WidespreadPuppetFailure: Puppet has failed in eqsin - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:16:36] (03PS1) 10Muehlenhoff: Revert "Add debmonitor-next.w.o to caches" [puppet] - 10https://gerrit.wikimedia.org/r/1163307 [09:16:50] (03PS1) 10Jakob: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163306 [09:17:05] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [09:17:10] (03CR) 10Dima koushha: [C:03+1] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163306 (owner: 10Jakob) [09:17:18] (03CR) 10Muehlenhoff: [C:03+2] Revert "Add debmonitor-next.w.o to caches" [puppet] - 10https://gerrit.wikimedia.org/r/1163307 (owner: 10Muehlenhoff) [09:18:11] (03CR) 10Jakob: [C:03+2] wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163306 (owner: 10Jakob) [09:18:16] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm [09:18:36] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:18:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:18:49] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:19:40] (03Merged) 10jenkins-bot: wikidata-query-gui: Bump query-gui image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163306 (owner: 10Jakob) [09:20:10] !log jakob@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [09:20:17] (03CR) 10Hnowlan: changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [09:20:24] !log jakob@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [09:20:26] (03CR) 10Hnowlan: [C:03+2] changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [09:20:55] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [09:21:26] !log jakob@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [09:21:40] !log jakob@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [09:22:10] !log jakob@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [09:22:32] !log jakob@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [09:27:02] (03PS21) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [09:28:32] !log mvernon@cumin1002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.ad in eqiad [09:28:32] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.remove-ghost-objects 
(exit_code=99) from container wikipedia-commons-local-public.ad in eqiad [09:30:06] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage [09:30:54] (03CR) 10Muehlenhoff: [C:03+2] Add CNAME for debmonitor-next [dns] - 10https://gerrit.wikimedia.org/r/1163305 (owner: 10Muehlenhoff) [09:31:14] !log jmm@dns1004 START - running authdns-update [09:31:27] (03CR) 10Effie Mouzeli: [C:03+1] mw-(api-ext|web): pilot 5% of traffic on new httpd images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162962 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [09:32:11] !log jmm@dns1004 END - running authdns-update [09:33:25] (03PS22) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [09:33:32] !log mvernon@cumin1002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.ad in eqiad [09:34:10] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1006.eqiad.wmnet with reason: host reimage [09:35:43] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm [09:36:05] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.ad in eqiad [09:38:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:38:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - 
https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:40:09] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [09:41:34] (03CR) 10Jgiannelos: [C:03+2] mobileapps: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162943 (owner: 10Jgiannelos) [09:41:49] (03PS1) 10Volans: images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 [09:43:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:44:28] (03CR) 10Volans: "I agree but that's what I got from manual reading, XML schema downloading and testing 😊" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162986 (https://phabricator.wikimedia.org/T392851) (owner: 10Volans) [09:46:41] (03CR) 10Fabfur: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1163212 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [09:47:08] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage [09:48:45] FIRING: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:49:03] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1006.eqiad.wmnet with OS bookworm [09:49:25] !log deploying python3-wmflib v2.0.0 fleetwide [09:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:59] !log akosiaris@cumin1003 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1007.eqiad.wmnet with reason: host reimage [09:50:21] !log disable puppet and shutdown confd on A:cp to deploy new hiddenparma version (T396621_ [09:50:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:29] !log disable puppet and shutdown confd on A:cp to deploy new hiddenparma version (T396621) [09:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:34] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [09:52:32] (03PS1) 10Muehlenhoff: Add debmonitor-next to caches [puppet] - 10https://gerrit.wikimedia.org/r/1163311 [09:52:33] (03CR) 10Btullis: [C:03+1] cloudnative-pg-cluster: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163294 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:52:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:52:39] (03CR) 10Ladsgroup: [C:03+1] wikitech: remove logging configuration for hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161871 (https://phabricator.wikimedia.org/T371592) (owner: 10Hashar) [09:52:50] (03CR) 10Btullis: [C:03+1] postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:53:45] RESOLVED: [5x] WidespreadPuppetFailure: Puppet has failed in drmrs - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure 
[09:53:51] (03PS1) 10Ladsgroup: Specify caller for query builder in GlobalJsonLinks [extensions/JsonConfig] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163312 [09:53:56] (03CR) 10Brouberol: [C:03+2] cloudnative-pg-cluster: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163294 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:53:59] (03CR) 10Brouberol: [C:03+2] postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:54:00] jouncebot: nowandnext [09:54:01] No deployments scheduled for the next 0 hour(s) and 5 minute(s) [09:54:01] In 0 hour(s) and 5 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1000) [09:54:47] !log confd shutodown on A:cp (T396621) [09:54:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:54:55] (03PS2) 10Volans: images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) [09:55:08] (03CR) 10Brouberol: airflow-test-k8s: bump the max_connections (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) (owner: 10Btullis) [09:55:58] (03Merged) 10jenkins-bot: cloudnative-pg-cluster: fix indentation [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163294 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:55:58] (03PS1) 10Fabfur: New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1163313 [09:56:01] (03Merged) 10jenkins-bot: postgresql-airflow-dev: triple max connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163293 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:56:57] !log remove kubernetes-client123 (1.23.14-5) form 
kubestargemaster100[3-5] - T387548 [09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:02] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [09:57:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:57:38] (03CR) 10Fabfur: [V:03+2 C:03+2] New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1163313 (owner: 10Fabfur) [09:58:38] !log fabfur@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "x-provenance support - fabfur@cumin1002 - T396621" [09:58:39] !log fabfur@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: x-provenance support - fabfur@cumin1002 - T396621 [09:58:43] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [09:58:46] (03PS1) 10Ayounsi: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 [09:58:54] (03PS3) 10Btullis: airflow-test-k8s: bump the max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) [09:59:04] (03CR) 10Btullis: airflow-test-k8s: bump the max_connections (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) (owner: 10Btullis) [09:59:10] !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: x-provenance support - fabfur@cumin1002 - T396621 [09:59:11] !log 
fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "x-provenance support - fabfur@cumin1002 - T396621" [09:59:43] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163311 (owner: 10Muehlenhoff) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1000) [10:00:13] (03CR) 10Brouberol: [C:03+1] airflow-test-k8s: bump the max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) (owner: 10Btullis) [10:00:34] (03CR) 10Btullis: [V:03+1 C:03+2] Add the geoip databases to the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/1163040 (https://phabricator.wikimedia.org/T369845) (owner: 10Btullis) [10:00:43] (03CR) 10Clément Goubert: [C:03+1] memcached: enable extstore on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [10:01:58] (03CR) 10Jgiannelos: [C:03+1] changeprop: emit abandoned events metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161893 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:02:16] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [10:02:20] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7002.magru.wmnet [10:02:46] (03Abandoned) 10Jelto: remove kubectl-completion master group before adding alternatives [debs/kubernetes] (v1.31) - 10https://gerrit.wikimedia.org/r/1163287 (https://phabricator.wikimedia.org/T387548) (owner: 10Jelto) [10:02:59] !log disable puppet on mc2* [10:03:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:12] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, one nit inline" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310
(https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:03:27] (03CR) 10Effie Mouzeli: [C:03+2] memcached: enable extstore on codfw [puppet] - 10https://gerrit.wikimedia.org/r/1162904 (https://phabricator.wikimedia.org/T352885) (owner: 10Effie Mouzeli) [10:04:14] !log cp700[1-2] depooled to test new hiddenparma rules syntax (T396621) [10:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:19] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [10:04:20] (03PS3) 10Volans: images: add a very simple API for the image detail [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) [10:04:20] (03CR) 10Jgiannelos: [C:03+2] push-notifications: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162942 (owner: 10Jgiannelos) [10:04:29] (03CR) 10Volans: "addressed comment" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:04:33] (03CR) 10Jgiannelos: [C:03+2] proton: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162941 (owner: 10Jgiannelos) [10:04:41] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162937 (owner: 10Jgiannelos) [10:06:01] (03CR) 10Btullis: [C:03+2] airflow-test-k8s: bump the max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) (owner: 10Btullis) [10:06:06] (03Merged) 10jenkins-bot: wikifeeds: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162937 (owner: 10Jgiannelos) [10:06:36] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1007.eqiad.wmnet with OS bookworm [10:06:36] (03Merged) 10jenkins-bot: mobileapps: Test nodejs 20 on staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1162943 (owner: 10Jgiannelos) [10:06:45] (03Merged) 10jenkins-bot: proton: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162941 (owner: 10Jgiannelos) [10:07:01] (03Merged) 10jenkins-bot: push-notifications: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162942 (owner: 10Jgiannelos) [10:07:12] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [10:07:37] (03CR) 10CI reject: [V:04-1] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [10:08:10] (03Merged) 10jenkins-bot: airflow-test-k8s: bump the max_connections [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161879 (https://phabricator.wikimedia.org/T391564) (owner: 10Btullis) [10:08:26] (03CR) 10Hnowlan: [C:03+2] changeprop: emit abandoned events metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161893 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:08:30] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [10:08:41] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:09:15] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1001.eqiad.wmnet [10:09:34] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2001.codfw.wmnet [10:09:48] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [10:10:18] (03Merged) 10jenkins-bot: changeprop: emit abandoned events metric [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161893 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:11:47] (03CR) 10Hnowlan: [C:03+2] changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 
(https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:11:54] (03CR) 10CI reject: [V:04-1] changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:12:07] jelto: 👋 My patch for mobileapps resources didn't do the trick :/ deployments still hang [10:13:05] jouncebot: nowandnext [10:13:06] For the next 0 hour(s) and 46 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1000) [10:13:06] In 1 hour(s) and 46 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1200) [10:14:45] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [10:15:05] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1001.eqiad.wmnet [10:15:51] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2001.codfw.wmnet [10:16:41] FIRING: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:16:46] (03PS2) 10Hnowlan: changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) [10:17:05] ^^ it's me [10:17:30] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:17:31] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf2002.codfw.wmnet [10:17:37] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc-wf1002.eqiad.wmnet [10:18:04] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling
P{lvs3009.esams.wmnet} and A:liberica (T396561) [10:18:05] !log dropping searchindex table everywhere (T397367) [10:18:09] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [10:18:10] nemo-yiannis: The creation still fails with "Error creating: pods "mobileapps-staging-766cb9bc9-flw25" is forbidden: failed quota: quota-compute-resources: must specify limits.cpu for: staging-metrics-exporter; limits.memory for: staging-metrics-exporter" [10:18:10] I'm not fully sure about the statsd metric exporter, I'd recommend reaching out to serviceops [10:18:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:13] T397367: Drop unneeded empty tables from wikis - https://phabricator.wikimedia.org/T397367 [10:18:24] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3009.esams.wmnet} and A:liberica (T396561) [10:18:29] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:18:35] jelto: I filed a ticked and pinged folks on serviceops [10:18:41] great thank you [10:18:52] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:18:58] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs3009.esams.wmnet with reason: switching to katran [10:19:01] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs3009 (upload) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1163212 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:19:57] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:20:05] (03PS1) 10Clément Goubert: sre.k8s.reboot-nodes: Raise max batch size to 20 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163317 [10:20:36] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:21:37] !log hnowlan@deploy1003 helmfile [eqiad] START
helmfile.d/services/changeprop: apply [10:22:11] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:22:18] (03CR) 10Hnowlan: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:22:34] (03CR) 10JMeybohm: [C:03+1] "Yes!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163317 (owner: 10Clément Goubert) [10:23:17] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf2002.codfw.wmnet [10:23:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-wf1002.eqiad.wmnet [10:24:07] (03Merged) 10jenkins-bot: changeprop: implement batch_size parameter for pcs job [deployment-charts] - 10https://gerrit.wikimedia.org/r/1161957 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:27:22] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.reboot-nodes: Raise max batch size to 20 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163317 (owner: 10Clément Goubert) [10:27:56] (03PS1) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [10:28:09] (03PS1) 10Samwilson: InitialiseSettings: Enable TemplateDiscovery on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163319 (https://phabricator.wikimedia.org/T377975) [10:29:11] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:29:12] (03CR) 10Dr0ptp4kt: Add WMDE Fundraising banner event stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148322 (owner: 10Abban Dunne) [10:29:14] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs3009.esams.wmnet [10:29:14] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs3009.esams.wmnet [10:29:18] !log jgiannelos@deploy1003 helmfile [staging] DONE 
helmfile.d/services/wikifeeds: apply [10:29:42] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:29:47] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:29:50] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1053.eqiad.wmnet [10:29:51] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:29:55] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:29:55] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:29:58] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [10:30:05] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [10:30:11] (03CR) 10LD: [C:03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163081 (https://phabricator.wikimedia.org/T397676) (owner: 10Stang) [10:30:11] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [10:30:40] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:31:10] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/push-notifications: apply [10:31:26] (03PS1) 10Fabfur: New release [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1163321 [10:31:39] (03PS1) 10Vgutierrez: hiera: Repool lvs3009 using katran [puppet] - 10https://gerrit.wikimedia.org/r/1163322 (https://phabricator.wikimedia.org/T396561) [10:31:42] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-snippets [10:31:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox-snippets (exit_code=0) [10:31:46] (03CR) 10Fabfur: [V:03+2 C:03+2] "new release to fix 
previous one" [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1163321 (owner: 10Fabfur) [10:31:46] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [10:32:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163322 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:32:41] !log fabfur@cumin1002 START - Cookbook sre.deploy.hiddenparma Hiddenparma deployment to the alerting hosts with reason: "Fix previous release - fabfur@cumin1002 - T396621" [10:32:42] !log fabfur@cumin1002 START - Cookbook sre.deploy.python-code hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix previous release - fabfur@cumin1002 - T396621 [10:32:47] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [10:32:47] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [10:33:14] !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) hiddenparma to alert[1002,2002].wikimedia.org with reason: Fix previous release - fabfur@cumin1002 - T396621 [10:33:15] !log fabfur@cumin1002 END (PASS) - Cookbook sre.deploy.hiddenparma (exit_code=0) Hiddenparma deployment to the alerting hosts with reason: "Fix previous release - fabfur@cumin1002 - T396621" [10:33:21] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs3009 using katran [puppet] - 10https://gerrit.wikimedia.org/r/1163322 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:34:39] (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Raise max batch size to 20 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163317 (owner: 10Clément Goubert) [10:34:55] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [10:35:30] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host mc1053.eqiad.wmnet [10:36:15] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3009.esams.wmnet} and A:liberica (T396561) [10:36:17] !log repool lvs3009 using katran - T396561 [10:36:21] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [10:36:22] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3009.esams.wmnet} and A:liberica (T396561) [10:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:36:41] RESOLVED: ConfdResourceFailed: confd resource _etc_haproxy_conf.d_tls.cfg.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [10:37:24] (03PS23) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [10:37:41] !log mvernon@cumin1002 START - Cookbook sre.swift.remove-ghost-objects from container wikipedia-commons-local-public.ad in eqiad [10:38:27] (03PS2) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [10:38:33] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-snippets [10:38:33] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox-snippets (exit_code=99) [10:38:42] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:39:05] (03PS1) 10Ladsgroup: Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) [10:40:06] (03CR) 10Michael Große: [C:03+1] [Growth] Disable the Surfacing Structured Tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [10:40:13] !log 
cmooney@cumin1003 START - Cookbook sre.dns.netbox-snippets [10:40:13] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox-snippets (exit_code=99) [10:40:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.remove-ghost-objects (exit_code=0) from container wikipedia-commons-local-public.ad in eqiad [10:40:19] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-snippets [10:40:20] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox-snippets (exit_code=99) [10:40:51] (03PS3) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [10:40:56] (03CR) 10Michael Große: [C:03+1] [Growth] testwiki: Enable the get-started-experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163292 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [10:40:57] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-snippets [10:40:57] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox-snippets (exit_code=0) [10:41:03] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/proton: apply [10:41:23] !log repooling cp700[1-2] (T396621) [10:41:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:28] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [10:41:39] (03CR) 10Michael Große: [C:03+1] [Growth] Remove support code for Surfacing Structured Tasks experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163028 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [10:42:22] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/proton: apply [10:42:39] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.30 [10:44:00] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7002.magru.wmnet [10:44:04] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: 
name=cp7001.magru.wmnet [10:44:05] (03PS8) 10Ayounsi: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 [10:44:05] (03PS2) 10Ayounsi: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 [10:44:12] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7009.magru.wmnet [10:44:32] !log depooled cp7009 to test for requestctl new syntax (T396621) [10:44:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:47] (03PS1) 10Vgutierrez: hiera: Switch lvs3008 (text) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1163325 (https://phabricator.wikimedia.org/T396561) [10:44:58] (03PS1) 10Hnowlan: mobileapps: set resource limits for statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163326 (https://phabricator.wikimedia.org/T397703) [10:45:17] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163325 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:45:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 24 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163319 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [10:46:52] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [10:49:49] (03PS24) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [10:50:55] (03CR) 10Clément Goubert: [C:03+1] mobileapps: set resource limits for statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163326 (https://phabricator.wikimedia.org/T397703) (owner: 10Hnowlan) [10:51:32] (03CR) 10Michael Große: "Looks good, but should this wait 
until the related code has actually been removed from production by the train rolling out to Group 2 next" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [10:51:53] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:52:08] (03CR) 10Urbanecm: "As I indicated in the commit message, waiting for the code to be removed definitely makes sense to me." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163288 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [10:52:14] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:52:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.30 [10:52:22] (03CR) 10Hnowlan: [C:03+2] mobileapps: set resource limits for statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163326 (https://phabricator.wikimedia.org/T397703) (owner: 10Hnowlan) [10:52:26] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1150.eqiad.wmnet [10:52:46] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163310 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [10:53:08] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7009.magru.wmnet [10:54:01] (03Merged) 10jenkins-bot: mobileapps: set resource limits for statsd exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163326 (https://phabricator.wikimedia.org/T397703) (owner: 10Hnowlan) [10:54:28] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1150.eqiad.wmnet [10:55:00] (03PS1) 10Brouberol: postgresql-airflow-test-k8s: rename value file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163329 (https://phabricator.wikimedia.org/T391564) [10:55:22] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1151.eqiad.wmnet [10:55:49] (03CR) 10Btullis: [C:03+1] postgresql-airflow-test-k8s: rename value file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163329 (https://phabricator.wikimedia.org/T391564) (owner: 10Brouberol) [10:55:55] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:55:57] !log enable puppet and confd on A:cp (T396621) [10:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:04] T396621: Requestctl should use x-provenance header - https://phabricator.wikimedia.org/T396621 [10:56:10] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:56:58] (03CR) 10Brouberol: [C:03+2] postgresql-airflow-test-k8s: rename value file [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163329 (https://phabricator.wikimedia.org/T391564) (owner: 10Brouberol) [10:57:05] (03PS1) 10Alexandros Kosiaris: WIP: Support arm64 in sre.hosts.provision [cookbooks] - 10https://gerrit.wikimedia.org/r/1163330 (https://phabricator.wikimedia.org/T397653) [10:57:21] !log stevemunene@cumin1002 END (PASS) - 
Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1151.eqiad.wmnet [10:57:31] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.00 [10:57:45] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1152.eqiad.wmnet [10:58:20] (03CR) 10Ladsgroup: [C:03+2] Specify caller for query builder in GlobalJsonLinks [extensions/JsonConfig] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163312 (owner: 10Ladsgroup) [10:58:33] (03CR) 10Fabfur: [C:03+1] "gl!" [puppet] - 10https://gerrit.wikimedia.org/r/1163325 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [10:58:58] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:59:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161871 (https://phabricator.wikimedia.org/T371592) (owner: 10Hashar) [10:59:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [extensions/JsonConfig] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163312 (owner: 10Ladsgroup) [10:59:24] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:59:40] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [10:59:41] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1152.eqiad.wmnet [10:59:47] (03Merged) 10jenkins-bot: Specify caller for query builder in GlobalJsonLinks [extensions/JsonConfig] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163312 (owner: 10Ladsgroup) [11:00:19] (03Merged) 10jenkins-bot: wikitech: remove logging configuration for hooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1161871 (https://phabricator.wikimedia.org/T371592) 
(owner: 10Hashar) [11:00:52] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1161871|wikitech: remove logging configuration for hooks (T371592 T371374)]], [[gerrit:1163312|Specify caller for query builder in GlobalJsonLinks]] [11:00:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:00:59] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [11:00:59] T371374: mediawiki-config: consolidate labswiki - https://phabricator.wikimedia.org/T371374 [11:01:38] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1153.eqiad.wmnet [11:02:24] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:02:58] (03PS4) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:03:04] !log ladsgroup@deploy1003 ladsgroup, hashar: Backport for [[gerrit:1161871|wikitech: remove logging configuration for hooks (T371592 T371374)]], [[gerrit:1163312|Specify caller for query builder in GlobalJsonLinks]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:03:40] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1153.eqiad.wmnet [11:03:47] !log akosiaris@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [11:04:00] !log ladsgroup@deploy1003 ladsgroup, hashar: Continuing with sync [11:05:22] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1154.eqiad.wmnet [11:05:50] (03PS25) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [11:06:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.00 [11:06:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.01 [11:06:59] (03PS5) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:07:00] (03PS1) 10Clément Goubert: sre.k8s.K8sBatchRunnerBase: Add minimal-cordon arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1163328 [11:07:24] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1154.eqiad.wmnet [11:09:01] (03CR) 10Ladsgroup: "yeah, I was planning to use that" [puppet] - 10https://gerrit.wikimedia.org/r/1155210 (https://phabricator.wikimedia.org/T363581) (owner: 10Ladsgroup) [11:10:54] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1161871|wikitech: remove logging configuration for hooks (T371592 T371374)]], [[gerrit:1163312|Specify caller for query builder in GlobalJsonLinks]] (duration: 10m 01s) [11:11:00] T371592: LdapAuthentication: Disable extension from Wikitech - https://phabricator.wikimedia.org/T371592 [11:11:00] T371374: mediawiki-config: consolidate labswiki - 
https://phabricator.wikimedia.org/T371374 [11:11:43] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:12:12] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [11:12:20] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [11:12:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-test-k8s: apply [11:12:41] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1176.eqiad.wmnet [11:12:51] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10941797 (10ops-monitoring-bot) Host an-worker1176.eqiad.wmnet rebooted by stevemunene@cumin1002 with reason: Rebooting... 
[11:13:36] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:13:46] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.01 [11:13:48] FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:14:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:16:07] (03PS26) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [11:20:24] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:20:28] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local.thumb.02 [11:20:29] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local.thumb.02 [11:20:32] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local.thumb.03 [11:20:33] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local.thumb.03 [11:20:36] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local.thumb.04 [11:20:37] !log mvernon@cumin1002 END (FAIL) - Cookbook sre.swift.check-dbs (exit_code=99) Checking container DBs of wikipedia-commons-local.thumb.04 [11:21:48] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of global-data-captcha-render [11:21:49] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1054.eqiad.wmnet [11:21:55] !log mvernon@cumin1002 END 
(PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of global-data-captcha-render [11:22:02] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [11:22:16] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: sync [11:23:04] (03CR) 10CI reject: [V:04-1] swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [11:23:22] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.02 [11:23:44] 06SRE, 10SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10941819 (10Ladsgroup) To be more specific, I'm currently seeing 0.1% growth every four days = 1% every forty days = 10% every 400 days. [11:23:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:24:15] (03PS27) 10MVernon: swift: add a check-dbs cookbook to check swift container dbs [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 [11:27:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1054.eqiad.wmnet [11:28:06] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [11:28:48] !log bump kubernetes-client to newest version on kubestagemaster100[3-5] - T387548 [11:28:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:53] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [11:28:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB 
(group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10941828 (10Stevemunene) Looking into an-worker1176 which is stuck booting due to an I/O error below ` [21148730.551862... [11:31:06] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: sync [11:31:52] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.02 [11:31:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.03 [11:32:01] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1052.eqiad.wmnet [11:33:46] (03PS1) 10Btullis: dse-k8s: Increase maximum container/pod size for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) [11:36:13] (03PS6) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:36:35] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [11:36:35] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_snippets (exit_code=99) Generate and push DNS records from Netbox data [11:37:30] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [11:37:53] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1052.eqiad.wmnet [11:38:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.03 [11:39:02] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.04 [11:40:09] (03CR) 10JMeybohm: [C:03+1] sre.k8s.K8sBatchRunnerBase: Add minimal-cordon arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1163328 (owner: 10Clément Goubert) [11:40:12] (03CR) 10CI reject: [V:04-1] dse-k8s: 
Increase maximum container/pod size for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [11:40:27] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.K8sBatchRunnerBase: Add minimal-cordon arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1163328 (owner: 10Clément Goubert) [11:42:33] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:43:15] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [11:44:25] !log bump kubernetes-client to newest version on deploy1003 and deploy2002 - T387548 [11:44:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:30] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [11:45:15] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm [11:45:25] (03PS2) 10Btullis: dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) [11:45:40] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:46:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.04 [11:47:35] (03PS1) 10Hnowlan: mobileapps: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163342 (https://phabricator.wikimedia.org/T397703) [11:47:39] (03Merged) 10jenkins-bot: sre.k8s.K8sBatchRunnerBase: Add minimal-cordon arg [cookbooks] - 10https://gerrit.wikimedia.org/r/1163328 (owner: 10Clément Goubert) [11:48:39] !log cmooney@cumin1003 START 
- Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [11:48:39] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_snippets (exit_code=99) Generate and push DNS records from Netbox data [11:50:05] (03PS7) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:50:10] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [11:50:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [11:50:18] (03CR) 10Hnowlan: [C:03+2] mobileapps: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163342 (https://phabricator.wikimedia.org/T397703) (owner: 10Hnowlan) [11:51:39] (03CR) 10CI reject: [V:04-1] dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [11:51:56] (03Merged) 10jenkins-bot: mobileapps: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163342 (https://phabricator.wikimedia.org/T397703) (owner: 10Hnowlan) [11:54:04] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host an-worker1176.eqiad.wmnet [11:55:53] (03CR) 10MVernon: "@rcoccioli@wikimedia.org thanks for your comments on this, it's ready for a re-review now (and I've used it quite a bit already via test-c" [cookbooks] - 10https://gerrit.wikimedia.org/r/1146007 (owner: 10MVernon) [11:56:14] (03PS3) 10Btullis: dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) [11:56:45] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [11:56:58] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
aux-k8s-worker1008.eqiad.wmnet with reason: host reimage [11:57:01] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [11:57:46] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.05 [11:57:46] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [11:58:13] (03PS8) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:58:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [11:58:18] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_snippets (exit_code=99) Generate and push DNS records from Netbox data [11:58:50] (03PS9) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [11:58:57] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [11:58:58] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [12:00:01] akosiaris@cumin1003 reimage (PID 2976928) is awaiting input [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1200) [12:02:11] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1008.eqiad.wmnet with reason: host reimage [12:02:12] (03PS10) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:02:13] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:02:14] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate 
and push DNS records from Netbox data [12:02:18] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:02:19] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [12:02:29] (03CR) 10CI reject: [V:04-1] dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [12:03:05] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136#10941924 (10Aklapper) a:05MoritzMuehlenhoff→03None @MoritzMuehlenhoff Removing task assignee as this open task has been assigned for more than two... [12:03:49] (03PS4) 10Btullis: dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) [12:03:51] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply [12:03:53] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#10941960 (10Aklapper) a:05Jclark-ctr→03None @Jclark-ctr Removing task assignee as this open task has been assigned for more than two years... 
[12:03:57] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [12:04:56] !log akosiaris@cumin1003 START - Cookbook sre.hosts.reimage for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm [12:04:57] !log check thumbnail db integrity T383053 [12:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:05:06] T383053: Container dbs for wikipedia-commons-local-thumb.f8 AWOL in codfw due to corruption - https://phabricator.wikimedia.org/T383053 [12:05:09] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:05:10] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [12:05:41] 06SRE, 06SRE-OnFire, 13Patch-Needs-Improvement: klaxon CLI tool for seeding an oncall handoff - https://phabricator.wikimedia.org/T317159#10942021 (10Aklapper) a:05CDanis→03None @CDanis Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.... [12:05:49] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786#10942023 (10Aklapper) a:05Clement_Goubert→03None @Clement_Goubert Removing task assignee as this open task has been assigned for more than t... 
[12:05:58] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1051.eqiad.wmnet [12:06:27] (03PS11) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:06:45] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2042.codfw.wmnet [12:06:49] 06SRE, 06serviceops, 10MediaWiki-Platform-Team (Radar): k8s/mw: traffic to eventgate dropped by iptables - https://phabricator.wikimedia.org/T249700#10942047 (10Aklapper) a:05akosiaris→03None @akosiaris Removing task assignee as this open task has been assigned for more than two years - See the email sen... [12:07:37] 06SRE, 06Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431#10942073 (10Aklapper) a:05ssingh→03None @ssingh Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this t... [12:08:07] 06SRE, 06All-and-every-Wikisource, 06Product-Analytics, 07Bengali-Sites, 07SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607#10942084 (10Aklapper) a:05SCherukuwada→03None @SCherukuwada Removing task assignee as this open task has been assigned for more... 
[12:09:11] (03PS1) 10Hnowlan: mobileapps: increase memory limit, drop replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163345 (https://phabricator.wikimedia.org/T397072) [12:09:11] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.05 [12:09:14] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.06 [12:09:33] (03PS12) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:09:40] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:09:40] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_snippets (exit_code=99) Generate and push DNS records from Netbox data [12:10:05] (03PS4) 10Urbanecm: [Growth] Disable the Surfacing Structured Tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) [12:10:24] (03PS13) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:10:30] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:10:30] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [12:10:52] (03CR) 10Urbanecm: [C:03+2] "let's ship this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [12:11:38] (03PS14) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:11:38] (03Merged) 10jenkins-bot: [Growth] Disable the Surfacing Structured Tasks feature [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163027 (https://phabricator.wikimedia.org/T397515) (owner: 10Urbanecm) [12:11:39] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1051.eqiad.wmnet [12:11:43] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:11:44] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [12:12:23] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1163027|[Growth] Disable the Surfacing Structured Tasks feature (T397515)]] [12:12:28] T397515: End the Surfacing Structured Tasks experiment - https://phabricator.wikimedia.org/T397515 [12:12:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10942114 (10Jclark-ctr) Per conversations with @stevemunene in IRC we swapped 1149 for 1175 since i... 
[12:12:29] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2042.codfw.wmnet [12:12:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 10 - multiple racks - singletons) - https://phabricator.wikimedia.org/T390178#10942116 (10Jclark-ctr) [12:13:18] !log bump kubernetes-client to newest version on ml-staging-ctrl200[12] - T387548 [12:13:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:23] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [12:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:14:36] (03CR) 10Brouberol: [C:03+1] dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [12:14:38] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1163027|[Growth] Disable the Surfacing Structured Tasks feature (T397515)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[12:16:15] feature is indeed disabled on debug server [12:16:27] !log urbanecm@deploy1003 urbanecm: Continuing with sync [12:16:33] !log akosiaris@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host reimage [12:16:46] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1008.eqiad.wmnet with OS bookworm [12:17:27] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1163311 (owner: 10Muehlenhoff) [12:17:35] !log cgoubert@cumin1003 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-worker-codfw [12:18:22] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [12:18:27] (03PS2) 10Urbanecm: [Growth] testwiki: Enable the get-started-experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163292 (https://phabricator.wikimedia.org/T394958) [12:18:36] (03PS15) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [12:18:42] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_snippets Generate and push DNS records from Netbox data [12:18:43] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox_snippets (exit_code=0) Generate and push DNS records from Netbox data [12:18:45] (03CR) 10Urbanecm: [C:03+2] [Growth] testwiki: Enable the get-started-experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163292 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [12:19:32] (03Merged) 10jenkins-bot: [Growth] testwiki: Enable the get-started-experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163292 (https://phabricator.wikimedia.org/T394958) (owner: 10Urbanecm) [12:19:56] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aux-k8s-worker1009.eqiad.wmnet with reason: host 
reimage [12:20:47] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.06 [12:20:50] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.07 [12:21:08] 06SRE, 10Beta-Cluster-Infrastructure, 06Data-Persistence, 06Traffic: ATS isn't caching documents in deployment-cache-upload07 - https://phabricator.wikimedia.org/T322575#10942159 (10Aklapper) a:05Vgutierrez→03None @Vgutierrez: Removing task assignee as this open task has been assigned for more than two... [12:22:30] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [12:22:35] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [12:23:14] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163027|[Growth] Disable the Surfacing Structured Tasks feature (T397515)]] (duration: 10m 51s) [12:23:20] T397515: End the Surfacing Structured Tasks experiment - https://phabricator.wikimedia.org/T397515 [12:23:36] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06serviceops-radar, 10Sustainability (Incident Followup): Create an automated alert for 'too many nodes depooled from a service' - https://phabricator.wikimedia.org/T245058#10942242 (10Aklapper) a:05Joe→03None @Joe: Removing task assignee as this... [12:23:49] 14SRE-Sprint-Week-Sustainability-March2023, 10conftool, 06serviceops-radar, 10Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059#10942245 (10Aklapper) a:05Joe→0... 
[12:24:32] !log urbanecm@deploy1003 Started scap sync-world: Backport for [[gerrit:1163292|[Growth] testwiki: Enable the get-started-experiment (T394958)]] [12:24:38] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958 [12:25:32] 06SRE, 06serviceops, 10Wikimedia-Apache-configuration: Incorrect handling of ETags taking precedence over timestamps in conditional requests - https://phabricator.wikimedia.org/T320241#10942309 (10Aklapper) a:05jijiki→03None @jijiki: Removing task assignee as this open task has been assigned for more tha... [12:27:07] (03CR) 10Effie Mouzeli: [C:03+1] mobileapps: increase memory limit, drop replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163345 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [12:27:28] !log urbanecm@deploy1003 urbanecm: Backport for [[gerrit:1163292|[Growth] testwiki: Enable the get-started-experiment (T394958)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:28:23] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277#10942407 (10Aklapper) a:05SLyngshede-WMF→03None @SLyngshede-WMF: Removing task assignee as this open task has been assigned for more than two years - See the e... [12:28:29] 06SRE, 10Bitu, 06Infrastructure-Foundations: Display meta.wikimedia.org username, if authenticated, before linking - https://phabricator.wikimedia.org/T335955#10942410 (10Aklapper) a:05SLyngshede-WMF→03None @SLyngshede-WMF: Removing task assignee as this open task has been assigned for more than two year... 
[12:28:33] 06SRE, 06Infrastructure-Foundations, 10netops: Store network users in Bitu/LDAP - https://phabricator.wikimedia.org/T335870#10942412 (10Aklapper) a:05SLyngshede-WMF→03None @SLyngshede-WMF: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-2... [12:28:41] 06SRE, 10Bitu, 06Infrastructure-Foundations: Bitu IDM - Feedback - https://phabricator.wikimedia.org/T335470#10942415 (10Aklapper) a:05SLyngshede-WMF→03None @SLyngshede-WMF: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assig... [12:29:25] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-in-lists - https://phabricator.wikimedia.org/T325404#10942459 (10Aklapper) a:05jhathaway→03None @jhathaway: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this t... [12:29:31] 06SRE, 06Infrastructure-Foundations, 10Mail: Provision mx-out-lists - https://phabricator.wikimedia.org/T325405#10942454 (10Aklapper) a:05jhathaway→03None @jhathaway: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this... [12:29:47] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim with Postfix on mail servers - https://phabricator.wikimedia.org/T325394#10942461 (10Aklapper) a:05jhathaway→03None @jhathaway: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22.... [12:29:53] 06SRE, 06Infrastructure-Foundations, 10Mail: Replace Exim null client config with a Postfix null client config - https://phabricator.wikimedia.org/T325408#10942468 (10Aklapper) a:05jhathaway→03None @jhathaway: Removing task assignee as this open task has been assigned for more than two years - See the em... 
[12:30:03] 10SRE-swift-storage, 10Thumbor: Thumbor 404s on an auth failure to Swift - https://phabricator.wikimedia.org/T332210#10942474 (10Aklapper) a:05hnowlan→03None @hnowlan: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this t... [12:30:07] 06SRE, 06serviceops, 10Thumbor: Thumbor-k8s performance improvements - https://phabricator.wikimedia.org/T333445#10942470 (10Aklapper) a:05hnowlan→03None @hnowlan: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this tas... [12:30:17] 06SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699#10942478 (10Aklapper) a:05hnowlan→03None @hnowlan: Removing task assignee as this open task has been assigned for more than two years - See the email sent on 2025-05-22. Please assign this task... [12:30:27] (03PS1) 10Jgiannelos: proton: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163348 [12:30:39] 06SRE, 10Observability-Metrics, 06serviceops, 10Maps (Kartotherian): Get Kartotherian SLO metrics into Prometheus - https://phabricator.wikimedia.org/T320748#10942480 (10Aklapper) a:05hnowlan→03None @hnowlan: Removing task assignee as this open task has been assigned for more than two years - See the e... 
[12:31:12] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/proton: apply [12:32:02] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [12:32:04] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.07 [12:32:07] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.08 [12:32:15] (03PS1) 10Jgiannelos: wikifeeds: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163349 [12:32:22] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#10942550 (10Jclark-ctr) @Aklapper @ayounsi I hadn’t commented earlier because we needed to verify onsite that we still had enough available por... [12:32:25] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/proton: apply [12:32:26] (03CR) 10Elukey: [C:03+1] Add debmonitor-next to caches [puppet] - 10https://gerrit.wikimedia.org/r/1163311 (owner: 10Muehlenhoff) [12:32:30] (03CR) 10Jgiannelos: [C:03+2] proton: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163348 (owner: 10Jgiannelos) [12:32:34] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/proton: apply [12:33:06] (03CR) 10Jgiannelos: [C:03+2] wikifeeds: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163349 (owner: 10Jgiannelos) [12:33:10] (03PS4) 10Cmelo: Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) [12:33:14] (03Merged) 10jenkins-bot: proton: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163348 (owner: 
10Jgiannelos) [12:33:18] 06SRE, 10RESTBase: Restbase: traffic to 3050/udp dropped by iptables - https://phabricator.wikimedia.org/T249699#10942559 (10hnowlan) 05Open→03Declined This is probably still an issue, but it is not something we'll fix as almost every component has been migrated out of restbase. [12:33:22] (03CR) 10CI reject: [V:04-1] Release the CampaignEvents extension to all Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1162967 (https://phabricator.wikimedia.org/T396784) (owner: 10Cmelo) [12:33:28] (03PS1) 10Jgiannelos: push-notifications: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163350 [12:33:36] (03CR) 10Muehlenhoff: [C:03+2] Add debmonitor-next to caches [puppet] - 10https://gerrit.wikimedia.org/r/1163311 (owner: 10Muehlenhoff) [12:33:40] (03Merged) 10jenkins-bot: wikifeeds: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163349 (owner: 10Jgiannelos) [12:33:50] (03PS1) 10Jgiannelos: mobileapps: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163351 [12:33:53] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/proton: apply [12:34:05] (03PS2) 10Jgreen: Change DMARC aggregate report address for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1156352 (https://phabricator.wikimedia.org/T394788) [12:34:21] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin depooling P{lvs3008.esams.wmnet} and A:liberica (T396561) [12:34:25] !log akosiaris@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host aux-k8s-worker1009.eqiad.wmnet with OS bookworm [12:34:26] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 [12:34:42] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) depooling P{lvs3008.esams.wmnet} and A:liberica (T396561) [12:34:54] (03CR) 10Jgreen: [C:03+1] Swap in 
frnetmon1002 and remove frnetmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/1163044 (https://phabricator.wikimedia.org/T395831) (owner: 10Dwisehaupt) [12:35:09] (03CR) 10Jgreen: [C:03+2] Change DMARC aggregate report address for donate.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1156352 (https://phabricator.wikimedia.org/T394788) (owner: 10Jgreen) [12:35:14] !log vgutierrez@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on lvs3008.esams.wmnet with reason: switching to katran [12:35:17] (03CR) 10Vgutierrez: [C:03+2] hiera: Switch lvs3008 (text) to katran [puppet] - 10https://gerrit.wikimedia.org/r/1163325 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [12:35:34] seems it works [12:35:37] !log urbanecm@deploy1003 urbanecm: Continuing with sync [12:35:46] !log jgreen@dns1004 START - running authdns-update [12:36:51] !log jgreen@dns1004 END - running authdns-update [12:37:52] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:37:55] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:38:00] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:38:30] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:38:52] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:39:33] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:39:37] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:39:41] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:39:45] !log jgiannelos@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/wikifeeds: apply [12:40:32] (03Abandoned) 10Jgiannelos: mobileapps: Test nodejs 20 on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162940 (owner: 10Jgiannelos) [12:40:51] (03CR) 10Jgiannelos: [C:03+2] push-notifications: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163350 (owner: 10Jgiannelos) [12:42:26] (03Merged) 10jenkins-bot: push-notifications: Deploy node20 upgrade to prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163350 (owner: 10Jgiannelos) [12:42:51] !log urbanecm@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163292|[Growth] testwiki: Enable the get-started-experiment (T394958)]] (duration: 18m 18s) [12:42:57] T394958: Shorten the NotificationGetStartedJob delay on pilot wikis as an experiment - https://phabricator.wikimedia.org/T394958 [12:43:27] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.08 [12:43:30] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.09 [12:43:39] !log bump kubernetes-client to newest version on dse-k8s-ctrl100[12] - T387548 [12:43:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:43:45] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [12:43:58] (03CR) 10Muehlenhoff: [C:03+2] Add IDP config for debmonitor-next [puppet] - 10https://gerrit.wikimedia.org/r/1163257 (owner: 10Muehlenhoff) [12:44:19] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [12:45:07] (03PS1) 10Tchanders: Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 [12:45:09] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [12:45:43] (03PS1) 
10Cathal Mooney: Add new reposync repo called 'netbox-dns-records' [puppet] - 10https://gerrit.wikimedia.org/r/1163355 (https://phabricator.wikimedia.org/T362985) [12:46:19] (03Abandoned) 10Tchanders: Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (owner: 10Tchanders) [12:47:25] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/push-notifications: apply [12:47:47] !log vgutierrez@cumin1002 START - Cookbook sre.hosts.remove-downtime for lvs3008.esams.wmnet [12:47:48] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs3008.esams.wmnet [12:48:03] (03Restored) 10Tchanders: Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (owner: 10Tchanders) [12:48:06] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [12:48:50] (03PS1) 10Vgutierrez: hiera: Repool lvs3008 using katran [puppet] - 10https://gerrit.wikimedia.org/r/1163356 (https://phabricator.wikimedia.org/T396561) [12:49:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163356 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [12:49:10] (03CR) 10Tchanders: "Restored since it is the right way to do this, after all. 
I wrongly thought that the config was default false in extension.json" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (owner: 10Tchanders) [12:50:55] !log bump kubernetes-client to newest version on aux-k8s-ctrl* - T387548 [12:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:51:00] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [12:51:53] (03PS2) 10Cathal Mooney: Add new reposync repo called 'netbox-dns-records' [puppet] - 10https://gerrit.wikimedia.org/r/1163355 (https://phabricator.wikimedia.org/T362985) [12:52:08] (03PS2) 10Tchanders: Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (https://phabricator.wikimedia.org/T395933) [12:54:04] (03CR) 10Vgutierrez: [C:03+2] hiera: Repool lvs3008 using katran [puppet] - 10https://gerrit.wikimedia.org/r/1163356 (https://phabricator.wikimedia.org/T396561) (owner: 10Vgutierrez) [12:54:11] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops: Disk (sde) failed in ms-be1071 - https://phabricator.wikimedia.org/T397343#10942672 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr @MatthewVernon this server is out of warranty replace it with spare drive. 
[12:54:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.09 [12:54:53] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.0a [12:56:53] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [12:57:04] (03PS3) 10Cathal Mooney: Add new reposync repo called 'netbox-dns' [puppet] - 10https://gerrit.wikimedia.org/r/1163355 (https://phabricator.wikimedia.org/T362985) [12:57:36] (03PS1) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 [12:58:42] (03CR) 10STran: [C:03+1] Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (https://phabricator.wikimedia.org/T395933) (owner: 10Tchanders) [12:59:10] (03CR) 10Giuseppe Lavagetto: [C:03+2] fetch_external_cloud: stop depending on requestctl libraries [puppet] - 10https://gerrit.wikimedia.org/r/1161869 (owner: 10Giuseppe Lavagetto) [12:59:26] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:59:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:59:50] (03CR) 10Kosta Harlan: [C:03+1] Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (https://phabricator.wikimedia.org/T395933) (owner: 10Tchanders) [12:59:56] !log vgutierrez@cumin1002 START - Cookbook sre.loadbalancer.admin config_reloading P{lvs3008.esams.wmnet} and A:liberica (T396561) [12:59:59] !log repool lvs3008 using katran - T396561 [13:00:01] T396561: Switch to katran as forwarding plane on non-core DCs - https://phabricator.wikimedia.org/T396561 
[13:00:05] Urbanecm and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1300) [13:00:05] cormacparle and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:14] !log vgutierrez@cumin1002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) config_reloading P{lvs3008.esams.wmnet} and A:liberica (T396561) [13:00:39] (03CR) 10Ssingh: [C:03+1] cache: Provide duration metrics on wmfuniq experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1163301 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:01:03] jouncebot: you may not! :) [13:01:06] (03CR) 10Ssingh: [C:03+1] "Sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/1163355 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:01:23] o/ [13:01:29] (03CR) 10Vgutierrez: [C:03+2] cache: Provide duration metrics on wmfuniq experiment fetcher [puppet] - 10https://gerrit.wikimedia.org/r/1163301 (https://phabricator.wikimedia.org/T395001) (owner: 10Vgutierrez) [13:01:34] Tchanders: wanna self-deploy? or should i help somehow? [13:01:41] o/ [13:01:48] _joe_: Giuseppe Lavagetto: fetch_external_cloud: stop depending on requestctl libraries (39171590f9) ok to merge? [13:02:01] <_joe_> vgutierrez: yeah I was about to [13:02:10] merging [13:02:16] <_joe_> thanks [13:02:48] urbanecm: It's inconvenient for both of us as we're in the same meeting... [13:03:03] cormacparle: Are you self-deploying first? [13:03:14] (done) [13:03:47] Tchanders: I can do ... haven't done a deployment in ages, these days I just click the link right? 
[13:04:02] Yeah, it will prompt you when to test etc [13:04:28] cool, will do that now so [13:04:53] (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) [13:05:41] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.0a [13:05:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.0b [13:06:34] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942719 (10Ladsgroup) exim4 queue is growing without bound: https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-2d&to=now&timezone=ut... [13:06:43] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786#10942721 (10Clement_Goubert) a:03Clement_Goubert [13:07:07] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942726 (10Ladsgroup) {F62443098} [13:07:55] (03PS2) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 [13:07:59] (03CR) 10Clément Goubert: [C:03+1] mobileapps: increase memory limit, drop replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163345 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [13:08:14] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942733 (10Ladsgroup) hmm that queue doesn't belong to exim4, it's on mailman3. I restarted the service for mailman3 too. 
[13:08:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:08:41] urbanecm: I can login to CAS, but not spiderpig :( [13:09:03] cormacparle: No worries, I can do it [13:09:21] I'll get yours going now... [13:09:35] cormacparle: what do you mean by "can't"? [13:09:44] happy to advise if you share err message [13:09:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942746 (10Ladsgroup) I see smtp logs in mailman3 advancing. Let's see if it's getting there [13:10:18] "Authentication Failure [13:10:18] Service access denied due to missing privileges. " [13:10:56] (03PS2) 10Tchanders: InitialiseSettings: Enable TemplateDiscovery on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163319 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [13:11:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942754 (10Ladsgroup) The queue is going down slowly: (note that the base is not zero) {F62443127} [13:11:58] cormacparle: looks like you are not in the correct LDAP group. can be requested via https://idm.wikimedia.org/permissions/, should be easy to fix as you have deployment access already. [13:12:13] cormacparle: Any chance of a +1 from someone on your patch? 
[13:12:43] (03CR) 10Cparle: [C:03+1] InitialiseSettings: Enable TemplateDiscovery on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163319 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [13:12:53] Tchanders: done [13:13:14] urbanecm: cool [13:13:21] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942762 (10ABran-WMF) Thanks for the debug @Ladsgroup ` Jun 23 06:02:21 lists1004 mailman3[465169]: Jun 23 06:02:21 2025 (465169) Running detector:... [13:13:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163319 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [13:13:47] cormacparle: Thanks [13:13:53] (03CR) 10Dreamy Jazz: [C:03+1] Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (https://phabricator.wikimedia.org/T395933) (owner: 10Tchanders) [13:14:33] (03Merged) 10jenkins-bot: InitialiseSettings: Enable TemplateDiscovery on almost all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163319 (https://phabricator.wikimedia.org/T377975) (owner: 10Samwilson) [13:15:00] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1163319|InitialiseSettings: Enable TemplateDiscovery on almost all wikis (T377975)]] [13:15:05] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [13:15:20] (03CR) 10Cathal Mooney: [C:03+2] Add new reposync repo called 'netbox-dns' [puppet] - 10https://gerrit.wikimedia.org/r/1163355 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [13:15:57] (03CR) 10CI reject: [V:04-1] reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) 
[13:16:39] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10942780 (10Jclark-ctr) Replaced drive with 300gb ssd @btullis can you verify it is good prior to closing? [13:17:11] !log tchanders@deploy1003 tchanders, samwilson: Backport for [[gerrit:1163319|InitialiseSettings: Enable TemplateDiscovery on almost all wikis (T377975)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:17:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.0b [13:17:32] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.0c [13:19:18] cormacparle: It's ready for you to test [13:19:55] (03PS1) 10Muehlenhoff: Add debmonitor-next.w.o [dns] - 10https://gerrit.wikimedia.org/r/1163363 (https://phabricator.wikimedia.org/T397696) [13:20:22] Tchanders: seems to be working fine [13:21:10] (03PS3) 10Ayounsi: reimage: add MAC address support for physical hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 [13:21:14] Alright, continuing... 
[13:21:40] (03PS1) 10Cwhite: logstash: provide default for when age field is nil [puppet] - 10https://gerrit.wikimedia.org/r/1163364 [13:22:28] (03PS2) 10Muehlenhoff: Add debmonitor-next.w.o [dns] - 10https://gerrit.wikimedia.org/r/1163363 (https://phabricator.wikimedia.org/T397696) [13:22:34] !log tchanders@deploy1003 tchanders, samwilson: Continuing with sync [13:23:09] (03PS4) 10Giuseppe Lavagetto: haproxy: remove conditionals on wikimedia_trust [puppet] - 10https://gerrit.wikimedia.org/r/1152894 [13:23:09] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:ulsfo and A:cp - 9.2.11 upgrade (T397456) [13:23:15] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [13:24:57] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942813 (10Ladsgroup) >>! In T397642#10942762, @ABran-WMF wrote: > There was an issue on the detectors that matches that issue's timing This should... [13:26:08] (03PS1) 10Zoranzoki21: Enable block feature for AbuseFilter on all small Serbian wikiprojects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) [13:26:52] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10942822 (10ABran-WMF) I don't think it is worth fixing on our end either, I'll follow up with a bug report on upstream with more context [13:27:10] Hi, I'm hoping I'm not late for deplyoment window. I'd like to get one conifg patch deployed in this window. 
[13:27:13] *config [13:27:19] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1163365 [13:29:21] !log akosiaris@cumin1003 START - Cookbook sre.hosts.provision for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:29:28] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163319|InitialiseSettings: Enable TemplateDiscovery on almost all wikis (T377975)]] (duration: 14m 28s) [13:29:34] (03PS1) 10Majavah: P:toolforge: Install components-cli on bastions [puppet] - 10https://gerrit.wikimedia.org/r/1163367 (https://phabricator.wikimedia.org/T397718) [13:29:34] T377975: Enable template favouriting on Beta, pilot wikis, and test - https://phabricator.wikimedia.org/T377975 [13:29:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.0c [13:29:39] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.0d [13:29:43] !log akosiaris@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host build2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART [13:30:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:30:45] Starting the next config... 
[13:32:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (https://phabricator.wikimedia.org/T395933) (owner: 10Tchanders) [13:32:18] 06SRE, 10MW-on-K8s, 06serviceops, 06Traffic, 10Release-Engineering-Team (Seen): Deploy mediawiki kubernetes services - https://phabricator.wikimedia.org/T321786#10942871 (10Clement_Goubert) 05In progress→03Resolved [13:33:05] (03Merged) 10jenkins-bot: Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163354 (https://phabricator.wikimedia.org/T395933) (owner: 10Tchanders) [13:33:26] !log tchanders@deploy1003 Started scap sync-world: Backport for [[gerrit:1163354|Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" (T395933)]] [13:33:31] T395933: Enable the temporary accounts onboarding dialog on WMF wikis - https://phabricator.wikimedia.org/T395933 [13:33:44] !log bking@cumin2002 conftool action : set/pooled=false; selector: dnsdisc=wdqs,name=codfw [13:34:16] (03CR) 10Btullis: [C:03+2] dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [13:34:47] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-text_eqiad - 7.1.1-2~bpo11+wmf2 upgrade (T396581) [13:34:53] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [13:34:56] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqiad - 7.1.1-2~bpo11+wmf2 upgrade (T396581) [13:35:44] !log tchanders@deploy1003 tchanders: Backport for [[gerrit:1163354|Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" (T395933)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). 
Changes can now be verified there. [13:36:13] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload_eqiad - 7.1.1-2~bpo11+wmf2 upgrade (T396581) [13:36:22] Tchanders: is my one finished? [13:37:06] !log vgutierrez@cumin1002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_eqiad - 7.1.1-2~bpo11+wmf2 upgrade (T396581) [13:38:11] cormacparle: sorry - yes! [13:38:25] Tested my one, looks good, continuing... [13:38:25] cool, thank you! [13:38:34] !log tchanders@deploy1003 tchanders: Continuing with sync [13:41:01] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.0d [13:41:04] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.0e [13:41:06] (03Merged) 10jenkins-bot: dse-k8s: Configure limitranges for mediawiki-dumps-legacy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163341 (https://phabricator.wikimedia.org/T394389) (owner: 10Btullis) [13:42:09] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10942925 (10Stevemunene) Reimaging `an-worker1176` due to missing root partition ` Gave up waiting for root file syste... [13:42:35] (03CR) 10Elukey: [C:03+1] Add debmonitor-next.w.o [dns] - 10https://gerrit.wikimedia.org/r/1163363 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [13:43:22] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:43:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:43:41] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:44:14] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [13:44:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10942930 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-wo... [13:44:25] (03CR) 10KartikMistry: "Should we merge this before we deploy S3 patch for the entrypoint? ie https://gerrit.wikimedia.org/r/c/mediawiki/services/machinetranslati" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162900 (owner: 10Klausman) [13:44:33] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:45:14] !log tchanders@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163354|Revert^2 "Enable temporary accounts onboarding dialog on WMF wikis" (T395933)]] (duration: 11m 48s) [13:45:21] T395933: Enable the temporary accounts onboarding dialog on WMF wikis - https://phabricator.wikimedia.org/T395933 [13:45:57] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:drmrs and A:cp - 9.2.11 upgrade (T397456) [13:46:03] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [13:48:02] !log UTC afternoon deploys done [13:48:06] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:06] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.0e [13:51:07] Tchanders: You missed my config patch, but okay, I'll move it to the next backport window. [13:51:09] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.0f [13:52:02] Kizule: Ah, I missed that it got added during the window [13:52:38] It's okay, I moved it to the next one. [13:52:49] Ok [13:53:18] (03CR) 10Andrew Bogott: [C:03+2] Neutron policy: Update port policies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163003 (owner: 10Andrew Bogott) [13:54:06] (03PS1) 10Volans: src_packages: add migration for OS model [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163368 (https://phabricator.wikimedia.org/T368744) [13:54:09] (03PS1) 10Volans: kubernetes: add a new kubernetes section [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163369 (https://phabricator.wikimedia.org/T397696) [13:55:21] !log bump kubernetes-client to newest version on ml_serve-ctrl* - T387548 [13:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:28] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [13:56:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.298s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:58:48] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye 
[13:58:53] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10942986 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker... [13:59:27] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1176.eqiad.wmnet with OS bullseye [13:59:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10942999 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1002 for host an-wo... [14:01:16] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1050.eqiad.wmnet [14:01:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.298s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:01:35] (03PS6) 10Fabfur: install_server: UEFI setup for cp20[43-58] [puppet] - 10https://gerrit.wikimedia.org/r/1162840 (https://phabricator.wikimedia.org/T392851) [14:02:28] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2043.codfw.wmnet [14:02:43] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.0f [14:02:45] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.10 [14:02:54] (03CR) 10Hnowlan: [C:03+2] mobileapps: increase memory limit, drop replicas [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1163345 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [14:03:53] jouncebot: nowandnext [14:03:53] No deployments scheduled for the next 0 hour(s) and 26 minute(s) [14:03:53] In 0 hour(s) and 26 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1430) [14:04:17] !log bump kubernetes-client to newest version on wikikube-ctrl100[1-4] - T387548 [14:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:22] T387548: Fix alternatives entries in helm and kubernetes-client packages - https://phabricator.wikimedia.org/T387548 [14:04:26] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10943012 (10ABran-WMF) Ticket opened on mailman's issue tracker: https://gitlab.com/mailman/mailman/-/issues/1227 [14:04:32] (03Merged) 10jenkins-bot: mobileapps: increase memory limit, drop replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163345 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [14:06:14] (03PS1) 10Joely Rooke WMDE: Activate feature to resolve wikibase link labels in pilot wiki changelists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) [14:07:07] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1050.eqiad.wmnet [14:07:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.10 [14:07:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.11 [14:08:30] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2043.codfw.wmnet [14:11:05] jouncebot: now [14:11:05] No deployments scheduled for the next 0 hour(s) and 18 
minute(s) [14:11:29] (03CR) 10Effie Mouzeli: [C:03+1] Switch mc-wf1001 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1161508 (owner: 10Muehlenhoff) [14:11:49] jouncebot: nowandnext [14:11:49] No deployments scheduled for the next 0 hour(s) and 18 minute(s) [14:11:49] In 0 hour(s) and 18 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1430) [14:12:21] (03CR) 10Effie Mouzeli: [C:03+1] Clean up EventBus and jobs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163323 (https://phabricator.wikimedia.org/T397367) (owner: 10Ladsgroup) [14:13:36] (03CR) 10Effie Mouzeli: [C:03+2] otel: add tolerations for mw-experimental hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:14:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.11 [14:14:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.12 [14:15:36] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply [14:16:04] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [14:18:16] (03PS1) 10Dreamy Jazz: Fix broken German translation causing message to not render [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163378 [14:18:33] Anyone mind if I deploy? 
[14:19:52] (03Merged) 10jenkins-bot: otel: add tolerations for mw-experimental hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1160173 (https://phabricator.wikimedia.org/T396767) (owner: 10Effie Mouzeli) [14:20:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.12 [14:20:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.13 [14:20:17] (03CR) 10Dreamy Jazz: [C:03+2] Fix broken German translation causing message to not render [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163378 (owner: 10Dreamy Jazz) [14:20:55] Dreamy_Jazz: can you hold on for a wee bit? in theory, what I am about to deploy, will not have much impact [14:20:58] (03PS16) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:21:06] (03CR) 10JHathaway: "looks good, a couple of questions. Not sure about the error handling. What do we expect users to do when the see the errors? Should we ask" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) [14:21:11] (03PS2) 10Scott French: wmnet: direct swift-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1163055 (https://phabricator.wikimedia.org/T376237) [14:21:13] Okay. I just +2'd the change I was going to backport [14:21:21] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_records Generate and push DNS records from Netbox data [14:21:22] cool thankx [14:21:22] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_records (exit_code=99) Generate and push DNS records from Netbox data [14:21:45] Will you be backporting before my change would merge or do you need me to stop gate-and-submit-wmf for it? 
[14:22:00] (03CR) 10JHathaway: [C:03+1] "looks good" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [14:22:07] !log stevemunene@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-worker1176.eqiad.wmnet with OS bullseye [14:22:12] (03PS17) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:22:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Upgrade an-worker hard drives from 4TB to 8TB (group 9 - rack E3) - https://phabricator.wikimedia.org/T390176#10943098 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1002 for host an-worker... [14:22:20] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_records Generate and push DNS records from Netbox data [14:22:21] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_records (exit_code=99) Generate and push DNS records from Netbox data [14:22:25] !log jhancock@cumin1003 START - Cookbook sre.dns.netbox [14:22:56] (03PS18) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:23:01] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_records Generate and push DNS records from Netbox data [14:23:02] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_records (exit_code=99) Generate and push DNS records from Netbox data [14:23:49] (03PS19) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [14:23:57] (03CR) 10Dreamy Jazz: Fix broken German translation causing message to not render [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163378 (owner: 10Dreamy Jazz) [14:23:59] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_records Generate and push DNS records from Netbox data [14:24:33] (03CR) 10Scott French: "Thanks for 
the reviews!" [dns] - 10https://gerrit.wikimedia.org/r/1163055 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [14:25:01] !log jiji@deploy1003 helmfile [codfw] START helmfile.d/admin 'apply'. [14:25:03] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_records (exit_code=99) Generate and push DNS records from Netbox data [14:25:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.13 [14:25:35] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.14 [14:25:46] !log jhancock@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2005-dev to codfw - jhancock@cumin1003" [14:25:47] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1049.eqiad.wmnet [14:25:51] !log jhancock@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding cloudcephosd2005-dev to codfw - jhancock@cumin1003" [14:25:51] !log jhancock@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:25:56] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2005-dev [14:25:57] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2006-dev [14:25:58] !log jhancock@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host cloudcephosd2007-dev [14:26:03] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host cloudcephosd2006-dev [14:26:04] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudcephosd2005-dev [14:26:08] !log jhancock@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces 
(exit_code=0) for host cloudcephosd2007-dev [14:26:19] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:26:37] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:26:47] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:26:48] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2044.codfw.wmnet [14:26:58] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:27:12] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:27:20] (03CR) 10Scott French: [C:03+2] wmnet: direct swift-ro DYNA record to failoid [dns] - 10https://gerrit.wikimedia.org/r/1163055 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [14:27:44] !log jiji@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:27:55] (03PS1) 10Ebernhardson: rdf-streaming-updater: Update codfw savepoint path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163380 [14:28:01] !log swfrench@dns1004 START - running authdns-update [14:28:20] !log jiji@deploy1003 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:29:08] !log jiji@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:29:14] !log swfrench@dns1004 END - running authdns-update [14:29:15] effie: Can you ping me when you are done? 
[14:29:43] sure sure [14:29:59] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [14:30:05] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1430) [14:30:41] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:31:22] Dreamy_Jazz: done [14:31:24] jhancock@cumin1003 provision (PID 2998702) is awaiting input [14:31:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.14 [14:31:27] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.15 [14:31:29] jhancock@cumin1003 provision (PID 2998753) is awaiting input [14:31:30] jhancock@cumin1003 provision (PID 2998778) is awaiting input [14:31:30] Thanks [14:31:40] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1049.eqiad.wmnet [14:31:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy1003 using scap backport" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163378 (owner: 10Dreamy Jazz) [14:32:18] (03PS1) 10Cathal Mooney: Netbox hosts: add netbox-dns reposync repo so it is available [puppet] - 10https://gerrit.wikimedia.org/r/1163382 (https://phabricator.wikimedia.org/T362985) [14:32:31] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2044.codfw.wmnet [14:34:02] (03Merged) 10jenkins-bot: Fix broken German translation causing message to not render [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163378 
(owner: 10Dreamy Jazz) [14:34:28] !log dreamyjazz@deploy1003 Started scap sync-world: Backport for [[gerrit:1163378|Fix broken German translation causing message to not render]] [14:34:34] (03CR) 10Ebernhardson: [C:03+2] rdf-streaming-updater: Update codfw savepoint path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163380 (owner: 10Ebernhardson) [14:35:06] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:35:11] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:35:24] !log jhancock@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [14:36:16] (03Merged) 10jenkins-bot: rdf-streaming-updater: Update codfw savepoint path [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163380 (owner: 10Ebernhardson) [14:36:40] (03CR) 10Jforrester: "Does someone else have a machine or docker image that the random Python for logos can run on? I can't…" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [14:37:05] !log dreamyjazz@deploy1003 dreamyjazz: Backport for [[gerrit:1163378|Fix broken German translation causing message to not render]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:37:44] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.15 [14:37:47] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.16 [14:37:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:37:51] !log dreamyjazz@deploy1003 dreamyjazz: Continuing with sync [14:38:30] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:39:26] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc1048.eqiad.wmnet [14:40:19] PROBLEM - LDAP -writable server- on seaborgium is CRITICAL: Could not bind to the LDAP server https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [14:40:41] !log jiji@cumin1003 START - Cookbook sre.hosts.reboot-single for host mc2045.codfw.wmnet [14:42:13] (03PS1) 10Hnowlan: Revert "mobileapps: increase memory limit, drop replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163384 [14:42:17] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1070681 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:42:19] RECOVERY - LDAP -writable server- on seaborgium is OK: LDAP OK - 0.007 seconds response time https://wikitech.wikimedia.org/wiki/LDAP%23Troubleshooting [14:42:31] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090583 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott 
French) [14:42:45] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1090585 (https://phabricator.wikimedia.org/T352245) (owner: 10Scott French) [14:43:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.16 [14:43:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.17 [14:43:49] (03PS1) 10Cwhite: logstash: enable filter_ecs_cleanup_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163386 (https://phabricator.wikimedia.org/T234565) [14:44:21] (03CR) 10CI reject: [V:04-1] logstash: enable filter_ecs_cleanup_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163386 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [14:44:57] !log dreamyjazz@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163378|Fix broken German translation causing message to not render]] (duration: 10m 29s) [14:44:58] (03CR) 10Hnowlan: [C:03+2] Revert "mobileapps: increase memory limit, drop replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163384 (owner: 10Hnowlan) [14:45:07] (03PS2) 10Cwhite: logstash: enable filter_ecs_cleanup_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163386 (https://phabricator.wikimedia.org/T234565) [14:45:18] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1048.eqiad.wmnet [14:46:26] !log jiji@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2045.codfw.wmnet [14:46:45] (03Merged) 10jenkins-bot: Revert "mobileapps: increase memory limit, drop replicas" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163384 (owner: 10Hnowlan) [14:47:28] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [14:47:51] FIRING: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - 
https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:47:53] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:ulsfo and A:cp - 9.2.11 upgrade (T397456) [14:47:58] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [14:47:59] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [14:48:27] (03CR) 10David Caro: [C:03+1] P:toolforge: Install components-cli on bastions [puppet] - 10https://gerrit.wikimedia.org/r/1163367 (https://phabricator.wikimedia.org/T397718) (owner: 10Majavah) [14:48:51] (03CR) 10Majavah: [C:03+2] P:toolforge: Install components-cli on bastions [puppet] - 10https://gerrit.wikimedia.org/r/1163367 (https://phabricator.wikimedia.org/T397718) (owner: 10Majavah) [14:50:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.17 [14:50:14] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.18 [14:51:11] (03CR) 10Andrew Bogott: [C:03+2] Neutron policy.yaml: update subnetpool rules [puppet] - 10https://gerrit.wikimedia.org/r/1163016 (owner: 10Andrew Bogott) [14:52:11] (03PS1) 10Jforrester: FunctionEvaluator.vue: prod bug - js error for functions with Typed list as input param [extensions/WikiLambda] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163390 (https://phabricator.wikimedia.org/T397682) [14:52:51] RESOLVED: [2x] SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [14:54:40] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists: Maillman mailing list: Emails not reaching destination - 
https://phabricator.wikimedia.org/T397642#10943218 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup https://grafana.wikimedia.org/d/GvuAmuuGk/mailman3?orgId=1&from=now-12h&to=now&timezo... [14:56:51] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.18 [14:56:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.19 [14:58:25] (03PS1) 10Jhancock.wm: Adding and updating sretest200X servers [puppet] - 10https://gerrit.wikimedia.org/r/1163392 (https://phabricator.wikimedia.org/T396365) [15:00:05] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1500). [15:02:55] (03PS20) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [15:03:02] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox_records Generate and push DNS records from Netbox data [15:03:16] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.19 [15:03:18] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.1a [15:03:39] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phorge Update [15:03:56] !log brennen@deploy1003 Started deploy [phabricator/deployment@adb2373]: test deploy phab2002 for T397726 [15:04:02] T397726: Deploy Phabricator/Phorge 2025-06-24 - https://phabricator.wikimedia.org/T397726 [15:04:38] !log brennen@deploy1003 Finished deploy [phabricator/deployment@adb2373]: test deploy phab2002 for T397726 (duration: 00m 41s) [15:05:00] !log aokoth@cumin1002 DONE (PASS) - Cookbook
sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phorge Update [15:05:14] !log brennen@deploy1003 Started deploy [phabricator/deployment@adb2373]: deploy phab1004 for T397726 [15:05:18] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox_records (exit_code=99) Generate and push DNS records from Netbox data [15:05:52] !log brennen@deploy1003 Finished deploy [phabricator/deployment@adb2373]: deploy phab1004 for T397726 (duration: 00m 38s) [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:23] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [15:09:54] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.1a [15:09:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.1b [15:14:20] (03PS1) 10Muehlenhoff: Remove external cloud sync from Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1163399 [15:15:13] (03CR) 10CDanis: [C:03+2] Add debmonitor-next.w.o [dns] - 10https://gerrit.wikimedia.org/r/1163363 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [15:15:18] (03PS21) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [15:15:21] (03PS3) 10Muehlenhoff: Add debmonitor-next.w.o [dns] - 10https://gerrit.wikimedia.org/r/1163363 (https://phabricator.wikimedia.org/T397696) [15:16:18] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [15:16:35] (03PS1) 10JMeybohm: 
k8s.wipe-cluster: Run puppet in batches of 50 [cookbooks] - 10https://gerrit.wikimedia.org/r/1163401 (https://phabricator.wikimedia.org/T397148) [15:16:36] (03PS1) 10JMeybohm: sre.wipe-cluster: Ask user to confirm target k8s version [cookbooks] - 10https://gerrit.wikimedia.org/r/1163402 (https://phabricator.wikimedia.org/T397148) [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:58] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.1b [15:17:01] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.1c [15:17:03] (03CR) 10CDanis: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1163363 (https://phabricator.wikimedia.org/T397696) (owner: 10Muehlenhoff) [15:17:17] (03Abandoned) 10Jhancock.wm: Adding and updating sretest200X servers [puppet] - 10https://gerrit.wikimedia.org/r/1163392 (https://phabricator.wikimedia.org/T396365) (owner: 10Jhancock.wm) [15:19:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163372 (https://phabricator.wikimedia.org/T388685) (owner: 10Joely Rooke WMDE) [15:19:51] !log cdanis@dns1004 START - running authdns-update [15:20:46] !log cdanis@dns1004 END - running authdns-update [15:21:02] oops [15:21:11] (03PS2) 10Muehlenhoff: Remove external cloud sync from Puppet 5 frontends [puppet] - 10https://gerrit.wikimedia.org/r/1163399 [15:21:11] need to merge the patch first :3 [15:21:14] !log cdanis@dns1004 START - running authdns-update [15:22:10] (03CR) 10Filippo Giunchedi: [C:03+2] 
Swap in frnetmon1002 and remove frnetmon1001 [puppet] - 10https://gerrit.wikimedia.org/r/1163044 (https://phabricator.wikimedia.org/T395831) (owner: 10Dwisehaupt) [15:22:13] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [15:22:14] !log cdanis@dns1004 END - running authdns-update [15:23:35] (03PS1) 10Scott French: wmnet: remove swift-r[ow] DYNA records and mock resources (1/3) [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) [15:23:44] (03PS2) 10Scott French: hieradata: remove swift-r[ow] from service catalog (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163397 (https://phabricator.wikimedia.org/T376237) [15:23:55] (03PS2) 10Scott French: conftool-data: remove swift-r[ow] discovery entities (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163398 (https://phabricator.wikimedia.org/T376237) [15:24:19] (03PS1) 10Stang: Fix missing Chinese translation related to temporary accounts [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163403 [15:25:00] (03PS2) 10Scott French: wmnet: remove swift-r[ow] DYNA records and mock resources (1/3) [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) [15:26:19] (03PS22) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [15:27:39] (03CR) 10Stang: "cherry-pick from 453c7ff0c and 94043223b" [extensions/WikimediaMessages] (wmf/1.45.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1163403 (owner: 10Stang) [15:28:24] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [15:28:52] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover, 13Patch-For-Review: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10943403 (10Scott_French) 
[15:29:14] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:drmrs and A:cp - 9.2.11 upgrade (T397456) [15:29:22] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [15:29:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.1c [15:29:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.1d [15:29:48] (03PS1) 10Ilias Sarantopoulos: ores-extension: enable revertrisk filter in UI for third batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) [15:32:17] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [15:32:40] (03CR) 10Muehlenhoff: "Can't make much sense of the PCC failure, the PCC states the catalogue didn't change... 
https://puppet-compiler.wmflabs.org/output/1163399" [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [15:34:58] (03PS1) 10Scott French: hieradata: remove swift-r[ow] SAN entries (cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/1163407 (https://phabricator.wikimedia.org/T376237) [15:35:18] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163407 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [15:35:55] 10SRE-swift-storage, 06serviceops, 07Datacenter-Switchover, 13Patch-For-Review: Turn down unused swift-r[ow] discovery services - https://phabricator.wikimedia.org/T376237#10943437 (10Scott_French) [15:36:34] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.1d [15:36:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.1e [15:37:20] !log dancy@deploy1003 Started scap sync-world: testing [15:40:21] !log dancy@deploy1003 Finished scap sync-world: testing (duration: 03m 01s) [15:41:06] (03PS1) 10Muehlenhoff: debmonitor_dev: Use the IP as django_mysql_db_host [puppet] - 10https://gerrit.wikimedia.org/r/1163410 [15:41:32] (03CR) 10Volans: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1163410 (owner: 10Muehlenhoff) [15:42:25] (03CR) 10Muehlenhoff: [C:03+2] debmonitor_dev: Use the IP as django_mysql_db_host [puppet] - 10https://gerrit.wikimedia.org/r/1163410 (owner: 10Muehlenhoff) [15:42:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.1e [15:43:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.1f [15:43:48] FIRING: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:48:55] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.1f [15:48:58] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.20 [15:49:02] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) (owner: 10Ilias Sarantopoulos) [15:49:09] (03CR) 10MVernon: [C:03+1] wmnet: remove swift-r[ow] DYNA records and mock resources (1/3) [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [15:49:29] (03PS2) 10Ilias Sarantopoulos: ores-extension: enable revertrisk filter in UI for third batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) [15:49:45] (03CR) 10MVernon: [C:03+1] hieradata: remove swift-r[ow] from service catalog (2/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163397 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [15:50:03] (03CR) 10MVernon: [C:03+1] conftool-data: remove swift-r[ow] discovery entities (3/3) [puppet] - 10https://gerrit.wikimedia.org/r/1163398 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [15:50:25] (03CR) 10MVernon: [C:03+1] hieradata: remove swift-r[ow] SAN entries (cleanup) [puppet] - 10https://gerrit.wikimedia.org/r/1163407 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [15:52:26] (03PS23) 10Cathal 
Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [15:53:48] RESOLVED: PuppetFailure: Puppet has failed on wdqs2023:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:57:45] (03CR) 10Ssingh: [C:03+1] wmnet: remove swift-r[ow] DYNA records and mock resources (1/3) [dns] - 10https://gerrit.wikimedia.org/r/1163396 (https://phabricator.wikimedia.org/T376237) (owner: 10Scott French) [15:58:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.20 [15:58:28] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.21 [15:59:33] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [16:00:05] jhathaway and moritzm: Time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. 
[16:05:36] (03PS24) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [16:08:43] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-upload_eqiad - 7.1.1-2~bpo11+wmf2 upgrade (T396581) [16:08:50] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [16:08:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.21 [16:08:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.22 [16:09:31] !log vgutierrez@cumin1002 END (FAIL) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=1) rolling upgrade of Varnish on A:cp-text_eqiad - 7.1.1-2~bpo11+wmf2 upgrade (T396581) [16:12:18] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [16:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [16:20:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.22 [16:20:02] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.23 [16:21:14] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp11[12,14].eqiad.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:21:19] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [16:22:36] (03PS1) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 
(https://phabricator.wikimedia.org/T397563) [16:23:01] (03CR) 10CI reject: [V:04-1] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro) [16:23:47] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp11[13,15].eqiad.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:24:46] (03PS1) 10Ebernhardson: cirrus: Start AB test of completion suggester fuzziness [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163415 (https://phabricator.wikimedia.org/T397732) [16:25:20] (03PS25) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [16:25:24] (03PS2) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [16:25:37] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox-records Generate and push DNS records from Netbox data [16:26:01] (03CR) 10CI reject: [V:04-1] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro) [16:26:24] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.dns.netbox-records (exit_code=99) Generate and push DNS records from Netbox data [16:28:43] (03PS3) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [16:30:02] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.23 [16:30:04] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.24 [16:32:31] !log brett@puppetserver1001 conftool action : 
set/pooled=yes; selector: name=cp1111.* [16:32:54] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [16:33:34] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp1110.* [16:41:45] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:42:25] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.24 [16:42:27] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:42:27] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.25 [16:44:35] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:44:42] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:44:50] (03PS1) 10Volans: kubernetes: add API to update data [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) [16:45:40] (03CR) 10Volans: "Tests yet to be writted, based on a payload of the format specified in:" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/1163420 (https://phabricator.wikimedia.org/T397696) (owner: 10Volans) [16:48:18] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [16:48:26] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [16:52:41] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp11[12,14].eqiad.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:52:46] T396581: varnish 7.1.1-2~bpo11+wmf1 crash - https://phabricator.wikimedia.org/T396581 [16:53:07] !log brett@cumin2002 END (PASS) - 
Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on P{cp11[13,15].eqiad.wmnet} and A:cp - Fix VSLbs() assert error and upgrade libvmod-wmfuniq to 0.2.0 (T396581) [16:53:54] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.25 [16:53:55] (03PS4) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [16:53:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.26 [16:59:41] (03CR) 10Scott French: "Thanks for the review, Effie!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162962 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [16:59:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:05] swfrench-wmf: Time to snap out of that daydream and deploy MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1700). 
[17:00:17] (03PS5) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [17:00:22] o/ [17:00:43] (03CR) 10CI reject: [V:04-1] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro) [17:02:39] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): pilot 5% of traffic on new httpd images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162962 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:03:41] (03PS6) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [17:04:18] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.26 [17:04:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.27 [17:04:27] (03Merged) 10jenkins-bot: mw-(api-ext|web): pilot 5% of traffic on new httpd images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1162962 (https://phabricator.wikimedia.org/T378128) (owner: 10Scott French) [17:05:29] (03PS7) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [17:05:52] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10943991 (10BTullis) Thanks @Jclark-ctr - I did the following. ` btullis@analytics1073:~$ sudo megacli -PdReplaceMissing -PhysDrv [32:12] -Array0 -row0 -a0... 
[17:08:30] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:08:47] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [17:09:08] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [17:09:09] !log swfrench@deploy1003 helmfile [codfw] START helmfile.d/services/mw-web: apply [17:09:26] !log swfrench@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [17:11:02] (03PS8) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [17:11:18] !log serving ~ 5% of mw-api-ext and mw-web traffic in codfw via bookworm-based httpd image - T378128 [17:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:24] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:11:28] (03CR) 10CI reject: [V:04-1] p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro) [17:12:19] (03PS9) 10David Caro: p:toolforge::prometheus: add enable_query_log option [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) [17:13:28] RECOVERY - MegaRAID on analytics1073 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:13:54] (03CR) 10David Caro: [V:03+1] "Deployed and tested in tools" [puppet] - 10https://gerrit.wikimedia.org/r/1163414 (https://phabricator.wikimedia.org/T397563) (owner: 10David Caro) [17:13:57] !log cgoubert@cumin1003 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) 
rolling reboot on A:wikikube-worker-codfw [17:13:57] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.27 [17:14:00] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.28 [17:14:33] FIRING: [10x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:18:30] FIRING: [11x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:21:44] (03PS1) 10Hnowlan: mobileapps: bump memory limits without scaling down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163429 (https://phabricator.wikimedia.org/T397750) [17:21:48] 10SRE-SLO, 10EditCheck, 10Editing-team (Kanban Board), 07Essential-Work, 13Patch-For-Review: Fix EditCheck's SLO metrics and create a dashboard for it - https://phabricator.wikimedia.org/T395444#10944051 (10ppelberg) Per offline team meeting, next step is for @elukey to review the changes @Dlynch is prop... 
[17:22:19] (03PS26) 10Cathal Mooney: WIP: netbox-snippets test cookbook to get started [cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 [17:23:55] (03PS1) 10Dwisehaupt: icinga: Add frban1002 to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1163430 (https://phabricator.wikimedia.org/T395951) [17:24:17] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [17:24:19] (03CR) 10Cwhite: [C:03+2] logstash: provide default for when age field is nil [puppet] - 10https://gerrit.wikimedia.org/r/1163364 (owner: 10Cwhite) [17:24:36] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [17:24:37] !log swfrench@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-web: apply [17:24:53] !log swfrench@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [17:25:02] !log serving ~ 5% of mw-api-ext and mw-web traffic in eqiad via bookworm-based httpd image - T378128 [17:25:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:07] T378128: Upgrade httpd images to bullseye or bookworm - https://phabricator.wikimedia.org/T378128 [17:25:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.28 [17:25:38] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.29 [17:27:27] (03CR) 10Scott French: [C:03+1] mobileapps: bump memory limits without scaling down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163429 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [17:27:57] (03CR) 10Hnowlan: [C:03+2] mobileapps: bump memory limits without scaling down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163429 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [17:28:47] (03CR) 10CI reject: [V:04-1] WIP: netbox-snippets test cookbook to get started [cookbooks] - 
10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [17:29:34] (03Merged) 10jenkins-bot: mobileapps: bump memory limits without scaling down [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163429 (https://phabricator.wikimedia.org/T397750) (owner: 10Hnowlan) [17:29:52] jouncebot: nowandnext [17:29:52] For the next 0 hour(s) and 30 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1700) [17:29:52] In 0 hour(s) and 30 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1800) [17:30:03] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:30:31] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:30:58] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [17:33:10] ^^ we need to fix this alert, it seems to be considering CODFW as part of its calculation even though it's depooled [17:34:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:36:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.29 [17:36:56] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.2a [17:37:57] (03PS1) 10Bking: rdf-streaming-updater: point to last valid checkpoint for restore 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1163434 (https://phabricator.wikimedia.org/T397719) [17:40:31] (03CR) 10Bking: [C:03+2] rdf-streaming-updater: point to last valid checkpoint for restore [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163434 (https://phabricator.wikimedia.org/T397719) (owner: 10Bking) [17:41:22] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [17:41:29] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [17:42:17] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:42:21] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:42:39] !log bking@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [17:42:47] !log bking@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [17:42:48] (03PS1) 10Hnowlan: Revert "mobileapps: bump memory limits without scaling down" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163435 [17:44:33] (03PS1) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) [17:45:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [17:45:59] (03CR) 10Scott French: [C:03+1] Revert "mobileapps: bump memory limits without scaling down" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163435 (owner: 10Hnowlan) [17:46:42] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2016.codfw.wmnet, wdqs2017.codfw.wmnet 
are marked down but pooled: wdqs-ssl_443: Servers wdqs2016.codfw.wmnet, wdqs2017.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2016.codfw.wmnet, wdqs2017.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:47:03] ^^ looking into this now [17:47:04] hello [17:47:04] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - wdqs-heavy-queries_8888: Servers wdqs2016.codfw.wmnet are marked down but pooled: wdqs-ssl_443: Servers wdqs2016.codfw.wmnet are marked down but pooled: wdqs_80: Servers wdqs2016.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:47:08] thanks inflatador! [17:47:25] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [17:47:28] sukhe np, the service is failed in CODFW but CODFW is depooled anyway [17:47:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.2a [17:47:36] (03PS1) 10Clare Ming: xLab: Deploy v0.7.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163439 (https://phabricator.wikimedia.org/T397465) [17:47:38] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.2b [17:47:45] not sure how to ack/silence that type of alert but happy to do so [17:48:36] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.06.13 - 2025.07.04): Degraded RAID on analytics1073 - https://phabricator.wikimedia.org/T397231#10944151 (10BTullis) 05Open→03Resolved [17:48:40] inflatador: the hosts are pooled for the service though, that's what the error is about [17:48:50] basically: the hosts are down but marked as pooled [17:48:54] {"wdqs2016.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=wdqs,service=wdqs-heavy-queries"} [17:49:05] so you should 
depool them here [17:49:07] and then the error goes away [17:49:38] sukhe ACK, that's a different problem then. They are severely lagged but should still be responding. Regardless, I will depool at the host level and keep troubleshooting [17:49:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:50:00] cool, please ping if we can help [17:50:03] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:50:04] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:50:08] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:50:19] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:50:42] RECOVERY - PyBal backends health check on lvs2014 is 
OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:51:03] interesting [17:51:12] * inflatador hasn't made any changes yet [17:52:13] FIRING: [6x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:52:40] (03CR) 10Hnowlan: [C:03+2] Revert "mobileapps: bump memory limits without scaling down" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163435 (owner: 10Hnowlan) [17:52:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2017:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:53:41] unsure if related but there's some new RDF jobs appearing in codfw at an unusually high rate https://grafana-rw.wikimedia.org/d/000300/change-propagation?forceLogin&from=now-1h&orgId=1&refresh=1m&timezone=utc&to=now&var-dc=000000017&viewPanel=panel-28 [17:54:22] (03Merged) 10jenkins-bot: Revert "mobileapps: bump memory limits without scaling down" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163435 (owner: 10Hnowlan) [17:54:37] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [17:55:14] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [17:57:12] hnowlan yeah, the rdf streaming updater died around 1100UTC yesterday, I just fixed it [17:57:13] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1012:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [17:57:29] inflatador: ah, cool [17:57:30] if it is causing problems LMK, I can stop some of the CODFW hosts [17:57:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [17:58:35] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.2b [17:58:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.2c [18:00:04] jeena and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1800) [18:00:32] (03PS2) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) [18:00:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://wikifeeds.svc.eqiad.wmnet:4101 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [18:02:04] inflatador: nah no worries at all, just happened to be looking at the changeprop graphs [18:03:50] (03CR) 10Cathal Mooney: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) 
[18:05:12] cool, I'm going to enjoy watching the WDQS lag graphs for a couple hrs...hopefully the lag will be better by then ;) [18:05:20] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Replace date/stamp headers with vars [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:09:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.2c [18:09:24] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.2d [18:13:31] FIRING: [9x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:19:18] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.2d [18:19:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.2e [18:20:28] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2026:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:20:33] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2023:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:30:41] FIRING: SystemdUnitFailed: 
dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:30:54] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.2e [18:30:57] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.2f [18:32:15] (03CR) 10Volans: Netbox hosts: add netbox-dns reposync repo so it is available (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163382 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [18:35:10] (03Abandoned) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [18:36:30] (03CR) 10Volans: "Indeed much DRY-er, and consistent with my comment in the other one. 
You read my mind :) I see though that PCC shows resources only in the" [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [18:37:10] (03Restored) 10Cathal Mooney: Netbox hosts: ensure reposync repos are set up to match cumin hosts [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [18:40:03] (03CR) 10Volans: sre.wipe-cluster: Ask user to confirm target k8s version (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163402 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [18:40:21] (03CR) 10Cathal Mooney: "Hey yeah I think the issue is that 'profile::spicerack::reposync::repos' is defined in hieradata/role/common/cluster/management.yaml, and " [puppet] - 10https://gerrit.wikimedia.org/r/1163436 (https://phabricator.wikimedia.org/T362985) (owner: 10Cathal Mooney) [18:41:47] (03CR) 10Volans: k8s.wipe-cluster: Run puppet in batches of 50 (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1163401 (https://phabricator.wikimedia.org/T397148) (owner: 10JMeybohm) [18:42:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.2f [18:42:03] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.30 [18:43:39] 06SRE, 06SRE Observability: monitoring ACKs should be delivered via SMS - https://phabricator.wikimedia.org/T396894#10944324 (10herron) 05Open→03Stalled There doesn't appear to be a feature to generate a notification (push/sms/email/otherwise) on the acknowledge action in splunk oncall. There is the abili... 
[18:43:49] 06SRE, 06SRE Observability: monitoring ACKs should be delivered via SMS - https://phabricator.wikimedia.org/T396894#10944327 (10herron) p:05Triage→03Low [18:47:11] (03PS1) 10TrainBranchBot: group0 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163447 (https://phabricator.wikimedia.org/T392177) [18:47:15] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163447 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [18:48:14] (03Merged) 10jenkins-bot: group0 to 1.45.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163447 (https://phabricator.wikimedia.org/T392177) (owner: 10TrainBranchBot) [18:48:44] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.30 [18:48:47] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.31 [18:49:12] (03PS9) 10BryanDavis: Avoid using wikitech dblist in configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [18:50:25] (03CR) 10JHathaway: Remove external cloud sync from Puppet 5 frontends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [18:50:36] (03CR) 10BryanDavis: "{{Done}}" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137266 (owner: 10Ladsgroup) [18:57:37] (03CR) 10Volans: [C:03+1] "Great work, LGTM, minor comments on comments only :D" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [18:58:17] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.31 [18:58:19] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.32 [18:58:24] !log jhuneidi@deploy1003 rebuilt and synchronized wikiversions files: group0 to 
1.45.0-wmf.7 refs T392177 [18:58:30] T392177: 1.45.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T392177 [19:00:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.63s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:01:28] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2018:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:05:00] !log cdobbins@cumin2002 sudo -i cookbook sre.cdn.roll-upgrade-ats --query 'A:cp-eqsin' --task-id T397456 --reason '9.2.11 upgrade' --version '9.2.11' [19:05:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:05] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [19:05:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 1.63s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:05:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [19:05:44] !log bking@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=wdqs,name=codfw [19:06:38] !log cdobbins@cumin2002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade of ATS on A:cp-eqsin and A:cp - 9.2.11 upgrade (T397456) [19:06:44] (03CR) 10Muehlenhoff: Remove external cloud sync from Puppet 5 frontends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [19:07:56] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.32 [19:07:59] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.33 [19:09:32] (03PS1) 10Jsn.sherman: Undeploy remaining Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163451 (https://phabricator.wikimedia.org/T396250) [19:09:46] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, June 24 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163451 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [19:12:13] (03CR) 10JHathaway: [C:03+1] Remove external cloud sync from Puppet 5 frontends (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1163399 (owner: 10Muehlenhoff) [19:19:12] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.33 [19:19:15] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.34 [19:20:28] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:28:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.34 [19:28:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.35 [19:38:28] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:39:00] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.35 [19:39:03] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.36 [19:44:01] jouncebot: nowandnext [19:44:01] For the next 0 hour(s) and 15 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T1800) [19:44:01] In 0 hour(s) and 15 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T2000) [19:44:59] jeena: do you need the remaining window, or may I do a deploy? [19:45:18] zabe: you can deploy [19:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [19:46:14] thanks! 
[19:46:21] (03CR) 10Zabe: [C:03+2] Stop setting wgRevisionSlotsCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159552 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [19:46:33] !log zabe@deploy1003:~$ mwscript extensions/WikimediaMaintenance/migrateESRefToContentTableStage2.php wikidatawiki --delete /home/zabe/afl_text_table_deletedump/wikidatawiki --sleep 0.5 # T381599 [19:46:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:39] T381599: Migrate current references of text table rows from afl_var_dump - https://phabricator.wikimedia.org/T381599 [19:47:16] (03CR) 10AikoChou: [C:03+1] ores-extension: enable revertrisk filter in UI for third batch [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163405 (https://phabricator.wikimedia.org/T395824) (owner: 10Ilias Sarantopoulos) [19:47:19] (03Merged) 10jenkins-bot: Stop setting wgRevisionSlotsCacheExpiry [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1159552 (https://phabricator.wikimedia.org/T183490) (owner: 10Zabe) [19:47:57] !log zabe@deploy1003 Started scap sync-world: Backport for [[gerrit:1159552|Stop setting wgRevisionSlotsCacheExpiry (T183490)]] [19:48:02] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [19:49:53] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.36 [19:49:55] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.37 [19:50:16] !log zabe@deploy1003 zabe: Backport for [[gerrit:1159552|Stop setting wgRevisionSlotsCacheExpiry (T183490)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:51:57] !log zabe@deploy1003 zabe: Continuing with sync [19:56:14] zabe: yw! 
[19:59:25] !log zabe@deploy1003 Finished scap sync-world: Backport for [[gerrit:1159552|Stop setting wgRevisionSlotsCacheExpiry (T183490)]] (duration: 11m 28s) [19:59:30] T183490: MCR schema migration stage 4: Migrate External Store URLs (wmf production) - https://phabricator.wikimedia.org/T183490 [19:59:36] !log jhathaway@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts an-conf1002.eqiad.wmnet [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T2000). nyaa~ [20:00:05] Kizule and JSherman: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:14] !log jhathaway@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts an-conf1002.eqiad.wmnet [20:00:27] here [20:00:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.37 [20:00:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.38 [20:02:26] Kizule: around? 
[20:02:51] I may just get my config change going; it's a survey undeploy, so pretty low risk [20:03:48] ok, I'm going to start [20:03:51] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163451 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:04:51] (03Merged) 10jenkins-bot: Undeploy remaining Patroller Tools surveys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163451 (https://phabricator.wikimedia.org/T396250) (owner: 10Jsn.sherman) [20:05:14] !log jsn@deploy1003 Started scap sync-world: Backport for [[gerrit:1163451|Undeploy remaining Patroller Tools surveys (T396250)]] [20:05:20] T396250: Deploy remaining Patroller Tools surveys - https://phabricator.wikimedia.org/T396250 [20:05:31] (03CR) 10Volans: "My thoughts inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1163360 (owner: 10Ayounsi) [20:07:31] !log jsn@deploy1003 jsn: Backport for [[gerrit:1163451|Undeploy remaining Patroller Tools surveys (T396250)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[20:08:18] !log jsn@deploy1003 jsn: Continuing with sync [20:10:42] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 627.35 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [20:11:19] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.38 [20:11:21] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.39 [20:13:29] FIRING: [3x] SLOMetricAbsent: citoid-latency codfw - https://slo.wikimedia.org/?search=citoid-latency - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [20:15:33] !log jsn@deploy1003 Finished scap sync-world: Backport for [[gerrit:1163451|Undeploy remaining Patroller Tools surveys (T396250)]] (duration: 10m 18s) [20:15:39] T396250: Deploy remaining Patroller Tools surveys - https://phabricator.wikimedia.org/T396250 [20:17:09] Kizule: all yours (when you get here) [20:18:32] (03CR) 10Volans: "Did a first quick pass, left some comments inline on the approach." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1163318 (owner: 10Cathal Mooney) [20:19:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.39 [20:20:01] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.3a [20:28:39] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.3a [20:28:42] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.3b [20:31:05] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: Maillman mailing list: Emails not reaching destination - https://phabricator.wikimedia.org/T397642#10944818 (10Peachey88) [20:37:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.3b [20:38:02] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.3c [20:49:09] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.3c [20:49:12] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.3d [20:54:31] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade of ATS on A:cp-eqsin and A:cp - 9.2.11 upgrade (T397456) [20:54:31] (03PS1) 10Eevans: cassandra-dev2001: testing new data file directory names [puppet] - 10https://gerrit.wikimedia.org/r/1163466 (https://phabricator.wikimedia.org/T391544) [20:54:36] T397456: Upgrade to ATS 9.2.11 - https://phabricator.wikimedia.org/T397456 [20:55:17] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: testing new data file directory names [puppet] - 10https://gerrit.wikimedia.org/r/1163466 
(https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [20:58:36] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.3d [20:58:38] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.3e [20:58:43] (03PS1) 10Eevans: cassandra-dev2001: actually update all directories [puppet] - 10https://gerrit.wikimedia.org/r/1163468 (https://phabricator.wikimedia.org/T391544) [20:59:08] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:59:43] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:45] (03CR) 10Eevans: [C:03+2] cassandra-dev2001: actually update all directories [puppet] - 10https://gerrit.wikimedia.org/r/1163468 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250624T2100) [21:00:08] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:08] PROBLEM - nova-compute proc minimum on cloudvirt1063 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:09] PROBLEM - nova-compute proc minimum on cloudvirt1045 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:09] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:09] PROBLEM - nova-compute proc minimum on cloudvirt1072 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:18] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:18] PROBLEM - nova-compute proc minimum on cloudvirt1069 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:20] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:20] PROBLEM - nova-compute proc minimum on cloudvirt1071 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:20] PROBLEM - nova-compute proc minimum on cloudvirt1068 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:21] (03PS1) 10Aude: Fix missing title on charts and add tests [extensions/Chart] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163469 (https://phabricator.wikimedia.org/T397755) [21:01:21] PROBLEM - nova-compute 
proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:24] PROBLEM - nova-compute proc minimum on cloudvirt1066 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:24] PROBLEM - nova-compute proc minimum on cloudvirt1064 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:34] PROBLEM - nova-compute proc minimum on cloudvirt1070 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:40] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:41] PROBLEM - nova-compute proc minimum on cloudvirt1075 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:41] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:41] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:42] PROBLEM - nova-compute proc minimum on cloudvirt1076 is CRITICAL: PROCS CRITICAL: 0 processes with regex args 
^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:43] PROBLEM - nova-compute proc minimum on cloudvirt1073 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:01:44] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:02:18] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:02:20] RECOVERY - nova-compute proc minimum on cloudvirt1074 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:03:20] PROBLEM - nova-compute proc minimum on cloudvirt1074 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:04:28] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/Chart] (wmf/1.45.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1163469 (https://phabricator.wikimedia.org/T397755) (owner: 10Aude) [21:04:40] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex 
args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:08] RECOVERY - nova-compute proc minimum on cloudvirt1063 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:08] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:09] RECOVERY - nova-compute proc minimum on cloudvirt1045 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:09] RECOVERY - nova-compute proc minimum on cloudvirt1072 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:18] RECOVERY - nova-compute proc minimum on cloudvirt1069 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:20] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:20] RECOVERY - nova-compute proc minimum on cloudvirt1071 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:21] RECOVERY - nova-compute proc minimum on cloudvirt1068 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:21] RECOVERY - nova-compute proc minimum on cloudvirt1074 is 
OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:24] RECOVERY - nova-compute proc minimum on cloudvirt1066 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:24] RECOVERY - nova-compute proc minimum on cloudvirt1064 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:34] RECOVERY - nova-compute proc minimum on cloudvirt1070 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:40] RECOVERY - nova-compute proc minimum on cloudvirt1075 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:41] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:41] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:41] RECOVERY - nova-compute proc minimum on cloudvirt1076 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:42] RECOVERY - nova-compute proc minimum on cloudvirt1073 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:05:44] RECOVERY - 
nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:08:40] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.3e [21:08:43] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.3f [21:18:13] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.3f [21:18:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.40 [21:26:36] (03PS1) 10Ryan Kemper: wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163475 (https://phabricator.wikimedia.org/T397719) [21:28:26] (03CR) 10Bking: [C:03+1] wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163475 (https://phabricator.wikimedia.org/T397719) (owner: 10Ryan Kemper) [21:30:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.40 [21:31:02] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.41 [21:31:25] (03CR) 10Ryan Kemper: [C:03+2] wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163475 (https://phabricator.wikimedia.org/T397719) (owner: 10Ryan Kemper) [21:33:25] (03Merged) 10jenkins-bot: wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163475 (https://phabricator.wikimedia.org/T397719) (owner: 10Ryan Kemper) [21:34:49] !log ryankemper@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [21:34:57] !log ryankemper@deploy1003 helmfile [codfw] DONE 
helmfile.d/services/rdf-streaming-updater: apply [21:37:13] (03CR) 10Cwhite: [C:03+2] logstash: add filter_on_template_v2 [puppet] - 10https://gerrit.wikimedia.org/r/1154348 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:40:28] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Thu 10 Jul 2025 09:40:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [21:43:35] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.41 [21:43:37] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.42 [21:45:43] (03PS1) 10Ryan Kemper: wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163478 (https://phabricator.wikimedia.org/T397719) [21:48:02] (03PS2) 10Ryan Kemper: wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163478 (https://phabricator.wikimedia.org/T397719) [21:49:09] (03CR) 10Ryan Kemper: [C:03+2] wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163478 (https://phabricator.wikimedia.org/T397719) (owner: 10Ryan Kemper) [21:50:39] (03Merged) 10jenkins-bot: wcqs: restore from checkpoint [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163478 (https://phabricator.wikimedia.org/T397719) (owner: 10Ryan Kemper) [21:51:06] !log ryankemper@deploy1003 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [21:51:12] !log ryankemper@deploy1003 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [21:53:14] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.42 [21:53:16] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.43 [21:53:48] (03CR) 
10Cwhite: [C:03+2] logstash: enable filter_ecs_cleanup_v2 on beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163386 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [21:57:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:02:58] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:03:22] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.43 [22:03:25] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.44 [22:04:16] (03PS9) 10JHathaway: Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [22:04:17] (03PS3) 10JHathaway: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [22:04:56] (03CR) 10JHathaway: Netbox: add primary_mac_address get/set (032 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [22:11:55] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 26 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) (owner: 
10Zoranzoki21) [22:12:44] (03CR) 10CI reject: [V:04-1] Netbox: add primary_mac_address get/set [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [22:13:30] FIRING: [8x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:13:42] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.44 [22:13:44] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:13:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.45 [22:13:53] (03CR) 10CI reject: [V:04-1] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [22:17:12] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, June 25 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163365 (https://phabricator.wikimedia.org/T392363) (owner: 10Zoranzoki21) [22:18:15] (03PS1) 10ZhaoFJx: zhwiki: Permissions change for abusefilter groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1163483 (https://phabricator.wikimedia.org/T397788) [22:24:28] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.45 [22:24:30] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.46 [22:25:44] (03PS4) 10JHathaway: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 
(owner: 10Ayounsi) [22:26:06] (03CR) 10JHathaway: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1162869 (owner: 10Ayounsi) [22:30:41] FIRING: SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetserver2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:34:52] (03CR) 10CI reject: [V:04-1] Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [22:35:56] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.46 [22:35:59] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.47 [22:39:28] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:12] (03PS1) 10Cwhite: logstash: explicitly define allowed numeric types [puppet] - 10https://gerrit.wikimedia.org/r/1163486 (https://phabricator.wikimedia.org/T234565) [22:44:28] FIRING: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:46:52] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.47 [22:46:55] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.48 [22:48:05] (03CR) 10Santiago Faci: [C:03+2] xLab: Deploy v0.7.3 release to staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1163439 (https://phabricator.wikimedia.org/T397465) (owner: 10Clare Ming) [22:49:37] (03Merged) 10jenkins-bot: xLab: Deploy v0.7.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1163439 (https://phabricator.wikimedia.org/T397465) (owner: 10Clare Ming) [22:56:41] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.48 [22:56:44] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.49 [22:58:24] (03PS5) 10JHathaway: Redfish: add get_primary_mac() [software/spicerack] - 10https://gerrit.wikimedia.org/r/1163314 (owner: 10Ayounsi) [23:02:58] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:07:32] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.49 [23:07:34] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4a [23:18:30] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4a [23:18:33] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4b [23:31:05] (03PS1) 10Cwhite: logstash: temporarily remove filter_on_template_v2 from beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163489 (https://phabricator.wikimedia.org/T234565) [23:31:10] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of 
wikipedia-commons-local-thumb.4b [23:31:13] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4c [23:31:18] (03PS2) 10Cwhite: logstash: temporarily remove filter_on_template_v2 from beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163489 (https://phabricator.wikimedia.org/T234565) [23:35:00] (03CR) 10Cwhite: [C:03+2] logstash: temporarily remove filter_on_template_v2 from beta-logs [puppet] - 10https://gerrit.wikimedia.org/r/1163489 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [23:38:32] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1163490 [23:38:32] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1163490 (owner: 10TrainBranchBot) [23:41:52] !log mvernon@cumin1002 END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4c [23:41:54] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4d [23:45:41] FIRING: [4x] SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [23:47:24] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:48:45] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2006-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:49:06] !log jhancock@cumin1003 START - Cookbook sre.hosts.provision for host cloudcephosd2007-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [23:52:40] !log mvernon@cumin1002 
END (PASS) - Cookbook sre.swift.check-dbs (exit_code=0) Checking container DBs of wikipedia-commons-local-thumb.4d [23:52:43] !log mvernon@cumin1002 START - Cookbook sre.swift.check-dbs Checking container DBs of wikipedia-commons-local-thumb.4e [23:54:09] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1163490 (owner: 10TrainBranchBot) [23:59:21] !log jhancock@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephosd2005-dev.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED