[00:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1151390 [00:08:34] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1151390 (owner: 10TrainBranchBot) [00:11:04] PROBLEM - MariaDB Replica Lag: s1 on db2141 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.33 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:18:56] (03CR) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [00:20:14] (03PS3) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) [00:26:08] (03PS4) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) [00:29:13] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1151390 (owner: 10TrainBranchBot) [00:34:31] (03CR) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [00:43:34] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:57:06] FIRING: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [01:02:06] RESOLVED: MediaWikiLoginFailures: Elevated MediaWiki centrallogin failures (centralauth_error_badtoken) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLoginFailures [02:05:04] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:21:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:30:42] RECOVERY - Disk space on restbase1031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [04:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:38:13] (03PS1) 10KartikMistry: Update cxserver to 2025-05-28-042852-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151408 (https://phabricator.wikimedia.org/T387229) [04:43:34] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on 
apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:42] PROBLEM - Disk space on restbase1031 is CRITICAL: DISK CRITICAL - free space: /srv/sdc4 66242 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [05:07:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:20:42] RECOVERY - Disk space on restbase1031 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=restbase1031&var-datasource=eqiad+prometheus/ops [05:43:58] (03PS1) 10Giuseppe Lavagetto: external_clouds_vendors: add bingbot ranges [puppet] - 10https://gerrit.wikimedia.org/r/1151412 [05:47:26] (03CR) 10Giuseppe Lavagetto: [C:03+2] "I didn't see this patch, and created one which includes a further range we need. I'll merge this one and rebase mine on top of it." 
[puppet] - 10https://gerrit.wikimedia.org/r/1151277 (https://phabricator.wikimedia.org/T395358) (owner: 10Fabfur) [05:48:52] (03PS2) 10Giuseppe Lavagetto: external_clouds_vendors: add bingbot ranges [puppet] - 10https://gerrit.wikimedia.org/r/1151412 [05:50:24] (03CR) 10Giuseppe Lavagetto: [C:03+2] external_clouds_vendors: add bingbot ranges [puppet] - 10https://gerrit.wikimedia.org/r/1151412 (owner: 10Giuseppe Lavagetto) [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T0600) [06:02:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:03:17] (03PS1) 10Marostegui: db2186: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151416 (https://phabricator.wikimedia.org/T394884) [06:04:34] (03CR) 10Marostegui: [C:03+2] db2186: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151416 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [06:05:34] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:05:50] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:06:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76542 and previous config saved to /var/cache/conftool/dbconfig/20250528-060608-root.json [06:08:24] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:08:40] RECOVERY - mailman list info on lists1004 
is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.190 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:10:17] FIRING: [2x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1017:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:11:28] (03PS2) 10Ayounsi: Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) [06:13:05] (03CR) 10CI reject: [V:04-1] Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [06:14:45] (03PS3) 10Ayounsi: Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) [06:15:17] RESOLVED: [2x] ProbeDown: Service wdqs1017:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1017:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:16:02] (03CR) 10Ayounsi: "From Cathal's suggestion I bumped the duration the peer must be idle before alerting (from 2min to 10min)" [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [06:16:10] (03CR) 10Ayounsi: [C:03+2] Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [06:17:22] (03Merged) 10jenkins-bot: Add alerting for idle peering BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151235 (https://phabricator.wikimedia.org/T388641) (owner: 
10Ayounsi) [06:19:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10862103 (10ayounsi) @Jhancock.wm did the PSU arrive? Can you set it up asap ? thx [06:21:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76543 and previous config saved to /var/cache/conftool/dbconfig/20250528-062113-root.json [06:21:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:22:13] Deploying cxserver.. [06:22:40] (03CR) 10KartikMistry: [C:03+2] Update cxserver to 2025-05-28-042852-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151408 (https://phabricator.wikimedia.org/T387229) (owner: 10KartikMistry) [06:22:48] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10862105 (10SLyngshede-WMF) We are looking into some sort of central sign out, but I can't really say how that would work... 
[06:22:57] 06SRE, 10SRE-Access-Requests, 06Infrastructure-Foundations, 10netbox: Selena can't see objects in Netbox despite having wmf group membership - https://phabricator.wikimedia.org/T395172#10862109 (10SLyngshede-WMF) 05Open→03Resolved [06:23:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:24:11] (03Merged) 10jenkins-bot: Update cxserver to 2025-05-28-042852-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151408 (https://phabricator.wikimedia.org/T387229) (owner: 10KartikMistry) [06:28:09] (03PS1) 10Slyngshede: Release version 0.1.12 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) [06:28:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.codfw.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [06:35:50] !log kartik@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply [06:36:13] !log kartik@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply [06:36:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76544 and previous config saved to /var/cache/conftool/dbconfig/20250528-063618-root.json [06:41:32] !log kartik@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply [06:42:05] !log kartik@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply [06:42:44] !log kartik@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: 
apply [06:43:16] !log kartik@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply [06:44:01] !log Updated cxserver to 2025-05-28-042852-production (T387229, T395259) [06:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:44:05] T387229: Unhandled Promise Rejection when MediaWiki API error returns 503 - https://phabricator.wikimedia.org/T387229 [06:44:06] T395259: Fix cxserver CI errors related to MTClient - https://phabricator.wikimedia.org/T395259 [06:51:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76545 and previous config saved to /var/cache/conftool/dbconfig/20250528-065124-root.json [06:55:58] (03PS2) 10Ilias Sarantopoulos: Revert^2 "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151276 (https://phabricator.wikimedia.org/T382171) [07:00:04] Amir1, Urbanecm, and awight: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T0700). [07:00:04] isaranto: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:19] o/ [07:02:58] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar: Requesting access to analytics-privatedata-users for Neslihan_Turan_WMDE - https://phabricator.wikimedia.org/T394395#10862202 (10WMDECyn) Approved from our end [07:03:04] I'm deploying... 
[07:03:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by isaranto@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151276 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [07:04:25] (03Merged) 10jenkins-bot: Revert^2 "ores-extension: enable ores extention UI in idwiki" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151276 (https://phabricator.wikimedia.org/T382171) (owner: 10Ilias Sarantopoulos) [07:05:08] !log isaranto@deploy1003 Started scap sync-world: Backport for [[gerrit:1151276|Revert^2 "ores-extension: enable ores extention UI in idwiki" (T382171)]] [07:05:13] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [07:06:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76546 and previous config saved to /var/cache/conftool/dbconfig/20250528-070629-root.json [07:07:19] !log isaranto@deploy1003 isaranto: Backport for [[gerrit:1151276|Revert^2 "ores-extension: enable ores extention UI in idwiki" (T382171)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[07:10:09] * isaranto is QAing db entries vs UI [07:10:43] (03CR) 10Muehlenhoff: doc: add php8.1 support for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [07:11:42] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) (owner: 10Slyngshede) [07:17:21] !log isaranto@deploy1003 isaranto: Continuing with sync [07:17:38] (03PS1) 10Ayounsi: Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) [07:18:52] (03CR) 10CI reject: [V:04-1] Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [07:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76547 and previous config saved to /var/cache/conftool/dbconfig/20250528-072135-root.json [07:22:46] (03CR) 10Hashar: [C:04-1] "This has squashed the patch I wrote and attached at T390070#10817176 and thus mix up a fix and the release. 
I will apply my patch and reba" [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) (owner: 10Slyngshede) [07:23:10] (03PS2) 10Hashar: Release version 0.1.12 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) (owner: 10Slyngshede) [07:23:10] (03PS1) 10Hashar: Block Gerrit users with uid rather than cn [software/bitu] - 10https://gerrit.wikimedia.org/r/1151552 (https://phabricator.wikimedia.org/T390070) [07:24:28] !log isaranto@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151276|Revert^2 "ores-extension: enable ores extention UI in idwiki" (T382171)]] (duration: 19m 19s) [07:24:33] T382171: Install ORES extension on idwiki - https://phabricator.wikimedia.org/T382171 [07:25:42] (03PS3) 10Ryan Kemper: relforge: disable monitoring notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) (owner: 10Bking) [07:26:03] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1151552 (https://phabricator.wikimedia.org/T390070) (owner: 10Hashar) [07:26:08] (03CR) 10Ryan Kemper: [C:03+1] relforge: disable monitoring notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) (owner: 10Bking) [07:26:11] (03CR) 10Slyngshede: [C:03+2] Block Gerrit users with uid rather than cn [software/bitu] - 10https://gerrit.wikimedia.org/r/1151552 (https://phabricator.wikimedia.org/T390070) (owner: 10Hashar) [07:26:47] (03CR) 10Hashar: [C:03+1] Block Gerrit users with uid rather than cn [software/bitu] - 10https://gerrit.wikimedia.org/r/1151552 (https://phabricator.wikimedia.org/T390070) (owner: 10Hashar) [07:26:58] hurray, I'm don with my backport all good [07:28:27] (03Merged) 10jenkins-bot: Block Gerrit users with uid rather than cn [software/bitu] - 10https://gerrit.wikimedia.org/r/1151552 (https://phabricator.wikimedia.org/T390070) (owner: 
10Hashar) [07:28:40] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) (owner: 10Slyngshede) [07:30:04] (03CR) 10Brouberol: [C:03+2] Airflow: don't deploy the plain envoy service in a devenv [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151229 (owner: 10Brouberol) [07:31:37] 06SRE, 06Data-Engineering, 10LDAP-Access-Requests, 13Patch-For-Review: Grant Access to Product's Superset & Turnilo for SKivlehan - https://phabricator.wikimedia.org/T393626#10862256 (10MoritzMuehlenhoff) 05In progress→03Open a:05spatton→03Arnoldokoth Reassigning to the current SRE on Clinic Duty,... [07:31:58] (03CR) 10Slyngshede: [C:03+2] Release version 0.1.12 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) (owner: 10Slyngshede) [07:34:52] (03Merged) 10jenkins-bot: Release version 0.1.12 [software/bitu] - 10https://gerrit.wikimedia.org/r/1151513 (https://phabricator.wikimedia.org/T390070) (owner: 10Slyngshede) [07:36:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76548 and previous config saved to /var/cache/conftool/dbconfig/20250528-073641-root.json [07:38:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10862304 (10Stevemunene) We no longer have any under replicated blocks on the active master {F60752845} https://grafana.... 
[07:41:34] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413 (10Joe) 03NEW [07:43:19] (03PS1) 10Elukey: admin_ng: Update knative-serving image versions for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) [07:43:20] (03PS1) 10Elukey: admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) [07:44:23] (03CR) 10Elukey: "Safe to be deployed anytime, it will just roll out new knative images without any new settings." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [07:44:49] (03PS1) 10Vgutierrez: hiera: Enable edge uniques in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1151598 (https://phabricator.wikimedia.org/T391411) [07:44:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10862324 (10Stevemunene) disabled puppet and unmounted the drives on both ` stevemunene@an-worker1148:~$ sudo disable-p... [07:45:07] (03CR) 10Elukey: "Safe to deploy anytime, this will allow us to rollout the seccomprofile safely to ml-serve-eqiad (as we did for ml-serve-codfw and staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [07:45:21] (03CR) 10Tiziano Fogli: "Unless I'm mistaken, there is no logrotate.d entry that matches the new debug log file. 
Do you think it would be useful to manage it with " [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [07:45:23] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10862325 (10Stevemunene) [07:45:49] !log installing nodejs security updates [07:45:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:01] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151598 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [07:46:18] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 139, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:24] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 219, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:46:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:47:25] !log installing intel-microcode security updates on Bullseye [07:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:13] (03CR) 10Tiziano Fogli: [C:03+1] prometheus: Fix homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 10Majavah) [07:49:32] (03PS1) 10Slyngshede: IDM: Update to Bitu 0.1.12 [dns] - 
10https://gerrit.wikimedia.org/r/1151599 [07:51:28] (03PS2) 10Elukey: admin_ng: Update knative-serving image versions for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) [07:51:28] (03PS2) 10Elukey: admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) [07:51:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2186 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76549 and previous config saved to /var/cache/conftool/dbconfig/20250528-075147-root.json [07:51:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [07:54:41] (03CR) 10Brouberol: [C:03+2] airflow-dev: make kubeconfig group-owned by the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1150655 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [07:59:18] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 140, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:18] (03PS1) 10Elukey: kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) [07:59:24] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 220, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:44] (03CR) 10Majavah: [C:03+2] prometheus: Fix homepage redirect [puppet] - 10https://gerrit.wikimedia.org/r/1146973 (owner: 
10Majavah) [08:00:04] dancy and andre: How many deployers does it take to do MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T0800). [08:00:25] (03CR) 10CI reject: [V:04-1] kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:01:03] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1151598 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [08:01:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [08:02:00] (03CR) 10Elukey: "This change can be rolled out anytime." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:03:34] FIRING: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:04:48] (03PS1) 10Ayounsi: Netops: replace regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1151601 [08:06:04] (03CR) 10CI reject: [V:04-1] Netops: replace regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1151601 (owner: 10Ayounsi) [08:06:06] (03CR) 10Slyngshede: [C:03+2] IDM: Update to Bitu 0.1.12 [dns] - 10https://gerrit.wikimedia.org/r/1151599 (owner: 10Slyngshede) [08:06:09] (03PS1) 10Brouberol: deployment_server: create the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1151602 [08:06:12] (03PS2) 10Ayounsi: Netops: replace regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1151601 [08:06:31] !log slyngshede@dns1004 START - running authdns-update [08:06:52] (03CR) 10Brouberol: [C:03+1] relforge: disable monitoring notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) (owner: 10Bking) [08:07:12] !log slyngshede@dns1004 END - running authdns-update [08:07:38] (03PS2) 10Elukey: kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) [08:07:38] (03PS1) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [08:08:52] (03CR) 10CI reject: [V:04-1] kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) 
[08:08:53] (03CR) 10CI reject: [V:04-1] admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:08:56] (03CR) 10Elukey: kserve-inference: set seccomp defaults in the chart (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:09:32] (03CR) 10Elukey: "I found a leftover setting that shouldn't be used anymore (see comments in the code), I am pretty sure it is fine to rollout but we should" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [08:11:23] (03PS3) 10Elukey: kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) [08:11:23] (03PS2) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [08:12:02] (03CR) 10Muehlenhoff: [C:03+1] "PCC all fine as well: https://puppet-compiler.wmflabs.org/output/1151221/5699/" [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:12:05] (03CR) 10Muehlenhoff: [C:03+2] profile::ganeti: add magru to v6_prefixes [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:12:15] (03PS2) 10Ayounsi: profile::ganeti: add magru to v6_prefixes [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) [08:13:26] (03CR) 10Tiziano Fogli: [C:03+1] Netops: replace regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1151601 (owner: 10Ayounsi) [08:13:59] (03CR) 10Muehlenhoff: [C:03+2] profile::ganeti: add magru to v6_prefixes [puppet] - 
10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:14:01] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] profile::ganeti: add magru to v6_prefixes [puppet] - 10https://gerrit.wikimedia.org/r/1151221 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:14:42] (03PS3) 10Elukey: admin_ng: Update knative-serving image versions for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) [08:14:42] (03PS3) 10Elukey: admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) [08:14:42] (03PS4) 10Elukey: kserve-inference: set seccomp defaults in the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) [08:14:43] (03PS3) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [08:15:23] (03PS4) 10Ayounsi: Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) [08:16:30] (03PS4) 10Elukey: admin_ng: set secure-pod-defaults to "enabled" for knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151604 (https://phabricator.wikimedia.org/T369493) [08:17:00] (03PS6) 10Ayounsi: Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:19:12] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:19:47] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 8 hosts with reason: Maintenance 
[08:19:58] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:20:20] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus1006.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:21:25] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:21:33] (03CR) 10Ayounsi: [C:03+2] Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:21:42] (03CR) 10Ayounsi: [C:03+2] Netops: replace regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1151601 (owner: 10Ayounsi) [08:22:24] (03CR) 10Tiziano Fogli: [C:03+1] Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:22:46] (03Merged) 10jenkins-bot: Add alerting for network side routed Ganeti BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151200 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [08:23:03] (03CR) 10Muehlenhoff: [C:03+2] Initial cluster config for ganeti03 [puppet] - 10https://gerrit.wikimedia.org/r/1151209 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [08:23:13] (03CR) 10Stevemunene: [C:03+1] deployment_server: create the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1151602 (owner: 10Brouberol) [08:23:19] (03Merged) 10jenkins-bot: Netops: replace regex with stripPort [alerts] - 10https://gerrit.wikimedia.org/r/1151601 
(owner: 10Ayounsi) [08:23:24] (03PS1) 10MVernon: apus: add docker-registry user with dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1151605 (https://phabricator.wikimedia.org/T394476) [08:23:43] (03CR) 10Brouberol: [C:03+2] deployment_server: create the airflow-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/1151602 (owner: 10Brouberol) [08:27:09] (03PS1) 10Majavah: O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) [08:28:50] (03PS2) 10Majavah: O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) [08:29:24] (03PS1) 10Majavah: Add fake metricsinfra Phabricator credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1151607 (https://phabricator.wikimedia.org/T394446) [08:30:09] PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:30:09] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - prometheus_80: Servers prometheus2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [08:30:52] (03PS1) 10Urbanecm: [beta] Do not use EventBus for weighted tag updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151608 (https://phabricator.wikimedia.org/T395414) [08:32:55] FIRING: [2x] SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:33:27] (03PS1) 10Marostegui: es1038: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1151609 (https://phabricator.wikimedia.org/T394469) 
[08:33:37] (03CR) 10Majavah: [V:03+2 C:03+2] Add fake metricsinfra Phabricator credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1151607 (https://phabricator.wikimedia.org/T394446) (owner: 10Majavah) [08:33:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1038 T394469', diff saved to https://phabricator.wikimedia.org/P76550 and previous config saved to /var/cache/conftool/dbconfig/20250528-083338-marostegui.json [08:33:49] T394469: Migrate es6 to MariaDB 10.11 - https://phabricator.wikimedia.org/T394469 [08:34:01] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1038.eqiad.wmnet with reason: Maintenance [08:34:12] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2035 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1151610 (https://phabricator.wikimedia.org/T395420) [08:34:26] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es2035 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1151611 (https://phabricator.wikimedia.org/T395421) [08:34:58] (03Abandoned) 10Marostegui: mariadb: Promote es2035 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1151611 (https://phabricator.wikimedia.org/T395421) (owner: 10Gerrit maintenance bot) [08:35:28] (03CR) 10Marostegui: [C:03+2] es1038: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1151609 (https://phabricator.wikimedia.org/T394469) (owner: 10Marostegui) [08:37:45] (03PS1) 10Tiziano Fogli: Revert "prometheus: Fix homepage redirect" [puppet] - 10https://gerrit.wikimedia.org/r/1151612 [08:41:00] (03CR) 10Elukey: [C:03+1] "<3" [labs/private] - 10https://gerrit.wikimedia.org/r/1151605 (https://phabricator.wikimedia.org/T394476) (owner: 10MVernon) [08:41:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10862565 (10Joe) [08:41:19] 06SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10862567 (10Joe) p:05Triage→03High [08:42:31] (03CR) 10Majavah: [C:03+1] Revert "prometheus: Fix homepage redirect" [puppet] - 10https://gerrit.wikimedia.org/r/1151612 (owner: 10Tiziano Fogli) [08:42:42] (03CR) 10Tiziano Fogli: [C:03+2] Revert "prometheus: Fix homepage redirect" [puppet] - 10https://gerrit.wikimedia.org/r/1151612 (owner: 10Tiziano Fogli) [08:43:34] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1038.eqiad.wmnet with reason: Maintenance [08:46:21] (03PS1) 10Vgutierrez: hiera: Enable edge uniques in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1151613 (https://phabricator.wikimedia.org/T391411) [08:46:58] (03CR) 10MVernon: [V:03+2 C:03+2] apus: add docker-registry user with dummy credentials [labs/private] - 10https://gerrit.wikimedia.org/r/1151605 (https://phabricator.wikimedia.org/T394476) (owner: 10MVernon) [08:47:07] RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:47:09] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:48:21] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:48:59] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [08:51:09] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org [08:52:31] !log 
marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1038.eqiad.wmnet with reason: Maintenance [08:53:16] (03CR) 10Michael Große: [C:03+1] "I concur with this approach. Testing Add a Link on beta is important for Growth for our current Q4 hypothesis." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151608 (https://phabricator.wikimedia.org/T395414) (owner: 10Urbanecm) [08:53:19] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151613 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [08:53:30] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org [08:53:36] jouncebot: nowandnext [08:53:36] For the next 1 hour(s) and 6 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T0800) [08:53:37] In 1 hour(s) and 6 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1000) [08:53:39] (03PS5) 10Hashar: gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [08:53:49] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [08:53:59] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp2004.wikimedia.org [08:54:03] (03CR) 10Urbanecm: [C:03+2] [beta] Do not use EventBus for weighted tag updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151608 (https://phabricator.wikimedia.org/T395414) (owner: 10Urbanecm) [08:54:09] 10ops-eqiad, 06DBA, 06DC-Ops: es1038 not booting up after reboot - https://phabricator.wikimedia.org/T395424 (10Marostegui) 03NEW [08:54:18] 10ops-eqiad, 06DBA, 06DC-Ops: es1038 not booting up after 
reboot - https://phabricator.wikimedia.org/T395424#10862637 (10Marostegui) p:05Triage→03High [08:54:28] 10ops-eqiad, 06DBA, 06DC-Ops: es1038 not booting up after reboot - https://phabricator.wikimedia.org/T395424#10862639 (10Marostegui) [08:54:45] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10862641 (10MoritzMuehlenhoff) [08:54:49] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10862643 (10MoritzMuehlenhoff) FYI, I've updated the task description, access to analytics-privatedata-access for WMF staff doesn't need service owner... [08:54:49] (03Merged) 10jenkins-bot: [beta] Do not use EventBus for weighted tag updates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151608 (https://phabricator.wikimedia.org/T395414) (owner: 10Urbanecm) [08:56:23] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti7002.magru.wmnet [08:56:27] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idp2004.wikimedia.org [08:56:57] (03PS1) 10Marostegui: es1038: Host down [puppet] - 10https://gerrit.wikimedia.org/r/1151615 (https://phabricator.wikimedia.org/T395424) [08:57:35] 10ops-eqiad, 06DBA, 06DC-Ops, 13Patch-For-Review: es1038 not booting up after reboot - https://phabricator.wikimedia.org/T395424#10862664 (10Marostegui) I also tried a hardreset from the idrac, but it made no difference. [08:58:40] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge uniques in eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1151613 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:00:12] (03CR) 10Marostegui: [C:03+2] es1038: Host down [puppet] - 10https://gerrit.wikimedia.org/r/1151615 (https://phabricator.wikimedia.org/T395424) (owner: 10Marostegui) [09:00:26] Emperor: ok to merge? 
[09:00:54] (03PS1) 10Slyngshede: IDP: Failover [dns] - 10https://gerrit.wikimedia.org/r/1151616 [09:00:54] marostegui: you can skip those btw [09:01:08] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [09:01:10] (it's what I just did and pinged Emperor on -sre) [09:01:17] just saw it yes [09:01:30] I just skipped them vgutierrez thanks! [09:02:02] 10SRE-swift-storage, 10Ceph, 06Data-Persistence, 06serviceops, 13Patch-For-Review: Onboard the Docker Registry to apus - https://phabricator.wikimedia.org/T394476#10862692 (10MatthewVernon) @elukey OK, I've set that up for you. Quota is 3T, but can be adjusted as needed - it's really there so we can keep... [09:02:41] (03PS3) 10Majavah: O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) [09:02:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti7002.magru.wmnet [09:02:55] RESOLVED: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:03:54] (03PS4) 10Majavah: O:wmcs::metricsinfra: Add alertmanager Phab integration [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) [09:04:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for es2035.codfw.wmnet [09:04:28] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool es2035 - Upgrading es2035.codfw.wmnet [09:04:47] (03CR) 10Slyngshede: [C:03+2] IDP: Failover [dns] - 10https://gerrit.wikimedia.org/r/1151616 (owner: 10Slyngshede) [09:04:58] !log slyngshede@dns1004 START - running authdns-update [09:05:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2035', diff saved to https://phabricator.wikimedia.org/P76551 and previous config saved to 
/var/cache/conftool/dbconfig/20250528-090528-marostegui.json [09:05:39] !log slyngshede@dns1004 END - running authdns-update [09:05:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) es2035 - Upgrading es2035.codfw.wmnet [09:06:26] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for magru03 - jmm@cumin1003" [09:06:31] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new VIP for magru03 - jmm@cumin1003" [09:06:31] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:08:25] FIRING: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:08:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:11:35] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.upgrade (exit_code=99) for es2035.codfw.wmnet [09:11:39] (03PS1) 10Giuseppe Lavagetto: Modify makefile for new version of hiddenparma switching to pyproject.toml [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1151619 [09:11:47] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idp1004.wikimedia.org [09:12:47] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:14:21] !log installing docker.io bookworm updates [09:14:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:44] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM
idp1004.wikimedia.org [09:17:37] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org [09:18:22] ayounsi@cumin1002 netbox (PID 4000945) is awaiting input [09:21:33] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org [09:22:12] (03PS1) 10Ayounsi: magru: add ganeti to the list of "customers" [homer/public] - 10https://gerrit.wikimedia.org/r/1151621 (https://phabricator.wikimedia.org/T394263) [09:22:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76553 and previous config saved to /var/cache/conftool/dbconfig/20250528-092212-root.json [09:23:24] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10862752 (10MoritzMuehlenhoff) [09:23:33] (03CR) 10Klausman: [C:03+1] "> Up to now we have been running with the custom seccomp settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:23:46] !log installing distro-info-data updates on Bullseye/Bookworm [09:23:48] (03CR) 10Klausman: [C:03+1] admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:23:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:49] (03PS1) 10Federico Ceratto: sre.mysql.upgrade Fix optional --task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/1151620 (https://phabricator.wikimedia.org/T395325) [09:24:49] (03CR) 10Federico Ceratto: "As discussed on IRC" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151620 (https://phabricator.wikimedia.org/T395325) (owner: 10Federico Ceratto) [09:24:53] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [homer/public] - 
10https://gerrit.wikimedia.org/r/1151621 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:24:54] (03CR) 10Klausman: admin_ng: Update knative-serving image versions for ml-serve-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:25:06] (03CR) 10Elukey: "Yes definitely, I removed the support for overriding seccomp for the kserve-container only since we are going to do it at the pod level (s" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:26:21] (03CR) 10Elukey: admin_ng: Update knative-serving image versions for ml-serve-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [09:26:42] (03PS1) 10Slyngshede: IDM: Fail-back [dns] - 10https://gerrit.wikimedia.org/r/1151622 [09:27:12] (03CR) 10CI reject: [V:04-1] IDM: Fail-back [dns] - 10https://gerrit.wikimedia.org/r/1151622 (owner: 10Slyngshede) [09:27:37] ayounsi@cumin1002 netbox (PID 4000945) is awaiting input [09:29:19] (03PS1) 10Ayounsi: Add PTR include for public1-virtual-magru [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) [09:29:51] (03CR) 10CI reject: [V:04-1] Add PTR include for public1-virtual-magru [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:31:40] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [09:31:44] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:32:38] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10862773 (10MoritzMuehlenhoff) [09:33:13] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) 
[09:34:14] (03CR) 10Cathal Mooney: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:34:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:35:25] !log installing prometheus-postfix-exporter updates from Bookworm point release [09:35:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:44] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:37:10] (03PS1) 10Vgutierrez: hiera: Enable edge unique cookies in magru [puppet] - 10https://gerrit.wikimedia.org/r/1151625 (https://phabricator.wikimedia.org/T391411) [09:37:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76555 and previous config saved to /var/cache/conftool/dbconfig/20250528-093717-root.json [09:39:04] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:39:36] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151625 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:40:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428 (10GGoncalves-WMF) 03NEW [09:40:18] !log remove failing exim4 auto restart from crm2001 T383715 [09:40:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:41:41] (03PS1) 10Brouberol: ceph/server: allow the deployment servers to connect to the radowsgw server [puppet] - 10https://gerrit.wikimedia.org/r/1151628 (https://phabricator.wikimedia.org/T393998) [09:41:58] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru routed ganeti public gw IP - ayounsi@cumin1002" [09:42:04] !log
ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru routed ganeti public gw IP - ayounsi@cumin1002" [09:42:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:42:06] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:42:19] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10862809 (10MoritzMuehlenhoff) [09:42:34] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151628 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:43:07] (03CR) 10Ayounsi: [C:03+2] Add PTR include for public1-virtual-magru [dns] - 10https://gerrit.wikimedia.org/r/1151623 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:43:34] RESOLVED: SystemdUnitFailed: wmf_auto_restart_exim4.service on crm2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:43:44] !log ayounsi@dns1004 START - running authdns-update [09:43:44] (03CR) 10Btullis: [C:03+1] "Looks good." [puppet] - 10https://gerrit.wikimedia.org/r/1151628 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:44:27] !log ayounsi@dns1004 END - running authdns-update [09:46:34] (03CR) 10Slyngshede: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1151622 (owner: 10Slyngshede) [09:47:25] (03PS2) 10Slyngshede: IDM: Fail-back [dns] - 10https://gerrit.wikimedia.org/r/1151622 [09:47:47] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[homer/public] - 10https://gerrit.wikimedia.org/r/1151621 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:48:06] (03CR) 10Ayounsi: [C:03+2] magru: add ganeti to the list of "customers" [homer/public] - 10https://gerrit.wikimedia.org/r/1151621 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:48:38] (03Merged) 10jenkins-bot: magru: add ganeti to the list of "customers" [homer/public] - 10https://gerrit.wikimedia.org/r/1151621 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [09:48:43] (03PS4) 10Brouberol: deployment_server: chown the airflow-dev private files to g:airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1151627 (https://phabricator.wikimedia.org/T395125) [09:49:38] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge unique cookies in magru [puppet] - 10https://gerrit.wikimedia.org/r/1151625 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [09:50:13] (03CR) 10Slyngshede: [C:03+2] IDM: Fail-back [dns] - 10https://gerrit.wikimedia.org/r/1151622 (owner: 10Slyngshede) [09:50:20] !log slyngshede@dns1004 START - running authdns-update [09:51:03] !log slyngshede@dns1004 END - running authdns-update [09:52:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76557 and previous config saved to /var/cache/conftool/dbconfig/20250528-095222-root.json [09:52:39] (03CR) 10Brouberol: [C:03+2] ceph/server: allow the deployment servers to connect to the radowsgw server [puppet] - 10https://gerrit.wikimedia.org/r/1151628 (https://phabricator.wikimedia.org/T393998) (owner: 10Brouberol) [09:53:59] (03PS1) 10Marostegui: mariadb: Move db2187 to x3 (s8) [puppet] - 10https://gerrit.wikimedia.org/r/1151629 (https://phabricator.wikimedia.org/T394884) [09:56:12] (03CR) 10Marostegui: [C:03+2] mariadb: Move db2187 to x3 (s8) [puppet] - 10https://gerrit.wikimedia.org/r/1151629 (https://phabricator.wikimedia.org/T394884) 
(owner: 10Marostegui) [09:56:44] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [09:56:59] !log installing node-serialize-javascript security updates [09:57:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T395241)', diff saved to https://phabricator.wikimedia.org/P76558 and previous config saved to /var/cache/conftool/dbconfig/20250528-095707-fceratto.json [09:58:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [09:58:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T395241)', diff saved to https://phabricator.wikimedia.org/P76559 and previous config saved to /var/cache/conftool/dbconfig/20250528-095813-fceratto.json [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1000) [10:02:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2191.codfw.wmnet onto db2186.codfw.wmnet [10:03:10] !log marostegui@cumin1002 START - Cookbook sre.mysql.upgrade for db2187.codfw.wmnet [10:03:25] RESOLVED: SystemdUnitFailed: prometheus-ganeti-exporter.service on ganeti7002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:03:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T395241)', diff saved to https://phabricator.wikimedia.org/P76560 and 
previous config saved to /var/cache/conftool/dbconfig/20250528-100341-fceratto.json [10:03:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:38] (03PS1) 10Marostegui: redact_sanitarium.sh: Remove db2186 db2187 [puppet] - 10https://gerrit.wikimedia.org/r/1151631 (https://phabricator.wikimedia.org/T394884) [10:07:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76561 and previous config saved to /var/cache/conftool/dbconfig/20250528-100727-root.json [10:07:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T395241)', diff saved to https://phabricator.wikimedia.org/P76562 and previous config saved to /var/cache/conftool/dbconfig/20250528-100758-fceratto.json [10:08:04] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on backup[2002-2003].codfw.wmnet with reason: Downtime hosts for reboot [10:08:20] (03CR) 10Marostegui: [C:03+2] redact_sanitarium.sh: Remove db2186 db2187 [puppet] - 10https://gerrit.wikimedia.org/r/1151631 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [10:09:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2187.codfw.wmnet [10:09:22] (03CR) 10Btullis: [C:03+1] deployment_server: chown the airflow-dev private files to g:airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1151627 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [10:09:40] (03CR) 10Marostegui: [C:03+1] "I tested it and worked fine" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151620 (https://phabricator.wikimedia.org/T395325) (owner: 10Federico Ceratto) [10:10:05] (03CR) 10Federico Ceratto: [C:03+2] 
sre.mysql.upgrade Fix optional --task-id [cookbooks] - 10https://gerrit.wikimedia.org/r/1151620 (https://phabricator.wikimedia.org/T395325) (owner: 10Federico Ceratto) [10:11:51] (03CR) 10Brouberol: [C:03+2] deployment_server: chown the airflow-dev private files to g:airflow-deployers [puppet] - 10https://gerrit.wikimedia.org/r/1151627 (https://phabricator.wikimedia.org/T395125) (owner: 10Brouberol) [10:12:59] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Enabled ScopedTypeaheadSearch for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor) [10:13:08] !log initialise ganeti03 cluster in magru T394263 [10:13:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:14] T394263: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263 [10:14:33] (03PS3) 10Lucas Werkmeister (WMDE): Restore support for Dark Mode on Wikibase pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081067 (https://phabricator.wikimedia.org/T389330) (owner: 10Arthur taylor) [10:14:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10862843 (10MoritzMuehlenhoff) [10:14:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081067 (https://phabricator.wikimedia.org/T389330) (owner: 10Arthur taylor) [10:14:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor) [10:15:12] FIRING: [2x] JobUnavailable: Reduced availability for job 
mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:15:22] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2242.codfw.wmnet onto db2187.codfw.wmnet [10:15:26] !log marostegui@cumin1002 START - Cookbook sre.mysql.depool db2242 - Depool db2242.codfw.wmnet to then clone it to db2187.codfw.wmnet - marostegui@cumin1002 [10:15:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.depool (exit_code=0) db2242 - Depool db2242.codfw.wmnet to then clone it to db2187.codfw.wmnet - marostegui@cumin1002 [10:16:52] (03PS1) 10Marostegui: instances.yaml: Add db2187 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1151634 (https://phabricator.wikimedia.org/T394884) [10:18:02] (03CR) 10Marostegui: [C:03+2] instances.yaml: Add db2187 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1151634 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [10:18:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P76564 and previous config saved to /var/cache/conftool/dbconfig/20250528-101847-fceratto.json [10:18:57] PROBLEM - Host backup2013 is DOWN: PING CRITICAL - Packet loss = 100% [10:19:13] ^that's me, apparently downtime failed due to a non public ticket [10:20:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2187 to dbctl depooled T394884', diff saved to https://phabricator.wikimedia.org/P76565 and previous config saved to /var/cache/conftool/dbconfig/20250528-102015-marostegui.json [10:20:20] T394884: Remove sanitarium hosts from codfw - https://phabricator.wikimedia.org/T394884 [10:20:21] should be up now [10:20:25] RECOVERY - Host backup2013 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [10:22:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 40%: 
Repooling', diff saved to https://phabricator.wikimedia.org/P76566 and previous config saved to /var/cache/conftool/dbconfig/20250528-102233-root.json [10:23:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P76567 and previous config saved to /var/cache/conftool/dbconfig/20250528-102306-fceratto.json [10:28:02] (03PS1) 10Ayounsi: magru: add PTR include for routed ganeti v6 [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) [10:28:35] (03CR) 10CI reject: [V:04-1] magru: add PTR include for routed ganeti v6 [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:28:47] (03PS1) 10Muehlenhoff: Create site.pp entries for VM replacements from ganeti02/magru [puppet] - 10https://gerrit.wikimedia.org/r/1151642 (https://phabricator.wikimedia.org/T394263) [10:28:54] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org [10:29:34] (03CR) 10Vgutierrez: [C:03+1] trafficserver: explicitly specify user/group for systemd unit [puppet] - 10https://gerrit.wikimedia.org/r/1091330 (owner: 10Ssingh) [10:29:49] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:29:55] (03CR) 10CI reject: [V:04-1] Create site.pp entries for VM replacements from ganeti02/magru [puppet] - 10https://gerrit.wikimedia.org/r/1151642 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:32:29] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on dbprov[1003-1006].eqiad.wmnet with reason: Downtime hosts for reboot [10:32:49] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org [10:33:00] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:33:27] !log 
ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru routed ganeti v6 gw IP - ayounsi@cumin1002" [10:33:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: magru routed ganeti v6 gw IP - ayounsi@cumin1002" [10:33:33] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:33:37] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:33:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P76568 and previous config saved to /var/cache/conftool/dbconfig/20250528-103354-fceratto.json [10:35:17] (03PS2) 10Muehlenhoff: Create site.pp entries for VM replacements from ganeti02/magru [puppet] - 10https://gerrit.wikimedia.org/r/1151642 (https://phabricator.wikimedia.org/T394263) [10:35:41] (03CR) 10Klausman: [C:03+1] admin_ng: Update knative-serving image versions for ml-serve-eqiad (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:36:18] (03CR) 10Klausman: [C:03+1] "That sounds like a good approach. Thank you!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1151600 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [10:36:20] (03CR) 10Ayounsi: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:36:23] (03CR) 10CI reject: [V:04-1] Create site.pp entries for VM replacements from ganeti02/magru [puppet] - 10https://gerrit.wikimedia.org/r/1151642 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:37:23] (03PS2) 10Vgutierrez: hiera: Use katran in lvs1013 [puppet] - 10https://gerrit.wikimedia.org/r/1150626 (https://phabricator.wikimedia.org/T395228) [10:37:23] (03PS1) 10Vgutierrez: liberica: Don't set forwarding_cores/numa_node for katran [puppet] - 10https://gerrit.wikimedia.org/r/1151644 (https://phabricator.wikimedia.org/T395228) [10:37:30] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on dbprov[2003-2006].codfw.wmnet with reason: Downtime hosts for reboot [10:37:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76569 and previous config saved to /var/cache/conftool/dbconfig/20250528-103738-root.json [10:38:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P76570 and previous config saved to /var/cache/conftool/dbconfig/20250528-103813-fceratto.json [10:39:05] (03PS2) 10Ayounsi: magru: add PTR include for routed ganeti v6 [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) [10:39:42] (03CR) 10Ayounsi: [C:03+2] magru: add PTR include for routed ganeti v6 [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:39:59] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1150626 
(https://phabricator.wikimedia.org/T395228) (owner: 10Vgutierrez) [10:40:04] !log ayounsi@dns1004 START - running authdns-update [10:40:48] !log ayounsi@dns1004 END - running authdns-update [10:42:30] (03PS1) 10Stevemunene: hdfs: Exclude group 6 rack E7 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151646 (https://phabricator.wikimedia.org/T390173) [10:43:37] (03PS3) 10Muehlenhoff: Create site.pp entries for VM replacements from ganeti02/magru [puppet] - 10https://gerrit.wikimedia.org/r/1151642 (https://phabricator.wikimedia.org/T394263) [10:45:41] (03PS7) 10FNegri: wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) [10:46:52] (03CR) 10CI reject: [V:04-1] wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [10:47:37] (03CR) 10Cathal Mooney: [C:03+1] magru: add PTR include for routed ganeti v6 [dns] - 10https://gerrit.wikimedia.org/r/1151641 (https://phabricator.wikimedia.org/T394263) (owner: 10Ayounsi) [10:49:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T395241)', diff saved to https://phabricator.wikimedia.org/P76571 and previous config saved to /var/cache/conftool/dbconfig/20250528-104901-fceratto.json [10:49:21] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [10:49:30] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T395241)', diff saved to https://phabricator.wikimedia.org/P76572 and previous config saved to /var/cache/conftool/dbconfig/20250528-104928-fceratto.json [10:51:15] (03PS2) 10Btullis: Airflow: Enable the LocalExecutor for the analytics_test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149340 (https://phabricator.wikimedia.org/T394398) [10:52:45] 
!log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76573 and previous config saved to /var/cache/conftool/dbconfig/20250528-105245-root.json [10:53:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T395241)', diff saved to https://phabricator.wikimedia.org/P76574 and previous config saved to /var/cache/conftool/dbconfig/20250528-105320-fceratto.json [10:53:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [10:53:47] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76575 and previous config saved to /var/cache/conftool/dbconfig/20250528-105346-fceratto.json [10:54:23] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5702/co" [puppet] - 10https://gerrit.wikimedia.org/r/1151606 (https://phabricator.wikimedia.org/T394446) (owner: 10Majavah) [10:54:59] (03CR) 10Jgiannelos: [C:03+2] pcs/RB sunset: Remove unnecessary definition rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151177 (owner: 10Jgiannelos) [10:55:14] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1142607 (owner: 10PipelineBot) [10:55:46] (03CR) 10Muehlenhoff: [C:03+2] Create site.pp entries for VM replacements from ganeti02/magru [puppet] - 10https://gerrit.wikimedia.org/r/1151642 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [10:56:03] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T395241)', diff saved to https://phabricator.wikimedia.org/P76576 and previous config saved to /var/cache/conftool/dbconfig/20250528-105602-fceratto.json [10:56:38] (03Merged) 
10jenkins-bot: pcs/RB sunset: Remove unnecessary definition rule [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151177 (owner: 10Jgiannelos) [10:57:01] (03PS1) 10Vgutierrez: hiera: Enable edge unique cookies in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1151647 (https://phabricator.wikimedia.org/T391411) [10:58:41] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151647 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [11:00:14] (03CR) 10Brouberol: [C:03+1] hdfs: Exclude group 6 rack E7 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151646 (https://phabricator.wikimedia.org/T390173) (owner: 10Stevemunene) [11:04:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76577 and previous config saved to /var/cache/conftool/dbconfig/20250528-110423-fceratto.json [11:07:42] (03CR) 10FNegri: [C:03+2] wikireplicas: remove dashes from script names [puppet] - 10https://gerrit.wikimedia.org/r/1148358 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [11:07:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2035 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76578 and previous config saved to /var/cache/conftool/dbconfig/20250528-110750-root.json [11:11:10] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P76579 and previous config saved to /var/cache/conftool/dbconfig/20250528-111109-fceratto.json [11:18:10] (03PS2) 10Ayounsi: Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) [11:19:22] (03CR) 10CI reject: [V:04-1] Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) 
[11:19:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P76580 and previous config saved to /var/cache/conftool/dbconfig/20250528-111931-fceratto.json [11:20:28] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudlb1001.eqiad.wmnet with OS bookworm [11:20:58] (03PS3) 10Ayounsi: Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) [11:23:37] (03PS1) 10Muehlenhoff: netbox: Add magru03 [puppet] - 10https://gerrit.wikimedia.org/r/1151653 (https://phabricator.wikimedia.org/T394263) [11:24:02] PROBLEM - BGP status on cloudsw1-c8-eqiad.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [11:24:18] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [11:25:04] (03PS1) 10FNegri: wikireplicas: delete obsolete maintain* users [puppet] - 10https://gerrit.wikimedia.org/r/1151655 (https://phabricator.wikimedia.org/T395432) [11:25:06] (03PS1) 10FNegri: wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1151656 (https://phabricator.wikimedia.org/T395266) [11:25:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 137409 [11:26:10] (03PS2) 10FNegri: wikireplicas: delete obsolete maintain* users [puppet] - 10https://gerrit.wikimedia.org/r/1151655 (https://phabricator.wikimedia.org/T395432) [11:26:11] (03PS2) 10FNegri: wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1151656 (https://phabricator.wikimedia.org/T395266) [11:26:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P76581 and previous config saved to 
/var/cache/conftool/dbconfig/20250528-112617-fceratto.json [11:28:06] (03Abandoned) 10FNegri: wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1150715 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [11:29:40] ayounsi@cumin1002 peering (PID 4140049) is awaiting input [11:29:43] (03CR) 10CI reject: [V:04-1] wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1151656 (https://phabricator.wikimedia.org/T395266) (owner: 10FNegri) [11:32:24] (03CR) 10Ayounsi: [C:03+1] netbox: Add magru03 [puppet] - 10https://gerrit.wikimedia.org/r/1151653 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:34:38] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P76582 and previous config saved to /var/cache/conftool/dbconfig/20250528-113437-fceratto.json [11:34:46] (03PS1) 10Btullis: All all users to traverse /etc/helmfile-defaults/private and subdirs [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) [11:34:48] (03CR) 10Muehlenhoff: [C:03+2] netbox: Add magru03 [puppet] - 10https://gerrit.wikimedia.org/r/1151653 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [11:35:14] (03PS3) 10FNegri: wikireplicas: split db config into separate file [puppet] - 10https://gerrit.wikimedia.org/r/1151656 (https://phabricator.wikimedia.org/T395266) [11:35:59] (03PS1) 10Majavah: team-wmcs: Add host-bound alerts for BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151658 [11:37:27] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5703/co" [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) (owner: 10Btullis) [11:37:47] (03PS2) 10Btullis: All all users to traverse 
/etc/helmfile-defaults/private and subdirs [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) [11:38:27] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 137409 [11:40:52] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5704/co" [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) (owner: 10Btullis) [11:41:13] (03PS3) 10Btullis: All all users to traverse /etc/helmfile-defaults/private and subdirs [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) [11:41:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T395241)', diff saved to https://phabricator.wikimedia.org/P76583 and previous config saved to /var/cache/conftool/dbconfig/20250528-114124-fceratto.json [11:41:42] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [11:41:49] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T395241)', diff saved to https://phabricator.wikimedia.org/P76584 and previous config saved to /var/cache/conftool/dbconfig/20250528-114149-fceratto.json [11:43:27] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'email' for AS: 141626 [11:43:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 141626 [11:46:32] (03CR) 10Stevemunene: [C:03+2] hdfs: Exclude group 6 rack E7 hosts from analytics cluster [puppet] - 10https://gerrit.wikimedia.org/r/1151646 (https://phabricator.wikimedia.org/T390173) (owner: 10Stevemunene) [11:47:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76585 and previous config saved to /var/cache/conftool/dbconfig/20250528-114708-fceratto.json [11:49:45] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T395241)', diff saved to https://phabricator.wikimedia.org/P76586 and previous config saved to /var/cache/conftool/dbconfig/20250528-114944-fceratto.json [11:50:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance [11:50:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T395241)', diff saved to https://phabricator.wikimedia.org/P76587 and previous config saved to /var/cache/conftool/dbconfig/20250528-115012-fceratto.json [11:52:25] FIRING: SystemdUnitFailed: netbox_ganeti_magru03_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:52:38] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker[1163-1165].eqiad.wmnet with reason: Upgrade an-worker hard drives from 4TB to 8TB [11:52:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10863182 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=48236433-7e8e-4078-bb... 
[11:54:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10863184 (10Stevemunene) [11:54:59] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 6 - rack E7) - https://phabricator.wikimedia.org/T390173#10863185 (10Stevemunene) a:03Stevemunene [11:57:39] (03PS4) 10Btullis: Allow all users to traverse /etc/helmfile-defaults/private and subdirs [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) [11:57:52] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on backup[1010-1011].eqiad.wmnet with reason: Downtime hosts for reboot [11:59:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T395241)', diff saved to https://phabricator.wikimedia.org/P76588 and previous config saved to /var/cache/conftool/dbconfig/20250528-115958-fceratto.json [12:01:21] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet [12:01:23] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:02:16] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P76589 and previous config saved to /var/cache/conftool/dbconfig/20250528-120215-fceratto.json [12:02:25] RESOLVED: SystemdUnitFailed: netbox_ganeti_magru03_sync.service on netbox1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:59] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - 
jmm@cumin2002" [12:05:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin2002" [12:05:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:05:07] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache ncredir7003.magru.wmnet on all recursors [12:05:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors [12:05:26] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:07:39] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge unique cookies in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1151647 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [12:10:27] taavi@cumin1002 reimage (PID 4134905) is awaiting input [12:11:01] jmm@cumin2002 makevm (PID 3727761) is awaiting input [12:11:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [12:11:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=93) for new host ncredir7003.magru.wmnet [12:12:51] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage [12:15:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P76590 and previous config saved to /var/cache/conftool/dbconfig/20250528-121505-fceratto.json [12:15:44] (03CR) 10Marostegui: [C:03+1] "This is fine, but I'd suggest until the tests are ok and this can proceed." [puppet] - 10https://gerrit.wikimedia.org/r/1151655 (https://phabricator.wikimedia.org/T395432) (owner: 10FNegri) [12:15:47] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1001.eqiad.wmnet with reason: host reimage [12:16:14] (03CR) 10Brouberol: [C:03+1] "Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) (owner: 10Btullis) [12:17:23] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P76591 and previous config saved to /var/cache/conftool/dbconfig/20250528-121723-fceratto.json [12:17:29] (03PS1) 10Marostegui: check_private_data_report: Remove db2186, db2187 [puppet] - 10https://gerrit.wikimedia.org/r/1151664 (https://phabricator.wikimedia.org/T394884) [12:17:36] (03PS1) 10Muehlenhoff: ganeti: Make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) [12:19:50] !log jynus@cumin1002 DONE (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 4:00:00 on backup[2010-2011].codfw.wmnet with reason: Downtime hosts for reboot [12:20:41] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [12:20:48] (03CR) 10Marostegui: [C:03+2] check_private_data_report: Remove db2186, db2187 [puppet] - 10https://gerrit.wikimedia.org/r/1151664 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [12:24:25] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5705/co" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [12:24:39] (03CR) 10Ayounsi: "lgtm, dunno if tests are needed." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:25:37] (03CR) 10Volans: [C:04-1] ganeti: Make the storage_type configurable (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:25:52] (03PS2) 10Muehlenhoff: ganeti: Make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) [12:26:34] (03CR) 10Tiziano Fogli: [C:03+1] Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:28:16] (03CR) 10Elukey: [C:03+2] admin_ng: Update knative-serving image versions for ml-serve-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151596 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:28:27] (03PS2) 10Majavah: team-wmcs: Add host-bound alerts for BGP sessions [alerts] - 10https://gerrit.wikimedia.org/r/1151658 (https://phabricator.wikimedia.org/T388641) [12:30:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P76592 and previous config saved to /var/cache/conftool/dbconfig/20250528-123012-fceratto.json [12:31:07] (03PS1) 10Marostegui: x3: Make db1255 and db2241 masters [puppet] - 10https://gerrit.wikimedia.org/r/1151668 (https://phabricator.wikimedia.org/T390530) [12:31:08] !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. 
[12:31:32] !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-druid1003.eqiad.wmnet with reason: Cold booting to address disk failure [12:32:13] (03CR) 10Ayounsi: [C:03+2] Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:32:31] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T395241)', diff saved to https://phabricator.wikimedia.org/P76593 and previous config saved to /var/cache/conftool/dbconfig/20250528-123230-fceratto.json [12:32:44] !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:32:49] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [12:32:56] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T395241)', diff saved to https://phabricator.wikimedia.org/P76594 and previous config saved to /var/cache/conftool/dbconfig/20250528-123255-fceratto.json [12:33:21] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:33:25] (03Merged) 10jenkins-bot: Add alerting for long peering BGP down [alerts] - 10https://gerrit.wikimedia.org/r/1151551 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:33:25] (03PS2) 10Marostegui: x3: Make db1255 and db2241 masters [puppet] - 10https://gerrit.wikimedia.org/r/1151668 (https://phabricator.wikimedia.org/T390530) [12:33:41] (03CR) 10Elukey: [C:03+2] admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:33:46] (03PS4) 10Elukey: admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) [12:33:48] (03CR) 10CI reject: [V:04-1] ganeti: Make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:33:49] (03CR) 10Elukey: [V:03+2 C:03+2] admin_ng: enable podspec-securitycontext for all knative clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151597 (https://phabricator.wikimedia.org/T369493) (owner: 10Elukey) [12:34:22] !log elukey@deploy1003 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [12:34:31] !log elukey@deploy1003 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [12:35:33] (03PS3) 10Muehlenhoff: ganeti: Make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) [12:36:03] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1001.eqiad.wmnet with OS bookworm [12:36:05] RECOVERY - BGP status on cloudsw1-c8-eqiad.mgmt is OK: BGP OK - up: 18, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [12:36:18] (03CR) 10Muehlenhoff: ganeti: Make the storage_type configurable (033 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:37:28] (03PS1) 10Marostegui: x3 codfw replicas: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1151670 (https://phabricator.wikimedia.org/T390530) [12:38:46] !log dbmaint x3 codfw make it SBR T390530 [12:38:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:50] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [12:38:55] !log dbmaint x3 codfw make it SBR T383795 [12:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:39:00] T383795: Move sX to 
STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [12:39:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T395241)', diff saved to https://phabricator.wikimedia.org/P76595 and previous config saved to /var/cache/conftool/dbconfig/20250528-123921-fceratto.json [12:39:37] (03CR) 10Marostegui: [C:03+2] x3 codfw replicas: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1151670 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:42:07] (03PS2) 10Btullis: Airflow: Allow the scheduler to reach out to Hadoop on analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149341 (https://phabricator.wikimedia.org/T394398) [12:42:12] (03PS2) 10Btullis: Airflow: increase resources to the analytics_test scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149342 (https://phabricator.wikimedia.org/T394398) [12:43:34] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:44:10] (03CR) 10CI reject: [V:04-1] ganeti: Make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [12:44:35] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db2242 gradually with 4 steps - Pool db2242.codfw.wmnet in after cloning [12:45:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T395241)', diff saved to https://phabricator.wikimedia.org/P76597 and previous config saved to /var/cache/conftool/dbconfig/20250528-124519-fceratto.json [12:45:40] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [12:45:47] !log 
fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T395241)', diff saved to https://phabricator.wikimedia.org/P76598 and previous config saved to /var/cache/conftool/dbconfig/20250528-124547-fceratto.json [12:46:55] (03PS22) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) [12:47:12] (03CR) 10Arnaudb: gerrit: lock, preflight checks, hieradata lookups, verbosity (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1145208 (https://phabricator.wikimedia.org/T393034) (owner: 10Arnaudb) [12:48:45] (03PS1) 10Majavah: team-wmcs: haproxy: Adapt for metric changes in bookworm [alerts] - 10https://gerrit.wikimedia.org/r/1151673 (https://phabricator.wikimedia.org/T375082) [12:50:08] (03PS2) 10Anzx: huwikibooks: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151672 (https://phabricator.wikimedia.org/T395397) [12:50:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151672 (https://phabricator.wikimedia.org/T395397) (owner: 10Anzx) [12:51:20] (03PS1) 10Marostegui: x3 eqiad replicas: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1151674 (https://phabricator.wikimedia.org/T390530) [12:51:58] (03PS1) 10Vgutierrez: hiera: Enable edge unique cookies in esams [puppet] - 10https://gerrit.wikimedia.org/r/1151675 (https://phabricator.wikimedia.org/T391411) [12:52:57] (03CR) 10Marostegui: [C:03+2] x3 eqiad replicas: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1151674 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:53:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T395241)', diff saved to https://phabricator.wikimedia.org/P76599 and previous config saved to 
/var/cache/conftool/dbconfig/20250528-125305-fceratto.json [12:53:39] (03PS2) 10Vgutierrez: hiera: Enable edge unique cookies in esams [puppet] - 10https://gerrit.wikimedia.org/r/1151675 (https://phabricator.wikimedia.org/T391411) [12:53:44] !log dbmaint x3 eqiad make it SBR T383795 T390530 [12:53:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:53:49] T383795: Move sX to STATEMENT based replication - https://phabricator.wikimedia.org/T383795 [12:53:49] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [12:54:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P76600 and previous config saved to /var/cache/conftool/dbconfig/20250528-125427-fceratto.json [12:54:54] (03PS3) 10Vgutierrez: hiera: Enable edge unique cookies in esams [puppet] - 10https://gerrit.wikimedia.org/r/1151675 (https://phabricator.wikimedia.org/T391411) [12:55:13] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151675 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [12:55:17] (03CR) 10Marostegui: "This needs to be merged when we are ready to go RW on these hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1151668 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [12:55:26] (03PS3) 10Marostegui: x3: Make db1255 and db2241 masters [puppet] - 10https://gerrit.wikimedia.org/r/1151668 (https://phabricator.wikimedia.org/T390530) [12:56:13] (03CR) 10Jforrester: [C:03+1] "I'll leave it for you, rather than steal all the fun. 
:-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [12:56:50] (03PS1) 10Marostegui: events_sanitarium.sql: Remove db2186, db2187 [software] - 10https://gerrit.wikimedia.org/r/1151676 (https://phabricator.wikimedia.org/T394884) [12:59:58] (03CR) 10Marostegui: "This is a NOOP" [software] - 10https://gerrit.wikimedia.org/r/1151676 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [12:59:59] (03CR) 10Marostegui: [C:03+2] events_sanitarium.sql: Remove db2186, db2187 [software] - 10https://gerrit.wikimedia.org/r/1151676 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1300). [13:00:05] James_F, Lucas_WMDE, and anzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:08] * James_F waves. [13:00:10] o/ [13:00:17] o/ [13:00:17] Lucas_WMDE: Are you deploying? [13:00:27] (03Merged) 10jenkins-bot: events_sanitarium.sql: Remove db2186, db2187 [software] - 10https://gerrit.wikimedia.org/r/1151676 (https://phabricator.wikimedia.org/T394884) (owner: 10Marostegui) [13:00:51] can do [13:01:01] though idk if I feel confident enough to deploy the beta change [13:01:16] Oh, it's trivial. But I can do it if you want. 
[13:01:19] sure [13:01:54] I’m also still debating what to do about anzx’ change… I’ve been looking at T395397 and I’m inclined to say that there’s still no huwikibooks consensus to be seen [13:01:55] T395397: Transwiki import missing setting in Hungarian Wikibooks - https://phabricator.wikimedia.org/T395397 [13:02:09] just huwiki discussions and a metawiki steward request [13:02:25] (James_F you can go ahead in the meantime ^^) [13:02:30] Ack. [13:02:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester) [13:03:22] (03Merged) 10jenkins-bot: [BETA CLUSTER] Close en_rtlwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1140976 (owner: 10Jforrester) [13:03:45] Lucas_WMDE: i didn't realise discussion happened in Wikipedia, i thought it was wikibooks [13:03:49] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1140976|[BETA CLUSTER] Close en_rtlwiki]] [13:04:02] (03PS1) 10Muehlenhoff: sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) [13:04:12] anzx: only the admin access :/ [13:04:21] * Lucas_WMDE takes another look at that [13:04:34] (note, the link in phab is broken, the `.` at the end needs to be part of the URL) [13:05:57] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1140976|[BETA CLUSTER] Close en_rtlwiki]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:06:15] eh, the importing was at least discussed during their adminship request, with generally supportive responses AFAICT [13:06:39] !log jforrester@deploy1003 jforrester: Continuing with sync [13:06:52] (03PS1) 10Marostegui: wmnet: Update m1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1151679 (https://phabricator.wikimedia.org/T395241) [13:07:11] !log Failover m1 master eqiad dbmaint T395241 [13:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:07:47] (03CR) 10Marostegui: [C:03+2] wmnet: Update m1-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/1151679 (https://phabricator.wikimedia.org/T395241) (owner: 10Marostegui) [13:07:49] !log marostegui@dns1006 START - running authdns-update [13:07:55] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:08:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P76602 and previous config saved to /var/cache/conftool/dbconfig/20250528-130812-fceratto.json [13:08:29] !log marostegui@dns1006 END - running authdns-update [13:08:33] !log marostegui@dns1006 START - running authdns-update [13:09:17] !log marostegui@dns1006 END - running authdns-update [13:09:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P76603 and previous config saved to /var/cache/conftool/dbconfig/20250528-130935-fceratto.json [13:10:32] (03CR) 10CI reject: [V:04-1] 
sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:10:38] (03CR) 10Bking: [C:03+2] relforge: disable monitoring notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151381 (https://phabricator.wikimedia.org/T395309) (owner: 10Bking) [13:12:49] anzx: fyi I’m planning to do your deployment as soon as the current one is done (and then do my changes afterwards) [13:12:54] (03PS2) 10Muehlenhoff: sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) [13:13:05] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.5% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [13:13:33] Lucas_WMDE: Over to you. [13:13:33] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1140976|[BETA CLUSTER] Close en_rtlwiki]] (duration: 09m 44s) [13:13:42] ok! 
[13:14:03] (03CR) 10Ayounsi: [C:03+2] Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:14:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151672 (https://phabricator.wikimedia.org/T395397) (owner: 10Anzx) [13:14:46] (left a comment on the phab task with my opinion on the community consensus in this case ^^) [13:15:02] (03Merged) 10jenkins-bot: huwikibooks: add importsources [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151672 (https://phabricator.wikimedia.org/T395397) (owner: 10Anzx) [13:15:25] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1151672|huwikibooks: add importsources (T395397)]] [13:15:29] T395397: Transwiki import missing setting in Hungarian Wikibooks - https://phabricator.wikimedia.org/T395397 [13:15:44] (03PS12) 10Tiziano Fogli: pdb_resource_exporter: add puppetdb resource exporter to puppedb [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) [13:15:45] (03PS4) 10Tiziano Fogli: monitoring services: add migration task as parameter [puppet] - 10https://gerrit.wikimedia.org/r/1150709 (https://phabricator.wikimedia.org/T395443) [13:17:33] (03PS4) 10Volans: ganeti: make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:17:36] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Backport for [[gerrit:1151672|huwikibooks: add importsources (T395397)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:17:38] Lucas_WMDE: looking [13:17:50] thanks [13:18:14] (03CR) 10Volans: "As agreed on IRC I've fixed the tests and done some minor changes. 
LMK what do you think." [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:18:28] Lucas_WMDE: looks good [13:18:31] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, anzx: Continuing with sync [13:18:33] \o/ [13:18:38] thanks for testing :) [13:19:08] (03PS5) 10Volans: ganeti: make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:19:30] (03CR) 10CI reject: [V:04-1] sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:19:38] !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002 [13:19:40] (03CR) 10Effie Mouzeli: "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [13:19:42] T383811: Ensure Search Platform-owned Elasticsearch cookbooks can handle Opensearch - https://phabricator.wikimedia.org/T383811 [13:20:17] !log bking@cumin2002 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.RESTART (3 nodes at a time) for ElasticSearch cluster search_eqiad: T383811 - bking@cumin2002 [13:20:20] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151675 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:23:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P76605 and previous config saved to /var/cache/conftool/dbconfig/20250528-132320-fceratto.json [13:24:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T395241)', diff saved 
to https://phabricator.wikimedia.org/P76606 and previous config saved to /var/cache/conftool/dbconfig/20250528-132443-fceratto.json [13:25:01] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [13:25:08] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T395241)', diff saved to https://phabricator.wikimedia.org/P76607 and previous config saved to /var/cache/conftool/dbconfig/20250528-132508-fceratto.json [13:25:32] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151672|huwikibooks: add importsources (T395397)]] (duration: 10m 06s) [13:25:36] T395397: Transwiki import missing setting in Hungarian Wikibooks - https://phabricator.wikimedia.org/T395397 [13:25:39] Lucas_WMDE: Thanks for deploying [13:26:53] np :) [13:27:23] (03CR) 10Effie Mouzeli: "chainsaw tests look ok" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) (owner: 10Effie Mouzeli) [13:27:47] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge unique cookies in esams [puppet] - 10https://gerrit.wikimedia.org/r/1151675 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [13:27:51] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [13:28:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081067 (https://phabricator.wikimedia.org/T389330) (owner: 10Arthur taylor) [13:28:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor) [13:29:04] (03PS12) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 
(https://phabricator.wikimedia.org/T395225) [13:29:32] (03Merged) 10jenkins-bot: Restore support for Dark Mode on Wikibase pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1081067 (https://phabricator.wikimedia.org/T389330) (owner: 10Arthur taylor) [13:29:36] (03Merged) 10jenkins-bot: Enabled ScopedTypeaheadSearch for test.wikidata.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148834 (https://phabricator.wikimedia.org/T394669) (owner: 10Arthur taylor) [13:29:57] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1081067|Restore support for Dark Mode on Wikibase pages (T389330)]], [[gerrit:1148834|Enabled ScopedTypeaheadSearch for test.wikidata.org (T394669)]] [13:30:08] !log marostegui@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) db2242 gradually with 4 steps - Pool db2242.codfw.wmnet in after cloning [13:30:14] T389330: Restore support for Dark Mode on Wikibase Pages - https://phabricator.wikimedia.org/T389330 [13:30:14] T394669: Enable Scoped Type-Ahead Search on Test and Beta Wikidata - https://phabricator.wikimedia.org/T394669 [13:31:11] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new ncredir node for magru03 - jmm@cumin2002" [13:31:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.526s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:31:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: new ncredir node for magru03 - jmm@cumin2002" [13:31:17] !log jmm@cumin2002 
END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:39] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T395241)', diff saved to https://phabricator.wikimedia.org/P76609 and previous config saved to /var/cache/conftool/dbconfig/20250528-133139-fceratto.json [13:32:07] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, arthurtaylor: Backport for [[gerrit:1081067|Restore support for Dark Mode on Wikibase pages (T389330)]], [[gerrit:1148834|Enabled ScopedTypeaheadSearch for test.wikidata.org (T394669)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [13:32:59] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:33:11] testing… [13:33:19] (03PS1) 10Bking: apifeatureusage: switch to Observability-maintained curator [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) [13:33:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [13:34:32] dark mode works and looks fine [13:35:04] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10863623 (10MatthewVernon) Just apropos the sizes, I took one at random (wikipedia-commons-local-thumb.f9), and whilst eqiad is bigger, it's not a lot bigger: eqiad: 9,285,690 objects... 
[13:35:08] (03PS1) 10Clément Goubert: mw-cron, mw-script: Limit resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151689 (https://phabricator.wikimedia.org/T395436) [13:35:26] Scoped Typeahead Search also works and looks fine [13:35:29] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, arthurtaylor: Continuing with sync [13:35:31] let’s roll [13:36:09] \o/ party :) [13:36:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.526s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:36:38] (03CR) 10Ayounsi: [C:03+1] ganeti: make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:37:45] (03CR) 10Volans: [C:03+2] ganeti: make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:38:19] (03CR) 10Alexandros Kosiaris: [C:03+1] Allow all users to traverse /etc/helmfile-defaults/private and subdirs [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) (owner: 10Btullis) [13:38:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T395241)', diff saved to https://phabricator.wikimedia.org/P76610 and previous config saved to /var/cache/conftool/dbconfig/20250528-133827-fceratto.json [13:38:47] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2194.codfw.wmnet with reason: Maintenance [13:38:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T395241)', 
diff saved to https://phabricator.wikimedia.org/P76611 and previous config saved to /var/cache/conftool/dbconfig/20250528-133854-fceratto.json [13:39:05] (03CR) 10Btullis: [C:03+2] Allow all users to traverse /etc/helmfile-defaults/private and subdirs [puppet] - 10https://gerrit.wikimedia.org/r/1151657 (https://phabricator.wikimedia.org/T393998) (owner: 10Btullis) [13:39:23] (03CR) 10Cathal Mooney: [C:03+1] Icinga: remove some network devices checks [puppet] - 10https://gerrit.wikimedia.org/r/1144650 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [13:39:41] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151299 (owner: 10GOlson) [13:39:55] (03CR) 10Tiziano Fogli: "Thank you @adenisse@wikimedia.org. I've filed all the tasks related to the Icinga sunsetting preparation and linked the relevant ones to t" [puppet] - 10https://gerrit.wikimedia.org/r/1143600 (https://phabricator.wikimedia.org/T395442) (owner: 10Tiziano Fogli) [13:39:56] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1255.eqiad.wmnet with reason: Maintenance [13:40:17] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2241.codfw.wmnet with reason: Maintenance [13:40:41] (03CR) 10Marostegui: [C:03+2] x3: Make db1255 and db2241 masters [puppet] - 10https://gerrit.wikimedia.org/r/1151668 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [13:41:33] (03CR) 10Kamila Součková: [C:03+1] mw-cron, mw-script: Limit resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151689 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [13:42:10] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2187,2200,2242-2243].codfw.wmnet with 
reason: Maintenance [13:42:29] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1081067|Restore support for Dark Mode on Wikibase pages (T389330)]], [[gerrit:1148834|Enabled ScopedTypeaheadSearch for test.wikidata.org (T394669)]] (duration: 12m 31s) [13:42:35] T389330: Restore support for Dark Mode on Wikibase Pages - https://phabricator.wikimedia.org/T389330 [13:42:35] T394669: Enable Scoped Type-Ahead Search on Test and Beta Wikidata - https://phabricator.wikimedia.org/T394669 [13:42:53] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye [13:43:01] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10863646 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host... [13:43:02] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 7 hosts with reason: Maintenance [13:43:03] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Replace failed disk in an-druid1003 - https://phabricator.wikimedia.org/T395450 (10BTullis) 03NEW [13:43:06] (03CR) 10Andrew Bogott: [C:03+1] team-wmcs: haproxy: Adapt for metric changes in bookworm [alerts] - 10https://gerrit.wikimedia.org/r/1151673 (https://phabricator.wikimedia.org/T375082) (owner: 10Majavah) [13:43:16] (03CR) 10Majavah: [C:03+2] team-wmcs: haproxy: Adapt for metric changes in bookworm [alerts] - 10https://gerrit.wikimedia.org/r/1151673 (https://phabricator.wikimedia.org/T375082) (owner: 10Majavah) [13:43:33] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Replace failed disk in an-druid1003 - https://phabricator.wikimedia.org/T395450#10863663 (10BTullis) [13:43:35] !log UTC afternoon backport+config window done [13:43:36] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - 
https://phabricator.wikimedia.org/T393229#10863665 (10BTullis) [13:43:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10863666 (10Jclark-ctr) [13:44:14] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 32934 [13:44:26] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Modify makefile for new version of hiddenparma switching to pyproject.toml [software/hiddenparma/deploy] - 10https://gerrit.wikimedia.org/r/1151619 (owner: 10Giuseppe Lavagetto) [13:44:29] (03Merged) 10jenkins-bot: team-wmcs: haproxy: Adapt for metric changes in bookworm [alerts] - 10https://gerrit.wikimedia.org/r/1151673 (https://phabricator.wikimedia.org/T375082) (owner: 10Majavah) [13:44:31] !log Move db1211 and db2162 under x3 masters T390530 T351820 [13:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] T390530: Create topology for x3 hosts - https://phabricator.wikimedia.org/T390530 [13:44:36] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [13:44:52] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10863674 (10BTullis) [13:44:54] (03PS3) 10Muehlenhoff: sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) [13:45:04] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10863678 (10Jclark-ctr) @Stevemunene replaced the drives all yours now [13:45:35] (03PS1) 10Marostegui: db1211,db2162: Move them under x3 [puppet] - 
10https://gerrit.wikimedia.org/r/1151692 (https://phabricator.wikimedia.org/T390530) [13:46:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T395241)', diff saved to https://phabricator.wikimedia.org/P76612 and previous config saved to /var/cache/conftool/dbconfig/20250528-134604-fceratto.json [13:46:48] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P76613 and previous config saved to /var/cache/conftool/dbconfig/20250528-134647-fceratto.json [13:46:55] (03CR) 10Ladsgroup: [C:03+1] db1211,db2162: Move them under x3 [puppet] - 10https://gerrit.wikimedia.org/r/1151692 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [13:47:07] (03PS1) 10Gkyziridis: ores-extension: enable ores extension for rrla without UI for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T391964) [13:47:07] (03CR) 10Marostegui: [C:03+2] db1211,db2162: Move them under x3 [puppet] - 10https://gerrit.wikimedia.org/r/1151692 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [13:47:28] (03PS2) 10Bking: apifeatureusage: switch to Observability-maintained curator [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) [13:47:41] 10ops-eqiad, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on an-druid1003 - https://phabricator.wikimedia.org/T393229#10863688 (10BTullis) Hi @Jclark-ctr - Sorry I didn't spot this ping. The auto-generated ticket only contained #sre and I don't monitor that board. Yes please, if you could replace the drive... 
[13:48:16] (03Merged) 10jenkins-bot: ganeti: make the storage_type configurable [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151665 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:48:17] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Replace failed disk in an-druid1003 - https://phabricator.wikimedia.org/T395450#10863693 (10BTullis) a:03Jclark-ctr [13:48:54] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Replace failed disk in an-druid1003 - https://phabricator.wikimedia.org/T395450#10863696 (10BTullis) [13:49:03] (03CR) 10Clément Goubert: [C:03+2] mw-cron, mw-script: Limit resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151689 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [13:49:21] (03CR) 10Btullis: [C:03+2] Airflow: increase resources to the analytics_test scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149342 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [13:49:28] (03CR) 10Btullis: [C:03+2] Airflow: Allow the scheduler to reach out to Hadoop on analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149341 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [13:49:32] (03CR) 10Btullis: [C:03+2] Airflow: Enable the LocalExecutor for the analytics_test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149340 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [13:49:57] ayounsi@cumin1002 peering (PID 90545) is awaiting input [13:50:12] (03CR) 10CI reject: [V:04-1] sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:50:41] (03Merged) 10jenkins-bot: mw-cron, mw-script: Limit resources [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151689 (https://phabricator.wikimedia.org/T395436) (owner: 10Clément Goubert) [13:51:01] !log cgoubert@deploy1003 helmfile [eqiad] START 
helmfile.d/services/mw-cron: apply [13:51:10] (03Merged) 10jenkins-bot: Airflow: Enable the LocalExecutor for the analytics_test instance [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149340 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [13:51:11] (03Merged) 10jenkins-bot: Airflow: Allow the scheduler to reach out to Hadoop on analytics_test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149341 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [13:51:20] (03Merged) 10jenkins-bot: Airflow: increase resources to the analytics_test scheduler [deployment-charts] - 10https://gerrit.wikimedia.org/r/1149342 (https://phabricator.wikimedia.org/T394398) (owner: 10Btullis) [13:51:39] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [13:51:53] (03PS4) 10Muehlenhoff: sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) [13:52:30] (03PS3) 10Bking: apifeatureusage: switch to Observability-maintained curator [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) [13:52:53] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [13:53:42] 10ops-magru, 06DC-Ops, 10Observability-Metrics, 13Patch-For-Review, 10SRE Observability (FY2024/2025-Q3): missing pdu infos for magru - https://phabricator.wikimedia.org/T387231#10863710 (10tappof) 05Open→03Resolved [13:53:44] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1038 not booting up after reboot - https://phabricator.wikimedia.org/T395424#10863711 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Performed flea power drain server recovered and is up now [13:54:26] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[13:54:41] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [13:56:24] (03PS1) 10Marostegui: x3 masters: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1151695 (https://phabricator.wikimedia.org/T390530) [13:56:36] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10863740 (10Ladsgroup) I think it takes around one month and half to two months, not too much. [13:56:49] (03CR) 10D3r1ck01: [C:03+1] noc: Fix invalid `max-age: 300` syntax to `max-age=300` in fileserve.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151236 (https://phabricator.wikimedia.org/T341859) (owner: 10Krinkle) [13:57:06] (03CR) 10Ladsgroup: [C:03+1] "That has to be explicit? *annoyed noises*" [puppet] - 10https://gerrit.wikimedia.org/r/1151695 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [13:57:14] (03CR) 10Marostegui: "Yes" [puppet] - 10https://gerrit.wikimedia.org/r/1151695 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [13:57:37] (03CR) 10Marostegui: [C:03+2] x3 masters: Make them SBR [puppet] - 10https://gerrit.wikimedia.org/r/1151695 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [13:58:10] (03PS1) 10Volans: CHANGELOG: add changelogs for release v11.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151697 [13:58:31] (03PS1) 10Alexandros Kosiaris: jobqueue: Set the host header in all jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) [13:58:32] (03PS4) 10Bking: apifeatureusage: switch to Observability-maintained curator [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) [13:58:36] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v11.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151697 (owner: 10Volans) [13:58:41] !log 
eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [13:58:52] (03CR) 10CI reject: [V:04-1] sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [13:59:56] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [14:00:00] (03PS1) 10Vgutierrez: hiera: Enable edge unique cookies in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151699 (https://phabricator.wikimedia.org/T391411) [14:00:05] Deploy window Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1400) [14:00:44] (03CR) 10Alexandros Kosiaris: "Adding Andrew, to double check my assertion that the domain would always be present in the job stanzas. Per https://schema.wikimedia.org/r" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151698 (https://phabricator.wikimedia.org/T395451) (owner: 10Alexandros Kosiaris) [14:01:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P76614 and previous config saved to /var/cache/conftool/dbconfig/20250528-140111-fceratto.json [14:01:41] !log Set s8 (wikidata) as RO to split x3 from it T351820 [14:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:45] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [14:01:54] 06SRE, 10SRE-Access-Requests, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Requesting access to deploy for KCVelaga - https://phabricator.wikimedia.org/T395125#10863766 (10brouberol) 05In progress→03Resolved We have worked around needing access to the `deployment` group to use... 
[14:01:55] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P76615 and previous config saved to /var/cache/conftool/dbconfig/20250528-140154-fceratto.json [14:02:33] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [14:02:47] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudlb1002.eqiad.wmnet with OS bookworm [14:02:58] * Amir1 is excited [14:03:31] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151699 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:04:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s8 (wikidata) as RO T351820', diff saved to https://phabricator.wikimedia.org/P76616 and previous config saved to /var/cache/conftool/dbconfig/20250528-140441-marostegui.json [14:06:32] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:07:15] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s1 T389373 [14:07:19] T389373: Switchover s1 master (db2212 -> db2203) - https://phabricator.wikimedia.org/T389373 [14:07:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:07:28] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 137511056 and 11 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:07:33] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with 
reason: Primary switchover s1 T389373 [14:07:54] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:08:25] (03CR) 10Vgutierrez: [C:03+2] hiera: Enable edge unique cookies in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1151699 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:08:26] 10SRE-tools, 06Infrastructure-Foundations, 10observability: Cookbook sre.hosts.remove_downtime does not remove silences - https://phabricator.wikimedia.org/T395032#10863796 (10lmata) [14:08:28] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 52080 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [14:08:52] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v11.0.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1151697 (owner: 10Volans) [14:13:08] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1614) = 93.5% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [14:13:38] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1155.eqiad.wmnet [14:15:09] (03PS1) 10Volans: Upstream release v11.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1151703 [14:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:15:25] (03CR) 10Volans: [C:03+2] Upstream release v11.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1151703 (owner: 10Volans) [14:15:38] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1155.eqiad.wmnet [14:15:42] (03PS4) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) [14:16:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P76621 and previous config saved to /var/cache/conftool/dbconfig/20250528-141619-fceratto.json [14:16:30] (03PS1) 10Clément Goubert: zarcillo: Add egress for prometheus and orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151705 [14:16:37] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1148.eqiad.wmnet [14:16:41] (03PS9) 10Jforrester: Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [14:16:42] (03CR) 10Jforrester: Use `sul` dblist in InitialiseSettings (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [14:16:43] FIRING: [2x] ElevatedMaxLagWDQS: 
WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:16:52] (03CR) 10Jforrester: "PS9: Manual rebase/regen." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [14:16:56] (03CR) 10Jforrester: [C:03+1] Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [14:18:58] (03CR) 10Ladsgroup: [C:03+1] Use `sul` dblist in InitialiseSettings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1137480 (owner: 10BryanDavis) [14:20:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:20:53] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS bullseye [14:21:01] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10863881 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass... 
[14:21:30] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [14:21:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:22:57] stevemunene@cumin1002 init-hadoop-workers (PID 126278) is awaiting input [14:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Change x3 masters', diff saved to https://phabricator.wikimedia.org/P76623 and previous config saved to /var/cache/conftool/dbconfig/20250528-142349-marostegui.json [14:23:59] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T395241)', diff saved to https://phabricator.wikimedia.org/P76624 and previous config saved to /var/cache/conftool/dbconfig/20250528-142359-fceratto.json [14:24:06] 10ops-eqiad, 06DC-Ops: hw troubleshooting: Replace failed disk in an-druid1003 - https://phabricator.wikimedia.org/T395450#10863900 (10Jclark-ctr) Replaced Failed drive updated bios and idrac [14:24:12] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1148.eqiad.wmnet [14:24:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:25:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Change x3 masters weights', diff saved to https://phabricator.wikimedia.org/P76625 and previous config saved to /var/cache/conftool/dbconfig/20250528-142503-marostegui.json [14:25:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - 
https://phabricator.wikimedia.org/T390172#10863911 (10Stevemunene) Ack thanks @Jclark-ctr , proceeding with the next steps [x] Raid0 config ` stevemunene@an-worke... [14:25:34] (03Merged) 10jenkins-bot: Upstream release v11.0.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1151703 (owner: 10Volans) [14:25:46] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10863912 (10Stevemunene) [14:26:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 32934 [14:27:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10863917 (10Arnoldokoth) Ack @MoritzMuehlenhoff Thank you. [14:27:44] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10863918 (10Arnoldokoth) a:03Kappakayala [14:27:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s8 RW T351820', diff saved to https://phabricator.wikimedia.org/P76626 and previous config saved to /var/cache/conftool/dbconfig/20250528-142745-marostegui.json [14:27:50] T351820: Move Wikidata term store to separate database cluster - https://phabricator.wikimedia.org/T351820 [14:27:56] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (and kerberos) for Giuseppe Lavagetto - https://phabricator.wikimedia.org/T395413#10863921 (10Arnoldokoth) 05Open→03In progress [14:28:24] jouncebot: now [14:28:24] For the next 0 hour(s) and 31 minute(s): Wikifunctions Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1400) [14:28:36] dcausse: the alert for wdqs? 
[14:28:36] (03PS1) 10Clément Goubert: zarcillo: Add noc env var and mariadb sections egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151707 [14:28:43] it's us, the db was read only [14:28:50] oh ok! [14:29:19] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1231.eqiad.wmnet with reason: Maintenance [14:29:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1231 (T395241)', diff saved to https://phabricator.wikimedia.org/P76627 and previous config saved to /var/cache/conftool/dbconfig/20250528-142926-fceratto.json [14:30:24] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users, SSH and Kerberos for GGoncalves-WMF - https://phabricator.wikimedia.org/T395428#10863933 (10Arnoldokoth) 05Open→03In progress a:03VirginiaPoundstone [14:31:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T395241)', diff saved to https://phabricator.wikimedia.org/P76628 and previous config saved to /var/cache/conftool/dbconfig/20250528-143126-fceratto.json [14:31:46] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2205.codfw.wmnet with reason: Maintenance [14:31:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:ae5 (External: Facebook) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [14:31:54] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T395241)', diff saved to https://phabricator.wikimedia.org/P76629 and previous config saved to /var/cache/conftool/dbconfig/20250528-143153-fceratto.json [14:32:06] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - 
https://phabricator.wikimedia.org/T394489#10863947 (10MoritzMuehlenhoff) [14:32:22] (03PS2) 10Clément Goubert: zarcillo: Add noc env var and mariadb sections egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151707 [14:32:42] (03PS1) 10Stevemunene: hdfs: re add group5 hosts and remove damaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151709 (https://phabricator.wikimedia.org/T390172) [14:33:01] (03PS5) 10Hnowlan: (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) [14:33:25] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10863960 (10Arnoldokoth) [14:33:59] (03PS1) 10Vgutierrez: hiera: Unify edge uniques settings [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) [14:34:30] marostegui@cumin1002 clone (PID 4068646) is awaiting input [14:34:47] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2242.codfw.wmnet onto db2187.codfw.wmnet [14:35:35] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [14:36:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76630 and previous config saved to /var/cache/conftool/dbconfig/20250528-143603-root.json [14:36:47] 06SRE, 10SRE-Access-Requests: Requesting access to deployment for kgraessle - https://phabricator.wikimedia.org/T395370#10863972 (10Arnoldokoth) 05Open→03In progress [14:36:49] (03PS1) 10Marostegui: db2187: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151712 (https://phabricator.wikimedia.org/T390530) [14:37:02] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T395241)', diff saved to 
https://phabricator.wikimedia.org/P76631 and previous config saved to /var/cache/conftool/dbconfig/20250528-143702-fceratto.json [14:37:06] (03CR) 10Hnowlan: [C:03+2] (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) (owner: 10Hnowlan) [14:37:20] (03CR) 10Marostegui: [C:03+2] db2187: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1151712 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [14:38:13] RESOLVED: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [14:38:40] (03Merged) 10jenkins-bot: (api|rest)-gateway: log 5xx errors by default [deployment-charts] - 10https://gerrit.wikimedia.org/r/1147741 (https://phabricator.wikimedia.org/T394584) (owner: 10Hnowlan) [14:39:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T395241)', diff saved to https://phabricator.wikimedia.org/P76632 and previous config saved to /var/cache/conftool/dbconfig/20250528-143910-fceratto.json [14:40:32] !log dancy@deploy1003 Installing scap version "4.171.0" for 2 host(s) [14:41:36] (03CR) 10Gkyziridis: [C:03+1] "I think we are missing some of the models, e.g. `article-descriptions`, `recommendation-api-ng`, and `revertscoring-*`. 
Lets add the lates" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [14:42:24] !log dancy@deploy1003 Installation of scap version "4.171.0" completed for 2 hosts [14:42:28] (03CR) 10Gkyziridis: [C:03+1] "Done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [14:42:36] (03PS1) 10Marostegui: db1211,db2162: Make them candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/1151714 (https://phabricator.wikimedia.org/T390530) [14:42:49] (03CR) 10Btullis: [C:03+1] hdfs: re add group5 hosts and remove damaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151709 (https://phabricator.wikimedia.org/T390172) (owner: 10Stevemunene) [14:43:30] (03CR) 10Marostegui: [C:03+2] db1211,db2162: Make them candidate masters [puppet] - 10https://gerrit.wikimedia.org/r/1151714 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [14:44:27] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: es1038 not booting up after reboot - https://phabricator.wikimedia.org/T395424#10864025 (10Marostegui) Thank you! 
All good from my side now too [14:46:06] RECOVERY - Juniper alarms on cr2-codfw is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:47:00] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/api-gateway: apply [14:47:08] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/api-gateway: apply [14:47:20] (03CR) 10Stevemunene: [C:03+2] hdfs: re add group5 hosts and remove damaged hosts [puppet] - 10https://gerrit.wikimedia.org/r/1151709 (https://phabricator.wikimedia.org/T390172) (owner: 10Stevemunene) [14:47:59] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/api-gateway: apply [14:48:24] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply [14:49:05] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.11 point update - https://phabricator.wikimedia.org/T394489#10864053 (10MoritzMuehlenhoff) [14:49:20] !log uploaded spicerack_11.0.0 to apt.wikimedia.org bullseye-wikimedia,bookworm-wikimedia [14:49:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:56] (03PS1) 10Elukey: sre.hosts.provision: allow more Drive types for Dell's NVMe settings [cookbooks] - 10https://gerrit.wikimedia.org/r/1151717 (https://phabricator.wikimedia.org/T392844) [14:50:13] !log installing twitter-bootstrap3 security updates [14:50:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:50:38] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151717 (https://phabricator.wikimedia.org/T392844) (owner: 10Elukey) [14:51:00] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage [14:51:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76633 and previous config saved to 
/var/cache/conftool/dbconfig/20250528-145108-root.json [14:51:12] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/api-gateway: apply [14:51:16] 10ops-eqiad, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install apus-be1004 - https://phabricator.wikimedia.org/T392844#10864061 (10elukey) @Jclark-ctr @MatthewVernon I'd need to test https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151717, do you have another hos... [14:51:22] (03CR) 10Elukey: "Pending proper testing on a host, if possible." [cookbooks] - 10https://gerrit.wikimedia.org/r/1151717 (https://phabricator.wikimedia.org/T392844) (owner: 10Elukey) [14:51:38] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply [14:52:09] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1148.eqiad.wmnet [14:52:09] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P76634 and previous config saved to /var/cache/conftool/dbconfig/20250528-145209-fceratto.json [14:52:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10864065 (10ops-monitoring-bot) Host an-worker1148.eqiad.wmnet rebooted by stevemunene@cumin1002 w... 
[14:54:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P76635 and previous config saved to /var/cache/conftool/dbconfig/20250528-145418-fceratto.json [14:54:43] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudlb1002.eqiad.wmnet with reason: host reimage [14:55:35] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/rest-gateway: apply [14:55:44] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [14:56:40] (03CR) 10Volans: [C:03+1] "LGTM, one nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [14:56:56] (03CR) 10Muehlenhoff: "check experimental" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [14:57:39] (03CR) 10Federico Ceratto: [C:03+1] "LGTM, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1151705 (owner: 10Clément Goubert) [14:58:23] (03PS5) 10Muehlenhoff: sre.ganeti.makevm: Support passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) [14:58:32] (03CR) 10Muehlenhoff: sre.ganeti.makevm: Support passing a non-DRBD storage type (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [14:58:42] (03CR) 10Federico Ceratto: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151707 (owner: 10Clément Goubert) [14:58:50] (03CR) 10Clément Goubert: [C:03+2] zarcillo: Add egress for prometheus and orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151705 (owner: 10Clément Goubert) [14:58:56] (03CR) 10Clément Goubert: [C:03+2] zarcillo: Add noc env var and mariadb sections egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151707 (owner: 10Clément Goubert) [15:00:09] (03Merged) 10jenkins-bot: zarcillo: Add egress for prometheus and orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151705 (owner: 10Clément Goubert) [15:00:15] (03Merged) 10jenkins-bot: zarcillo: Add noc env var and mariadb sections egress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151707 (owner: 10Clément Goubert) [15:01:35] (03CR) 10Bartosz Wójtowicz: "I don't see `article-descriptions` and `revscoring-*` models having any image/tag configuration within the `values-ml-staging-codfw.yaml` " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [15:02:24] (03CR) 10Ilias Sarantopoulos: "+1 All images in files named as `values-ml-staging-codfw.yaml` that exist in any of the directories in helmfile.d/ml-services should be up" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) 
(owner: 10Bartosz Wójtowicz) [15:02:59] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [15:03:12] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [15:05:36] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:06:03] (03CR) 10Ilias Sarantopoulos: [C:03+1] "Sorry! Taking a closer look, I just understood what you mean." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [15:06:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76636 and previous config saved to /var/cache/conftool/dbconfig/20250528-150614-root.json [15:07:17] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P76637 and previous config saved to /var/cache/conftool/dbconfig/20250528-150716-fceratto.json [15:07:19] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [15:08:36] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:09:26] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P76638 and previous config saved to /var/cache/conftool/dbconfig/20250528-150925-fceratto.json [15:09:36] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1148.eqiad.wmnet [15:09:36] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:10:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:20] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1155.eqiad.wmnet [15:10:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10864171 (10ops-monitoring-bot) Host an-worker1155.eqiad.wmnet rebooted by stevemunene@cumin1002 w... [15:10:35] (03PS1) 10Andrew Bogott: Upgrade Openstack to version 'epoxy' in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1151720 (https://phabricator.wikimedia.org/T390914) [15:10:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151720 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [15:11:36] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:13:07] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet [15:13:09] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [15:13:55] (03PS1) 10Marostegui: Revert "es1038: Host down" [puppet] - 10https://gerrit.wikimedia.org/r/1151724 [15:14:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P76639 and previous config saved to /var/cache/conftool/dbconfig/20250528-151401-root.json [15:14:27] (03CR) 10Marostegui: [C:03+2] Revert "es1038: Host down" [puppet] - 10https://gerrit.wikimedia.org/r/1151724 (owner: 
10Marostegui) [15:14:36] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudlb1002.eqiad.wmnet with OS bookworm [15:15:12] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:02] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10864229 (10VRiley-WMF) Placing in blocked for the time being while we wait for the RMA [15:16:49] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:16:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:16:54] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:54] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7003.magru.wmnet on all recursors [15:16:57] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors [15:17:12] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [15:17:31] (03CR) 10Andrew Bogott: [C:03+2] Upgrade Openstack to version 'epoxy' in eqiad1 [puppet] - 10https://gerrit.wikimedia.org/r/1151720 (https://phabricator.wikimedia.org/T390914) (owner: 10Andrew Bogott) [15:17:33] !log bking@mwmaint1002 ban Elastic/CS hosts prior to decom T394350 [15:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:37] T394350: decommission cirrussearch1053.eqiad.wmnet + more (see description) - https://phabricator.wikimedia.org/T394350 [15:17:42] PROBLEM - mailman archives on 
lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:17:48] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:32] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 53941 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:18:39] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 8922 bytes in 0.157 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:20:22] (03CR) 10Bartosz Wójtowicz: "I see, thank you! For those cases, I'll update the `values.yaml` in the next patch, which will target production environment models." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [15:20:32] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:20:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove records for VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:20:37] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:20:37] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7003.magru.wmnet on all recursors [15:20:38] !log cgoubert@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[15:20:40] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) ncredir7003.magru.wmnet on all recursors [15:20:44] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7003.magru.wmnet [15:21:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 20%: Repooling', diff saved to https://phabricator.wikimedia.org/P76640 and previous config saved to /var/cache/conftool/dbconfig/20250528-152120-root.json [15:21:45] (03PS3) 10Scott French: deployment_server: deploy the mediawiki-dumps-legacy scap target [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [15:21:49] !log jmm@cumin1003 START - Cookbook sre.ganeti.makevm for new host ncredir7003.magru.wmnet [15:21:51] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [15:22:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T395241)', diff saved to https://phabricator.wikimedia.org/P76641 and previous config saved to /var/cache/conftool/dbconfig/20250528-152223-fceratto.json [15:22:24] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [15:22:32] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Update docker image tags for ML staging models. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [15:24:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T395241)', diff saved to https://phabricator.wikimedia.org/P76642 and previous config saved to /var/cache/conftool/dbconfig/20250528-152433-fceratto.json [15:24:36] RECOVERY - Hadoop NodeManager on an-worker1119 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:24:53] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2227.codfw.wmnet with reason: Maintenance [15:24:59] (03Merged) 10jenkins-bot: ml-services: Update docker image tags for ML staging models. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150696 (https://phabricator.wikimedia.org/T393865) (owner: 10Bartosz Wójtowicz) [15:25:00] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2227 (T395241)', diff saved to https://phabricator.wikimedia.org/P76643 and previous config saved to /var/cache/conftool/dbconfig/20250528-152459-fceratto.json [15:25:00] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:25:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:25:05] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:25:05] !log jmm@cumin1003 START - Cookbook sre.dns.wipe-cache ncredir7003.magru.wmnet on all recursors [15:25:08] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 
ncredir7003.magru.wmnet on all recursors [15:25:31] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:25:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM ncredir7003.magru.wmnet - jmm@cumin1003" [15:26:06] (03PS1) 10Marostegui: mariadb: Switch from s8 to x3 [puppet] - 10https://gerrit.wikimedia.org/r/1151726 (https://phabricator.wikimedia.org/T390530) [15:26:30] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host ncredir7003.magru.wmnet with OS bookworm [15:26:38] 06SRE, 10Ganeti, 06Infrastructure-Foundations, 13Patch-For-Review: Migrating magru to routed Ganeti - https://phabricator.wikimedia.org/T394263#10864321 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm [15:26:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr1-eqiad:xe-3/2/1 (Peering: Facebook (FC-5205147) {#2648}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [15:27:36] PROBLEM - Hadoop NodeManager on an-worker1119 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [15:27:38] (03CR) 10Muehlenhoff: "Tested via test-cookbook on cumin1003" [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [15:27:42] (03CR) 10Muehlenhoff: [C:03+2] sre.ganeti.makevm: Support 
passing a non-DRBD storage type [cookbooks] - 10https://gerrit.wikimedia.org/r/1151678 (https://phabricator.wikimedia.org/T394263) (owner: 10Muehlenhoff) [15:27:53] !log bwojtowicz@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' . [15:28:09] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1155.eqiad.wmnet [15:29:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P76644 and previous config saved to /var/cache/conftool/dbconfig/20250528-152907-root.json [15:29:35] (03CR) 10Marostegui: [C:03+2] mariadb: Switch from s8 to x3 [puppet] - 10https://gerrit.wikimedia.org/r/1151726 (https://phabricator.wikimedia.org/T390530) (owner: 10Marostegui) [15:29:37] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:32:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T395241)', diff saved to https://phabricator.wikimedia.org/P76645 and previous config saved to /var/cache/conftool/dbconfig/20250528-153220-fceratto.json [15:32:37] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:33:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10864361 (10Stevemunene) an worker1148 and 1155 have rejoined the cluster and are balancing {F60772867} {F60772869} [15:33:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10864363 (10Stevemunene) [15:34:42] (03CR) 10Scott French: [C:03+1] "Alright, 
[1] has been merged and nearly 24h has passed since it was deployed. I'm willing to call that rollback-safe, so let's move forwar" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [15:35:55] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10864383 (10Arnoldokoth) @KFrancis Please work with @Corvus on the NDA. [15:36:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Upgrade an-worker hard drives from 4TB to 8TB (group 5 - rack F1) - https://phabricator.wikimedia.org/T390172#10864385 (10Stevemunene) 05Open→03Resolved [15:36:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76646 and previous config saved to /var/cache/conftool/dbconfig/20250528-153625-root.json [15:37:08] jouncebot: nowandnext [15:37:08] No deployments scheduled for the next 1 hour(s) and 22 minute(s) [15:37:08] In 1 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1700) [15:39:37] PROBLEM - BFD status on cloudsw1-d5-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:42:37] RECOVERY - BFD status on cloudsw1-d5-eqiad.mgmt is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:43:09] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1614) = 93.5% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [15:44:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10864463 (10VRiley-WMF) [15:44:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P76647 and previous config saved to /var/cache/conftool/dbconfig/20250528-154412-root.json [15:45:22] (03CR) 10Stevemunene: [C:03+1] "lgtm!" [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [15:46:28] (03PS1) 10Eevans: cassandra-dev2003: use d-i preseed for JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1151730 (https://phabricator.wikimedia.org/T391544) [15:47:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P76648 and previous config saved to /var/cache/conftool/dbconfig/20250528-154726-fceratto.json [15:48:14] (03CR) 10Bking: [C:03+2] apifeatureusage: switch to Observability-maintained curator [puppet] - 10https://gerrit.wikimedia.org/r/1151687 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [15:48:33] PROBLEM - BFD status on cloudsw1-c8-eqiad.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:49:34] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:50:27] (03CR) 10Eevans: [C:03+2] cassandra-dev2003: use d-i preseed for JBOD config [puppet] - 10https://gerrit.wikimedia.org/r/1151730 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [15:51:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 60%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76649 and previous config saved to /var/cache/conftool/dbconfig/20250528-155130-root.json [15:52:24] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10864545 (10VRiley-WMF) [15:52:33] RECOVERY - BFD status on cloudsw1-c8-eqiad.mgmt is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:55:01] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [15:55:26] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=93) for host an-worker1185.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:55:28] (03PS1) 10Bartosz Wójtowicz: ml-services: Update STORAGE_URI for articlequality model. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1151734 (https://phabricator.wikimedia.org/T393865) [15:56:13] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [15:57:27] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:58:17] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [15:58:34] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:59:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 20%: Repooling', diff saved to 
https://phabricator.wikimedia.org/P76650 and previous config saved to /var/cache/conftool/dbconfig/20250528-155918-root.json [16:02:00] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:02:02] RECOVERY - Squid on install1004 is OK: TCP OK - 0.000 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [16:02:18] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 565 bytes in 2.064 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:02:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227', diff saved to https://phabricator.wikimedia.org/P76651 and previous config saved to /var/cache/conftool/dbconfig/20250528-160233-fceratto.json [16:03:06] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [16:03:34] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:05:00] (03PS13) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [16:05:49] (03PS14) 10Effie Mouzeli: validating-admission-policies: add policy to permit hostPath mounts for mediawiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/1146992 (https://phabricator.wikimedia.org/T395225) [16:06:23] (03PS1) 10Foks: SecurePoll: Adding files for U4C vote 2025 [extensions/SecurePoll] 
(wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1151740 (https://phabricator.wikimedia.org/T395386) [16:06:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76652 and previous config saved to /var/cache/conftool/dbconfig/20250528-160636-root.json [16:07:14] (03PS1) 10Foks: SecurePoll: Adding files for U4C vote 2025 [extensions/SecurePoll] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1151741 (https://phabricator.wikimedia.org/T395386) [16:07:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1151740 (https://phabricator.wikimedia.org/T395386) (owner: 10Foks) [16:08:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/SecurePoll] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1151741 (https://phabricator.wikimedia.org/T395386) (owner: 10Foks) [16:08:16] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye [16:08:32] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10864709 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host cassandra-dev2003.... 
[16:11:05] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151711 (https://phabricator.wikimedia.org/T391411) (owner: 10Vgutierrez) [16:12:07] (03PS3) 10Brouberol: airflow: emit lineage metadata to datahub via kafka instead of the GMS REST API [deployment-charts] - 10https://gerrit.wikimedia.org/r/1150595 (https://phabricator.wikimedia.org/T395106) [16:13:10] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.5% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:14:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 40%: Repooling', diff saved to https://phabricator.wikimedia.org/P76653 and previous config saved to /var/cache/conftool/dbconfig/20250528-161423-root.json [16:14:30] (03CR) 10Kgraessle: [C:03+1] ores-extension: enable ores extension for rrla without UI for simplewiki and trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151693 (https://phabricator.wikimedia.org/T391964) (owner: 10Gkyziridis) [16:17:41] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2227 (T395241)', diff saved to https://phabricator.wikimedia.org/P76654 and previous config saved to /var/cache/conftool/dbconfig/20250528-161740-fceratto.json [16:18:00] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2239.codfw.wmnet with reason: Maintenance [16:18:13] !log stevemunene@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-worker1119.eqiad.wmnet with reason: Repair data node volume failure [16:19:45] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ncredir7003.magru.wmnet with OS bookworm [16:19:45] !log jmm@cumin1003 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host ncredir7003.magru.wmnet [16:19:54] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Migrating magru to routed 
Ganeti - https://phabricator.wikimedia.org/T394263#10864800 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host ncredir7003.magru.wmnet with OS bookworm executed with errors: - ncredir7003 (**FA... [16:19:59] (03CR) 10Btullis: "Acknowledged" [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [16:21:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76655 and previous config saved to /var/cache/conftool/dbconfig/20250528-162142-root.json [16:21:50] 06SRE, 06serviceops: Silence RESTGatewayBackendErrorsHigh for envoy_cluster_name: mobileapps_cluster - https://phabricator.wikimedia.org/T394609#10864810 (10hnowlan) I think this can probably be closed given the improvements to mobileapps since. [16:23:50] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [16:26:22] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [16:29:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 60%: Repooling', diff saved to https://phabricator.wikimedia.org/P76656 and previous config saved to /var/cache/conftool/dbconfig/20250528-162928-root.json [16:30:37] (03PS1) 10Bvibber: Enable Lua transform switch for Charts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151747 (https://phabricator.wikimedia.org/T388616) [16:36:06] (03PS4) 10BCornwall: cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 [16:36:28] !log fceratto@deploy1003 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[16:39:09] (03PS1) 10Clément Goubert: mw::maintenance::cirrussearch: foreachwiki ignore errors [puppet] - 10https://gerrit.wikimedia.org/r/1151749 (https://phabricator.wikimedia.org/T388538) [16:40:19] !log xcollazo@deploy1003 Started deploy [airflow-dags/analytics@afad011]: Deploy latest DAGs for main Airflow instance. T385112. [16:40:23] T385112: Investigate reasons for remaining inconsistencies - https://phabricator.wikimedia.org/T385112 [16:40:59] !log xcollazo@deploy1003 Finished deploy [airflow-dags/analytics@afad011]: Deploy latest DAGs for main Airflow instance. T385112. (duration: 00m 39s) [16:41:29] (03CR) 10Ebernhardson: [C:03+1] mw::maintenance::cirrussearch: foreachwiki ignore errors [puppet] - 10https://gerrit.wikimedia.org/r/1151749 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [16:41:32] (03PS1) 10Jforrester: Enable Chart for Phase 4 wikis (all remaining public wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151750 (https://phabricator.wikimedia.org/T393788) [16:41:33] (03PS1) 10Jforrester: Drop Chart roll-out dblists, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) [16:41:56] (03CR) 10Dzahn: [C:03+2] "removing comments - only" [puppet] - 10https://gerrit.wikimedia.org/r/1137485 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:42:22] (03CR) 10CI reject: [V:04-1] Enable Chart for Phase 4 wikis (all remaining public wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151750 (https://phabricator.wikimedia.org/T393788) (owner: 10Jforrester) [16:42:25] (03CR) 10Dzahn: [C:03+2] profile: delete static_rt profile and erb template [puppet] - 10https://gerrit.wikimedia.org/r/1151382 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:43:06] (03CR) 10CI reject: [V:04-1] cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [16:43:10] PROBLEM - MariaDB 
memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [16:43:19] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS bullseye [16:43:32] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10864902 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass... [16:43:33] (03CR) 10Clément Goubert: [C:03+2] mw::maintenance::cirrussearch: foreachwiki ignore errors [puppet] - 10https://gerrit.wikimedia.org/r/1151749 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [16:43:34] FIRING: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:44:20] mutante: can I merge Daniel Zahn: profile: delete static_rt profile and erb template (539faebf34) ? [16:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P76657 and previous config saved to /var/cache/conftool/dbconfig/20250528-164433-root.json [16:44:37] claime: yes, I was about to say that. just go ahead please [16:44:39] ack [16:44:42] (03CR) 10Ssingh: "Sorry for the late review. 
I also see a bunch of stuff in modules/profile/templates/microsites/static-rt.wikimedia.org.erb related to this" [puppet] - 10https://gerrit.wikimedia.org/r/1137485 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [16:44:44] the profile is not used by a role [16:44:53] running [16:45:11] thx [16:45:19] (03PS5) 10BCornwall: cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 [16:46:48] (03PS6) 10BCornwall: cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 [16:48:57] (03PS1) 10Effie Mouzeli: deployment:fix-staging-perm: update fix-staging-perms [puppet] - 10https://gerrit.wikimedia.org/r/1151753 (https://phabricator.wikimedia.org/T276994) [16:52:51] (03PS1) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) [16:53:13] (03CR) 10CI reject: [V:04-1] apifeatureusage: switch to Observability-maintained curator, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [16:53:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1066-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:53:56] (03PS2) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) [16:56:38] (03PS2) 10Dzahn: aptrepo: add thirdparty/ci component to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) [16:56:52] (03PS1) 10Clément 
Goubert: sharded_periodic_job: Allow setting foreachwiki_ignore_errors [puppet] - 10https://gerrit.wikimedia.org/r/1151755 (https://phabricator.wikimedia.org/T388538) [16:57:01] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151755 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [16:57:57] (03CR) 10Dzahn: "@mmuhlenhoff@wikimedia.org Ok, so if we want to rename it to thirdparty/jenkins as you suggest.. then what else do I have to do besides ad" [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [16:58:15] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [16:58:39] (03PS2) 10Jforrester: Enable Chart for Phase 4 wikis (all remaining public wikis) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151750 (https://phabricator.wikimedia.org/T393788) [16:58:39] (03PS2) 10Jforrester: Drop Chart roll-out dblists, no longer needed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151751 (https://phabricator.wikimedia.org/T383079) [16:59:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1038 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P76658 and previous config saved to /var/cache/conftool/dbconfig/20250528-165939-root.json [17:00:05] swfrench-wmf: OwO what's this, a deployment window?? MediaWiki infrastructure (UTC late). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1700). nyaa~ [17:00:23] o/ [17:00:26] (03PS3) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) [17:00:43] o/ [17:00:59] great, let's get this going! 
:) [17:01:23] (03PS1) 10HMonroy: InitialiseSettings: enable multiblocks on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) [17:01:32] (03CR) 10Scott French: [C:03+2] deployment_server: deploy the mediawiki-dumps-legacy scap target [puppet] - 10https://gerrit.wikimedia.org/r/1148203 (https://phabricator.wikimedia.org/T389786) (owner: 10Brouberol) [17:01:56] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [17:02:01] (03PS2) 10Clément Goubert: sharded_periodic_job: Allow setting foreachwiki_ignore_errors [puppet] - 10https://gerrit.wikimedia.org/r/1151755 (https://phabricator.wikimedia.org/T388538) [17:02:32] swfrench-wmf: Ack. [17:03:39] FIRING: [3x] CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1060-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:03:50] btullis: running puppet on the deployment host, which should take 3-4m. once that's done, we can try out the deployment . [17:04:23] Currently I see this, which is a little unexpected. I would have expected to see the most recently (manually) deployed job template. [17:04:27] https://www.irccloud.com/pastebin/IqfU8BCS/ [17:04:32] (03CR) 10Clément Goubert: [C:03+2] sharded_periodic_job: Allow setting foreachwiki_ignore_errors [puppet] - 10https://gerrit.wikimedia.org/r/1151755 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [17:05:00] We have seen it disappearing before, but not yet worked out why. [17:05:43] btullis: last job should have been when? 
[17:05:46] They have a ttl [17:06:41] claime: Ah, that is interesting. Our job is supposed to be like a template, because it is a suspended job, so it should last forever. [17:06:46] hmmm [17:06:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Install and cable Nokia test devices and test servers in codfw - https://phabricator.wikimedia.org/T385217#10865012 (10cmooney) Nokia have been on asking to get the test kit back. Mostly this will fall to DC-Ops but if you can leave m... [17:07:21] https://www.irccloud.com/pastebin/DyNKgzOX/ [17:08:00] sus.png [17:08:09] that's ... interesting [17:08:44] (03PS1) 10Dzahn: doc: add support for PHP version bookworm, drop php_prefix variable [puppet] - 10https://gerrit.wikimedia.org/r/1151757 (https://phabricator.wikimedia.org/T392130) [17:08:50] Is it this? [17:08:55] `ttlSecondsAfterFinished: 604800 # 7 days` [17:09:05] yeah [17:09:16] but I don't think it applies if the job was never started [17:09:29] ^ this is the surprising part [17:10:04] OK, maybe we could just if-gate it in the chart if: [17:10:09] https://www.irccloud.com/pastebin/qVhc73aq/ [17:10:12] (03CR) 10Dzahn: doc: add php8.1 support for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:10:26] (03CR) 10Dzahn: "please see https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151757" [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:10:55] (03CR) 10AOkoth: doc: add php8.1 support for bookworm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:11:24] I don't think it's a problem we need to fix right now, though. 
[17:11:34] (03Abandoned) 10AOkoth: doc: add php8.1 support for bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1151273 (https://phabricator.wikimedia.org/T392130) (owner: 10AOkoth) [17:12:00] btullis: alright, I think we're ready to go if you're comfortable with the current state [17:12:05] For now I'm just tailing events in the dse-k8s-eqiad/mediawiki-dumps-legacy namespace with: `kubectl get events -w` [17:12:29] Yep, I'm happy to proceed and we can look at the disappearing template at another time. [17:12:58] sounds good - off we go, then! [17:13:35] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1151757/5706/" [puppet] - 10https://gerrit.wikimedia.org/r/1151757 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [17:14:27] `Error: Kubernetes cluster unreachable: error loading config file "/etc/kubernetes/mediawiki-dumps-legacy-deploy-dse-k8s-eqiad.config": open /etc/kubernetes/mediawiki-dumps-legacy-deploy-dse-k8s-eqiad.config: permission denied` [17:14:29] ha [17:14:42] alright, we'll need to sort that out first :) [17:15:33] Oh, ok. [17:15:54] (03PS1) 10Scott French: Revert "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1151758 (https://phabricator.wikimedia.org/T389786) [17:16:42] (03CR) 10Scott French: [C:03+2] Revert "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1151758 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [17:17:46] btullis: reverting the prior state. I'll follow up on the task afterward about next steps. [17:17:50] *reverting to [17:17:58] thanks for sticking around!
[17:18:12] (03PS1) 10Jasmine: wikikube: decommission wikikube-worker102[6-8].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1151759 (https://phabricator.wikimedia.org/T383227) [17:18:18] (03PS1) 10Clément Goubert: sharded_periodic_job: Fix false case [puppet] - 10https://gerrit.wikimedia.org/r/1151760 (https://phabricator.wikimedia.org/T388538) [17:18:19] (03CR) 10Dzahn: [V:04-1] "This does not work as expected. It seems to create a remote literally called "replica_settings". https://puppet-compiler.wmflabs.org/outpu" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:18:32] (03CR) 10CI reject: [V:04-1] sharded_periodic_job: Fix false case [puppet] - 10https://gerrit.wikimedia.org/r/1151760 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [17:18:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1064-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:19:12] I'll make a patch to remove the owner:group overrides here. 
https://github.com/wikimedia/operations-puppet/commit/a4f39570e1b1b936f3c6201ed9563c18e2befc16 [17:19:15] (03CR) 10Dzahn: [V:04-1 C:04-1] gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:19:45] (03PS2) 10Clément Goubert: sharded_periodic_job: Fix false case [puppet] - 10https://gerrit.wikimedia.org/r/1151760 (https://phabricator.wikimedia.org/T388538) [17:20:28] (03CR) 10MusikAnimal: InitialiseSettings: enable multiblocks on group1 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [17:20:44] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on elastic[1054,1067,1103].eqiad.wmnet with reason: downtime until decom [17:21:37] 06SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure (Zuul upgrade): Requesting access to contint-roots for Corvus - https://phabricator.wikimedia.org/T395167#10865085 (10KFrancis) Hi @Corvus, I can help you with an NDA. To process the request, I'll need your full name and mailing address.... 
[17:22:21] (03CR) 10Hnowlan: [C:03+1] sharded_periodic_job: Fix false case [puppet] - 10https://gerrit.wikimedia.org/r/1151760 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [17:22:30] (03PS1) 10Btullis: mediawiki-dumps-legacy: Remove user:group overrides for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1151763 (https://phabricator.wikimedia.org/T389786) [17:22:50] swfrench-wmf: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1151763 [17:23:30] (03CR) 10Clément Goubert: [C:03+2] sharded_periodic_job: Fix false case [puppet] - 10https://gerrit.wikimedia.org/r/1151760 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [17:24:18] (03PS2) 10HMonroy: InitialiseSettings: enable multiblocks on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) [17:24:39] (03CR) 10HMonroy: InitialiseSettings: enable multiblocks on group1 (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [17:24:56] !log jmm@cumin1003 START - Cookbook sre.dns.netbox [17:25:20] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5707/co" [puppet] - 10https://gerrit.wikimedia.org/r/1151763 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [17:25:32] (03CR) 10MusikAnimal: [C:03+1] InitialiseSettings: enable multiblocks on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [17:26:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1151757 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [17:26:51] claime: all yours! 
[17:27:36] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:29:23] (03CR) 10Scott French: [C:03+1] "Thanks for catching this!" [puppet] - 10https://gerrit.wikimedia.org/r/1151763 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [17:29:38] (03CR) 10Btullis: [V:03+1 C:03+2] mediawiki-dumps-legacy: Remove user:group overrides for k8s config [puppet] - 10https://gerrit.wikimedia.org/r/1151763 (https://phabricator.wikimedia.org/T389786) (owner: 10Btullis) [17:29:48] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-druid1003.eqiad.wmnet with OS bullseye [17:30:47] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hmonroy@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [17:31:32] (03Merged) 10jenkins-bot: InitialiseSettings: enable multiblocks on group1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151756 (https://phabricator.wikimedia.org/T377121) (owner: 10HMonroy) [17:31:48] FIRING: PuppetFailure: Puppet has failed on apifeatureusage1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [17:31:53] !log hmonroy@deploy1003 Started scap sync-world: Backport for [[gerrit:1151756|InitialiseSettings: enable multiblocks on group1 (T377121)]] [17:31:55] (03CR) 10Muehlenhoff: "You also need to update the updates config" [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) (owner: 10Dzahn) [17:31:56] claime: FYI, there's a backport running ^^ (?) 
[17:31:57] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [17:32:14] ack [17:32:18] it'll be fine [17:32:26] it won't affect running jobs anyways [17:32:38] at worst the backport will apply the change [17:32:43] Running puppet on deploy1003 now to update the permissions on `/etc/kubernetes/mediawiki-dumps-legacy-deploy-dse-k8s-eqiad.config` [17:32:57] btullis: no you don't :D [17:33:00] because I already am [17:33:18] (03CR) 10AOkoth: [C:03+1] doc: add support for PHP version bookworm, drop php_prefix variable [puppet] - 10https://gerrit.wikimedia.org/r/1151757 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [17:33:23] Oh, ok. Thanks. [17:33:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1061-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [17:34:02] !log hmonroy@deploy1003 hmonroy: Backport for [[gerrit:1151756|InitialiseSettings: enable multiblocks on group1 (T377121)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:35:08] musikanimal: how does test server look? 
[17:35:10] (03PS7) 10BCornwall: cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 [17:36:33] looking [17:37:09] jouncebot: nowandnext [17:37:09] For the next 0 hour(s) and 22 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1700) [17:37:09] In 0 hour(s) and 22 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T1800) [17:37:13] hmonroy: looks good to me [17:37:21] !log hmonroy@deploy1003 hmonroy: Continuing with sync [17:38:06] y'all realise you're outside of a backport window, and we're trying to do infrastructure deployments? [17:38:22] It's ok, you've just applied my changes without realising it [17:38:38] but this was an infra window with things scheduled in there [17:38:56] just next time, ask if it's ok, or use a backport window [17:39:16] claime: my apologies, I checked the schedule, and didn't see the infrastructure deployment [17:39:30] claime: I will ask next time [17:39:33] thanks [17:39:52] btullis: puppet run done [17:39:53] claime: apologies again for the inconvenience [17:40:19] (03PS8) 10BCornwall: cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 [17:40:46] btullis: i don't think it pulled your changes, so you can go ahead and run puppet again x) [17:40:55] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw: BAD PEM3 on cr2-codfw - https://phabricator.wikimedia.org/T394868#10865143 (10cmooney) 05Open→03Resolved All is now good with this one, thanks for sorting it @Jhancock.wm ! ` cmooney@re0.cr2-codfw> show chassis environme... [17:41:23] claime: Ack, doing so now. 
[17:41:28] (03CR) 10Ssingh: [C:03+1] cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [17:42:51] (03CR) 10BCornwall: cdn: Fix "reason" variable reference (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [17:43:09] (03CR) 10BCornwall: [C:03+2] cdn: Fix "reason" variable reference [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [17:43:10] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [17:43:36] (03CR) 10BCornwall: [V:03+2 C:03+2] "`" [cookbooks] - 10https://gerrit.wikimedia.org/r/1146739 (owner: 10BCornwall) [17:44:00] (03PS1) 10Btullis: Allow an-druid1003 to reformat its data drives [puppet] - 10https://gerrit.wikimedia.org/r/1151766 (https://phabricator.wikimedia.org/T393229) [17:45:56] https://www.irccloud.com/pastebin/EeSRn18O/ [17:46:41] btullis: I can now load the mediawiki-dumps-legacy kube configs just w/ ambient access. thanks! [18:07:06] (03PS8) 10AOkoth: wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) [18:07:44] !log swfrench@deploy1003 Finished scap sync-world: Scap deployment to put production in a consistent state - T377121 (duration: 07m 48s) [18:07:48] T377121: Deploy Codex Special:Block / Multiblocks - https://phabricator.wikimedia.org/T377121 [18:08:07] dancy: all yours! [18:08:13] (03CR) 10Joal: [C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1151769 (https://phabricator.wikimedia.org/T395495) (owner: 10Btullis) [18:08:30] Thanks! 
Pressing the train button [18:08:45] (03PS1) 10TrainBranchBot: group1 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151770 (https://phabricator.wikimedia.org/T392173) [18:08:46] (03CR) 10TrainBranchBot: [C:03+2] group1 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151770 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [18:09:33] (03Merged) 10jenkins-bot: group1 to 1.45.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151770 (https://phabricator.wikimedia.org/T392173) (owner: 10TrainBranchBot) [18:13:11] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:13:54] (03PS1) 10Scott French: Revert^2 "deployment_server: deploy the mediawiki-dumps-legacy scap target" [puppet] - 10https://gerrit.wikimedia.org/r/1151771 (https://phabricator.wikimedia.org/T389786) [18:14:45] (03CR) 10Scott French: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151771 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [18:19:05] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group1 to 1.45.0-wmf.3 refs T392173 [18:19:14] T392173: 1.45.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T392173 [18:21:39] (03CR) 10Bking: [C:03+2] apifeatureusage: switch to Observability-maintained curator, part 2 [puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [18:21:54] (03CR) 10Bking: [C:03+2] "self-merging, as puppet is broken on these hosts and the blast radius is very low."
[puppet] - 10https://gerrit.wikimedia.org/r/1151754 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [18:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 23.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.51% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:28:11] (03CR) 10Dzahn: [V:04-1 C:04-1] "the erb template from which the replication.config has code like:" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:29:00] (03CR) 10AOkoth: [C:03+2] wmnet: map os-reports to aux ingress [dns] - 10https://gerrit.wikimedia.org/r/1148379 (https://phabricator.wikimedia.org/T350794) (owner: 10AOkoth) [18:30:12] !log aokoth@dns1004 START - running authdns-update [18:31:00] !log aokoth@dns1004 END - running authdns-update [18:31:13] (03PS6) 10Dzahn: gerrit: add a second replica, start replicating to gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) [18:32:51] (03CR) 10Scott French: "This should now be unblocked thanks to Ben's permissions fix. 
Let me know if you'd like me to coordinate with either of you before a secon" [puppet] - 10https://gerrit.wikimedia.org/r/1151771 (https://phabricator.wikimedia.org/T389786) (owner: 10Scott French) [18:33:59] (03CR) 10Dzahn: [C:03+2] "that's already deleted now, merged right after" [puppet] - 10https://gerrit.wikimedia.org/r/1137485 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [18:35:18] (03CR) 10Dzahn: [V:03+1] "it works like this. compiler output: https://puppet-compiler.wmflabs.org/output/1140520/5709/gerrit1003.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1140520 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [18:37:04] (03CR) 10Vgutierrez: [C:03+1] "if it's enough for debugging purposes I'm OK with that" [puppet] - 10https://gerrit.wikimedia.org/r/1147884 (https://phabricator.wikimedia.org/T373550) (owner: 10BCornwall) [18:37:15] (03PS3) 10Dzahn: aptrepo: add thirdparty/ci component to bookworm-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/1137361 (https://phabricator.wikimedia.org/T392127) [18:37:39] (03CR) 10Ssingh: "Thanks and sorry for not checking!" [puppet] - 10https://gerrit.wikimedia.org/r/1137485 (https://phabricator.wikimedia.org/T385777) (owner: 10Dzahn) [18:39:53] (03CR) 10Dzahn: [C:03+2] doc: add support for PHP version bookworm, drop php_prefix variable [puppet] - 10https://gerrit.wikimedia.org/r/1151757 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [18:41:17] btullis@cumin1002 reimage (PID 325031) is awaiting input [18:43:11] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [18:49:21] PROBLEM - nova-compute proc minimum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:29] PROBLEM - nova-compute proc minimum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:33] PROBLEM - nova-compute proc minimum on cloudvirt1043 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:34] PROBLEM - nova-compute proc minimum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:34] PROBLEM - nova-compute proc minimum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:35] PROBLEM - nova-compute proc minimum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:36] PROBLEM - nova-compute proc minimum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:39] PROBLEM - nova-compute proc minimum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:46] (03CR) 10Dzahn: [C:03+2] "this fixed the (first, after reimage) puppet run on doc1004. it was noop on other doc hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1151757 (https://phabricator.wikimedia.org/T392130) (owner: 10Dzahn) [18:49:49] PROBLEM - nova-compute proc minimum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:49] PROBLEM - nova-compute proc minimum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:50] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:51] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:49:52] PROBLEM - nova-compute proc minimum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:03] PROBLEM - nova-compute proc minimum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:04] PROBLEM - nova-compute proc minimum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting 
[18:50:04] PROBLEM - nova-compute proc minimum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:05] PROBLEM - nova-compute proc minimum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:07] PROBLEM - nova-compute proc minimum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:09] PROBLEM - nova-compute proc minimum on cloudvirtlocal1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:09] PROBLEM - nova-compute proc minimum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:10] PROBLEM - nova-compute proc minimum on cloudvirt1040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:11] PROBLEM - nova-compute proc minimum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:12] PROBLEM - nova-compute proc minimum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:12] that's fun [18:50:13] PROBLEM - nova-compute proc minimum on 
cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:14] PROBLEM - nova-compute proc minimum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:50:25] lol [18:50:47] I don't see anything obvious that might have triggered this [18:50:57] in puppet or SAL [18:52:09] RECOVERY - nova-compute proc minimum on cloudvirtlocal1002 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:52:23] (03PS1) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) [18:52:45] (03CR) 10CI reject: [V:04-1] apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [18:52:51] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:52:55] (03PS2) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) [18:53:09] RECOVERY - nova-compute proc minimum on cloudvirt1040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:16] (03CR) 10CI reject: [V:04-1] apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775
(https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [18:53:23] PROBLEM - nova-compute proc maximum on cloudvirt1059 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:29] PROBLEM - nova-compute proc maximum on cloudvirt1041 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:29] PROBLEM - nova-compute proc maximum on cloudvirt1054 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:30] PROBLEM - nova-compute proc maximum on cloudvirt1067 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:32] PROBLEM - nova-compute proc maximum on cloudvirtlocal1001 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:33] PROBLEM - nova-compute proc maximum on cloudvirt1053 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:34] RECOVERY - nova-compute proc minimum on cloudvirt1043 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:34] PROBLEM - nova-compute proc maximum on cloudvirt1056 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:49] PROBLEM - nova-compute proc maximum on cloudvirtlocal1003 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:53:49] PROBLEM - nova-compute proc maximum on cloudvirt1044 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:01] PROBLEM - nova-compute proc maximum on cloudvirt1052 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:03] PROBLEM - nova-compute proc maximum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:04] PROBLEM - nova-compute proc maximum on cloudvirt1047 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:04] PROBLEM - nova-compute proc maximum on cloudvirt1051 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:05] PROBLEM - nova-compute proc maximum on cloudvirt1050 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:09] PROBLEM - nova-compute proc maximum on cloudvirt1061 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:09] PROBLEM - nova-compute proc maximum on cloudvirt1057 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:10] PROBLEM - nova-compute proc maximum on cloudvirt1049 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:11] PROBLEM - nova-compute proc maximum on cloudvirt1058 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:12] PROBLEM - nova-compute proc maximum on cloudvirt1065 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:13] PROBLEM - nova-compute proc maximum on cloudvirt1062 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:54:14] (03PS3) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) [18:54:17] PROBLEM - nova-compute proc maximum on cloudvirt1042 is CRITICAL: PROCS CRITICAL: 0 processes with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:55:38] Apparently the nova-compute issues are andrewbogott and are known [18:56:16] Yep, doing a very wide upgrade, they'll clear gradually. 
[18:56:27] They're also all ack'd and resolved in victor ops, not sure why that doesn't show up here. [18:56:51] PROBLEM - nova-compute proc minimum on cloudvirt1046 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:22] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [18:57:35] RECOVERY - nova-compute proc minimum on cloudvirt1050 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:36] RECOVERY - nova-compute proc minimum on cloudvirt1058 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:39] RECOVERY - nova-compute proc minimum on cloudvirt1059 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:49] RECOVERY - nova-compute proc minimum on cloudvirt1053 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:49] RECOVERY - nova-compute proc maximum on cloudvirt1044 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:50] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:51] RECOVERY - nova-compute proc minimum on cloudvirt1052 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* 
/usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:52] RECOVERY - nova-compute proc minimum on cloudvirt1046 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:57:53] RECOVERY - nova-compute proc minimum on cloudvirt1067 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:01] RECOVERY - nova-compute proc maximum on cloudvirt1052 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:03] RECOVERY - nova-compute proc maximum on cloudvirt1047 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:04] RECOVERY - nova-compute proc maximum on cloudvirt1051 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:04] RECOVERY - nova-compute proc minimum on cloudvirt1041 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:05] RECOVERY - nova-compute proc maximum on cloudvirt1048 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:06] RECOVERY - nova-compute proc minimum on cloudvirt1051 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:07] RECOVERY - nova-compute proc minimum on 
cloudvirt1057 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:08] RECOVERY - nova-compute proc maximum on cloudvirt1050 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:09] RECOVERY - nova-compute proc minimum on cloudvirt1042 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:10] RECOVERY - nova-compute proc minimum on cloudvirt1044 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:11] RECOVERY - nova-compute proc maximum on cloudvirt1057 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:12] RECOVERY - nova-compute proc maximum on cloudvirt1061 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:13] RECOVERY - nova-compute proc minimum on cloudvirt1054 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:14] RECOVERY - nova-compute proc maximum on cloudvirt1049 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:15] RECOVERY - nova-compute proc maximum on cloudvirt1058 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute 
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:16] RECOVERY - nova-compute proc minimum on cloudvirt1065 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:17] RECOVERY - nova-compute proc maximum on cloudvirt1062 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:18] RECOVERY - nova-compute proc maximum on cloudvirt1065 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:19] RECOVERY - nova-compute proc minimum on cloudvirt1062 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:20] RECOVERY - nova-compute proc maximum on cloudvirt1042 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:21] RECOVERY - nova-compute proc minimum on cloudvirt1049 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:23] RECOVERY - nova-compute proc maximum on cloudvirt1059 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:29] RECOVERY - nova-compute proc maximum on cloudvirt1041 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:29] RECOVERY - nova-compute proc minimum on cloudvirt1061 
is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:30] RECOVERY - nova-compute proc maximum on cloudvirt1054 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:31] RECOVERY - nova-compute proc maximum on cloudvirt1067 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:34] RECOVERY - nova-compute proc maximum on cloudvirt1053 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:34] RECOVERY - nova-compute proc minimum on cloudvirt1047 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:34] RECOVERY - nova-compute proc minimum on cloudvirt1056 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [18:58:35] RECOVERY - nova-compute proc maximum on cloudvirt1056 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:08] RECOVERY - nova-compute proc minimum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:17] (03PS4) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) [19:02:32] RECOVERY 
- nova-compute proc maximum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:02:39] (03CR) 10CI reject: [V:04-1] apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [19:02:48] RECOVERY - nova-compute proc maximum on cloudvirtlocal1003 is OK: PROCS OK: 1 process with PPID = 1, regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:04] RECOVERY - nova-compute proc minimum on cloudvirtlocal1001 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:03:08] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Swiftly running write operations on files can result in the MediaWiki DB getting out of sync with Swift, resulting in "A non-identical file already exists at
errors" on undelete - https://phabricator.wikimedia.org/T387340#10865523 (10Ladsg... [19:03:11] 10SRE-swift-storage, 06Commons, 10MediaWiki-File-management: Swiftly running write operations on files can result in the MediaWiki DB getting out of sync with Swift, resulting in "A non-identical file already exists at
errors" on undelete - https://phabricator.wikimedia.org/T387340#10865525 (10Ladsg... [19:03:24] (03PS5) 10Bking: apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) [19:06:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [19:09:12] (03CR) 10Bking: [C:03+2] apifeatureusage: switch to Observability-maintained curator, part 3 [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [19:09:35] (03CR) 10Bking: [C:03+2] "self-merging, as Puppet is broken and the blast radius is low." [puppet] - 10https://gerrit.wikimedia.org/r/1151775 (https://phabricator.wikimedia.org/T394742) (owner: 10Bking) [19:09:48] PROBLEM - nova-compute proc minimum on cloudvirt1048 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:10:48] RECOVERY - nova-compute proc minimum on cloudvirt1048 is OK: PROCS OK: 1 process with regex args ^/usr/bin/pytho[n].* /usr/bin/nova-compute https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [19:12:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [19:14:43] (03CR) 10Umherirrender: "I am not sure about deployment. Just adding it to https://wikitech.wikimedia.org/wiki/Deployments?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [19:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:16:48] RESOLVED: PuppetFailure: Puppet has failed on apifeatureusage1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:18:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [19:18:34] RESOLVED: SystemdUnitFailed: curator_actions_apifeatureusage_eqiad.service on apifeatureusage1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:23:48] (03PS1) 10Jforrester: build: Rename the rarely-used 'typos' script to 'checkTypos' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151781 [19:26:54] (03PS2) 10Bvibber: Enable Lua transform switch for Charts on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151747 (https://phabricator.wikimedia.org/T388616) [19:39:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, May 28 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [19:41:20] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10865655 (10wiki_willy) I just filled out the registration for the seed server 
today, so it should be arriving in the next 1-2 weeks [19:41:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10865668 (10VRiley-WMF) 05Open→03In progress Running the decom cookbook on lvs1016 soon [19:42:01] (03PS1) 10Bking: elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) [19:42:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [19:42:17] !log aokoth@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10 days, 0:00:00 on doc1003.eqiad.wmnet with reason: Bookworm Migration [19:43:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [19:43:19] (03PS2) 10Bking: elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) [19:43:45] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10865682 (10Marostegui) This is great news! Thank you! 
[19:45:59] (03PS3) 10Bking: elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) [19:46:52] (03PS4) 10Bking: elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) [19:46:57] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [19:54:35] (03PS1) 10Eevans: cassandra-dev2003: configure instances for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1151785 (https://phabricator.wikimedia.org/T391544) [19:56:29] (03CR) 10Eevans: [C:03+2] cassandra-dev2003: configure instances for JBOD [puppet] - 10https://gerrit.wikimedia.org/r/1151785 (https://phabricator.wikimedia.org/T391544) (owner: 10Eevans) [19:57:58] (03PS3) 10Bvibber: Enable Lua transform switch for Charts on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151747 (https://phabricator.wikimedia.org/T388616) [19:59:41] !log eevans@cumin1002 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS bullseye [19:59:54] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 13Patch-For-Review, and 2 others: Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10865712 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1002 for host cassandra-dev2003.... [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T2000). [20:00:05] grey-olson, tzatziki, and arlolra: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:19] Here! 
[20:00:51] present :) [20:01:07] o/ mind if I try spiderpig for the first config change? (on behalf of grey-olson) [20:02:12] here [20:02:27] (03PS4) 10Bvibber: Enable Lua transform switch for Charts on test and beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151747 (https://phabricator.wikimedia.org/T395516) [20:03:15] arlola: Any objection? [20:03:31] Go for it [20:03:42] dbrant: have at it [20:03:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dbrant@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151299 (owner: 10GOlson) [20:04:04] I was also looking to try spiderpig with my patch [20:05:00] (03Merged) 10jenkins-bot: App Interaction:: Add Tabs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1151299 (owner: 10GOlson) [20:05:25] !log dbrant@deploy1003 Started scap sync-world: Backport for [[gerrit:1151299|App Interaction:: Add Tabs]] [20:06:01] (03PS1) 10Bvibber: Lua transform backend for JsonConfig Data: pages [extensions/JsonConfig] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1151787 (https://phabricator.wikimedia.org/T388434) [20:07:04] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Remove references to role(spare::system) in "Server decommission request" phab template - https://phabricator.wikimedia.org/T395517 (10bking) 03NEW [20:07:15] (03PS1) 10Bvibber: Chart-side support for Lua transforms [extensions/Chart] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1151788 (https://phabricator.wikimedia.org/T388616) [20:07:39] !log dbrant@deploy1003 dbrant, golson-wmf: Backport for [[gerrit:1151299|App Interaction:: Add Tabs]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:07:51] (03PS5) 10Bking: elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) [20:08:23] grey-olson: want to verify on debug? 
[20:09:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10865766 (10VRiley-WMF) lvs1016 new location info A7 U27 CableID: 5081 port 27 [20:09:44] dbrant LGTM! [20:09:52] !log dbrant@deploy1003 dbrant, golson-wmf: Continuing with sync [20:12:21] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [20:13:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:13:23] (03CR) 10Ahmon Dancy: [C:03+1] deployment:fix-staging-perm: update fix-staging-perms [puppet] - 10https://gerrit.wikimedia.org/r/1151753 (https://phabricator.wikimedia.org/T276994) (owner: 10Effie Mouzeli) [20:13:53] (50%) [20:15:25] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [20:17:03] !log dbrant@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151299|App Interaction:: Add Tabs]] (duration: 11m 38s) [20:17:53] alright! spiderpig ftw. Will hand it off to someone else to proceed :) [20:18:10] You're still expected to coordinate with others here. [20:18:49] So arlolra, you're up [20:19:02] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage [20:19:07] Ok, on it [20:19:23] whoops, roger that [20:19:33] Oh [20:19:53] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10865796 (10VRiley-WMF) As per instructions, the following has been completed. 1. Run decom cookbook for lvs1016 2. Physically move lvs1016 to rack A7 3. Connect lvs1016 primary 10G por... 
[20:20:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [20:20:48] (03Merged) 10jenkins-bot: Remove $wgParserEnableLegacyMediaDOM option [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148950 (https://phabricator.wikimedia.org/T394054) (owner: 10Arlolra) [20:21:10] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1148950|Remove $wgParserEnableLegacyMediaDOM option (T394054)]] [20:21:15] T394054: Remove $wgParserEnableLegacyMediaDOM option to disable new media HTML - https://phabricator.wikimedia.org/T394054 [20:22:21] (03CR) 10Aleksandar Mastilovic: [C:03+1] airflow: Stop the airflow services on an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/1151769 (https://phabricator.wikimedia.org/T395495) (owner: 10Btullis) [20:23:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:23:21] !log arlolra@deploy1003 arlolra: Backport for [[gerrit:1148950|Remove $wgParserEnableLegacyMediaDOM option (T394054)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:24:47] !log arlolra@deploy1003 arlolra: Continuing with sync [20:25:02] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10Phabricator: Remove references to role(spare::system) in "Server decommission request" phab template - https://phabricator.wikimedia.org/T395517#10865807 (10A_smart_kitten) [20:28:26] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations, 10Phabricator: Remove references to role(spare::system) in "Server decommission request" phab template - https://phabricator.wikimedia.org/T395517#10865820 (10RobH) 05Open→03Resolved a:03RobH Removed that line entirely, now its just: [] - any... 
[20:31:39] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1148950|Remove $wgParserEnableLegacyMediaDOM option (T394054)]] (duration: 10m 28s) [20:31:44] T394054: Remove $wgParserEnableLegacyMediaDOM option to disable new media HTML - https://phabricator.wikimedia.org/T394054 [20:33:25] (03PS2) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [20:33:30] tzatziki: Did you want me to try deploying your patches as well? [20:33:44] arlolra: if you're able! they are already merged [20:34:19] (I very rarely do deploys etc so apologies if I screw up the lingo :) ) [20:34:25] (03CR) 10Andrea Denisse: "Excellent catch, thank you! I've sent a new patch including a logrotate rule." [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [20:34:40] dancy: ok if do them? [20:34:43] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T395518 (10phaultfinder) 03NEW [20:34:55] Absolutely! 
[20:35:08] I can do them both at once, right [20:35:46] Yes [20:36:29] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1151740 (https://phabricator.wikimedia.org/T395386) (owner: 10Foks) [20:36:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by arlolra@deploy1003 using scap backport" [extensions/SecurePoll] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1151741 (https://phabricator.wikimedia.org/T395386) (owner: 10Foks) [20:37:57] (03Merged) 10jenkins-bot: SecurePoll: Adding files for U4C vote 2025 [extensions/SecurePoll] (wmf/1.45.0-wmf.2) - 10https://gerrit.wikimedia.org/r/1151740 (https://phabricator.wikimedia.org/T395386) (owner: 10Foks) [20:37:58] (03Merged) 10jenkins-bot: SecurePoll: Adding files for U4C vote 2025 [extensions/SecurePoll] (wmf/1.45.0-wmf.3) - 10https://gerrit.wikimedia.org/r/1151741 (https://phabricator.wikimedia.org/T395386) (owner: 10Foks) [20:38:23] !log arlolra@deploy1003 Started scap sync-world: Backport for [[gerrit:1151740|SecurePoll: Adding files for U4C vote 2025 (T395386)]], [[gerrit:1151741|SecurePoll: Adding files for U4C vote 2025 (T395386)]] [20:38:28] T395386: Create and run 2025 U4C election - https://phabricator.wikimedia.org/T395386 [20:40:31] !log arlolra@deploy1003 foks, arlolra: Backport for [[gerrit:1151740|SecurePoll: Adding files for U4C vote 2025 (T395386)]], [[gerrit:1151741|SecurePoll: Adding files for U4C vote 2025 (T395386)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:41:12] tzatziki: Do you want to test? 
[20:41:40] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS bullseye [20:41:48] 06SRE-OnFire, 10Cassandra, 10MediaWiki-Platform-Team (Radar), 07Security, 10Sustainability (Incident Followup): Increase sessionstore storage capacity - https://phabricator.wikimedia.org/T391544#10865864 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1002 for host cass... [20:42:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [20:42:48] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 209828648 and 26 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:43:48] (03PS3) 10Andrea Denisse: centrallog: Add a temporary rsyslog debug config file [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) [20:44:16] tzatziki: hello? [20:44:25] arlolra: sorry, bathroom :) [20:44:30] will look [20:44:34] ah [20:44:41] thanks [20:44:47] I see the files on the server, so I think we are all good [20:44:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 109224 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:46:12] afk [20:46:42] !log arlolra@deploy1003 foks, arlolra: Continuing with sync [20:47:02] thanks! 
[20:50:48] (03PS1) 10Cathal Mooney: New function to generate device-specific IBGP data from cluster YAML [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1151793 (https://phabricator.wikimedia.org/T394530) [20:53:41] !log arlolra@deploy1003 Finished scap sync-world: Backport for [[gerrit:1151740|SecurePoll: Adding files for U4C vote 2025 (T395386)]], [[gerrit:1151741|SecurePoll: Adding files for U4C vote 2025 (T395386)]] (duration: 15m 18s) [20:53:46] T395386: Create and run 2025 U4C election - https://phabricator.wikimedia.org/T395386 [20:54:23] Guess we're all done with the backport window [20:55:36] Awesome :) [20:56:12] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1151386/5710/" [puppet] - 10https://gerrit.wikimedia.org/r/1151386 (https://phabricator.wikimedia.org/T383309) (owner: 10Andrea Denisse) [21:00:05] Deploy window Wikifunctions Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T2100) [21:08:00] (03CR) 10Ryan Kemper: [C:03+1] elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [21:08:41] (03CR) 10Bking: [C:03+2] elastic/cirrussearch: prepare hosts for decom [puppet] - 10https://gerrit.wikimedia.org/r/1151784 (https://phabricator.wikimedia.org/T394350) (owner: 10Bking) [21:09:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:09:13] (03CR) 10Ryan Kemper: [C:03+2] wdqs: nuke previously absented pyrra update lag [puppet] - 10https://gerrit.wikimedia.org/r/1148979 (https://phabricator.wikimedia.org/T393966) (owner: 10Ryan Kemper) [21:13:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:17:42] (03PS1) 10Ryan Kemper: sre.elasticsearch: remove unused cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1151797 (https://phabricator.wikimedia.org/T261239) [21:22:33] (03CR) 10Ryan Kemper: [C:03+2] "Addressed broken cookbooks in https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1151797" [cookbooks] - 10https://gerrit.wikimedia.org/r/603731 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [21:23:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.05.24 - 2025.06.13): Q2:rack/setup/install an-worker11[78-86] - https://phabricator.wikimedia.org/T377878#10865953 (10VRiley-WMF) Re-racked an-worker1185 to F8 U3 cableID 20220264 Port 14 @Jclark-ctr by any chance, would you be able to take a look... [21:24:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:26:37] (03CR) 10Bking: [C:03+1] sre.elasticsearch: remove unused cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1151797 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [21:27:21] (03CR) 10Ryan Kemper: [C:03+2] sre.elasticsearch: remove unused cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1151797 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [21:27:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [21:33:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:33:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1061-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:33:41] (03Merged) 10jenkins-bot: sre.elasticsearch: remove unused cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1151797 (https://phabricator.wikimedia.org/T261239) (owner: 10Ryan Kemper) [21:37:03] !log ryankemper@cumin2002 START - Cookbook sre.hosts.decommission for hosts elastic1054.eqiad.wmnet [21:42:12] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [21:42:44] !log ryankemper@cumin2002 START - Cookbook sre.dns.netbox [21:43:37] !log bking@cumin2002 START - Cookbook sre.hosts.decommission for hosts cirrussearch1053.eqiad.wmnet [21:48:20] ryankemper@cumin2002 decommission (PID 3982565) is awaiting input [21:48:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch1061-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [21:48:59] !log ryankemper@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1054.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" 
[21:51:18] !log bking@cumin2002 START - Cookbook sre.dns.netbox [21:51:38] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: elastic1054.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin2002" [21:51:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:51:39] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts elastic1054.eqiad.wmnet [21:53:59] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:54:00] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cirrussearch1053.eqiad.wmnet [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250528T2200) [22:03:34] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [22:09:34] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [22:12:34] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. 
Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [22:18:45] FIRING: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [22:27:45] (PS1) Jasmine: wikikube: decommission wikikube-worker103[23].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1151808 (https://phabricator.wikimedia.org/T383227) [22:43:25] FIRING: SystemdUnitFailed: update-tails-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:51:16] (PS4) Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects [puppet] - https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) [22:52:28] (CR) Dzahn: mediawiki/apache: redirect tj.*.org to tg.*.org for all projects (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1148981 (https://phabricator.wikimedia.org/T393803) (owner: Dzahn) [22:55:07] (Abandoned) Dzahn: lists: add parameter and code to block abusers using nftables [puppet] - https://gerrit.wikimedia.org/r/1148433 (https://phabricator.wikimedia.org/T394519) (owner: Dzahn) [22:56:22] (PS2) Jasmine: wikikube: decommission wikikube-worker103[23].eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/1151808 (https://phabricator.wikimedia.org/T383227) [22:59:31] !log brennen@deploy1003 Started deploy [phabricator/deployment@99aa712]: test deploy to phab1005 for T377889 [22:59:36] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [23:03:46] !log brennen@deploy1003 deploy aborted: test deploy to phab1005 for T377889 (duration: 04m 14s) [23:03:59] !log 
brennen@deploy1003 Started deploy [phabricator/deployment@99aa712]: test deploy to phab1005 for T377889 [23:04:57] Currently deploying a security patch BTW that was not applied to the newer versions for some reason [23:05:18] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [23:07:16] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 29763 bytes in 7.581 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [23:08:34] PROBLEM - MariaDB memory on es1035 is CRITICAL: CRIT Memory 95% used. Largest process: mysqld (1614) = 93.6% https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [23:08:37] !log brennen@deploy1003 Finished deploy [phabricator/deployment@99aa712]: test deploy to phab1005 for T377889 (duration: 04m 38s) [23:08:42] T377889: install a service on phab1005 - https://phabricator.wikimedia.org/T377889 [23:10:10] !log dreamyjazz Deployed security patch for T394693 [23:10:15] T394693: Special:CheckUser has i18n XSS vectors - https://phabricator.wikimedia.org/T394693 [23:11:18] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [23:14:24] (PS1) Jdlrobson: Enable Minerva typeahead search on beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/1151812 (https://phabricator.wikimedia.org/T380510) [23:15:12] FIRING: JobUnavailable: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:20:08] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: 
HTTP/1.1 200 OK - 29762 bytes in 1.065 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [23:22:22] !log dreamyjazz Deployed security patch for T394693 [23:22:27] T394693: Special:CheckUser has i18n XSS vectors - https://phabricator.wikimedia.org/T394693 [23:32:42] (PS1) Jdlrobson: Fixes issues with recommendations config in production [mediawiki-config] - https://gerrit.wikimedia.org/r/1151813 (https://phabricator.wikimedia.org/T393943) [23:33:44] RESOLVED: SwiftLowObjectAvailability: Swift eqiad object availability low - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=8&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftLowObjectAvailability [23:37:07] (CR) Jdlrobson: [C:+1] Chart-side support for Lua transforms [extensions/Chart] (wmf/1.45.0-wmf.3) - https://gerrit.wikimedia.org/r/1151788 (https://phabricator.wikimedia.org/T388616) (owner: Bvibber) [23:38:43] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1151814 [23:38:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1151814 (owner: TrainBranchBot) [23:44:04] !log dreamyjazz Deployed security patch for T394700 [23:44:21] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [23:47:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [23:49:32] !log jhancock@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host sretest2003.codfw.wmnet with OS bookworm [23:49:36] ops-codfw, SRE, DC-Ops: Q4:rack/setup/install Dell Config H 1P Test Host - https://phabricator.wikimedia.org/T393042#10866269 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sretest2003.codfw.wmnet with OS bookworm executed with errors: - sretest2003 (... [23:50:59] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1151814 (owner: TrainBranchBot) [23:52:45] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [23:53:23] !log dreamyjazz Deployed security patch for T394700