[00:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0000)
[00:00:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[00:03:01] (CR) Aklapper: [V:+2 C:+2] Replace backtick operator with shell_exec [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1218364 (owner: Pppery)
[00:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[00:10:19] (PS1) Dzahn: Revert "deployment::server: allow releases hosts encrypted rsync" [puppet] - https://gerrit.wikimedia.org/r/1218383
[00:12:19] (CR) Dzahn: [C:+2] Revert "deployment::server: allow releases hosts encrypted rsync" [puppet] - https://gerrit.wikimedia.org/r/1218383 (owner: Dzahn)
[00:22:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 18.84% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:27:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at codfw: 23.5% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[00:33:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:38:50] (CR) RLazarus: [C:+2] {api,rest}-gateway: Update to Envoy 1.35.7 in production [deployment-charts] - https://gerrit.wikimedia.org/r/1217611 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)
[00:40:06] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385
[00:40:06] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385 (owner: TrainBranchBot)
[00:41:57] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/api-gateway: apply
[00:42:13] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/api-gateway: apply
[00:43:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:43:52] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/api-gateway: apply
[00:44:16] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/api-gateway: apply
[00:45:54] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/api-gateway: apply
[00:46:17] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/api-gateway: apply
[00:48:17] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:48:52] !log rzl@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply
[00:49:08] !log rzl@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply
[00:50:15] !log rzl@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply
[00:50:30] !log rzl@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply
[00:52:49] !log rzl@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply
[00:52:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1218385 (owner: TrainBranchBot)
[00:53:01] !log rzl@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply
[01:00:39] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image
[01:01:54] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 01m 15s)
[01:10:05] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390
[01:10:05] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390 (owner: TrainBranchBot)
[01:10:07] SRE, collaboration-services, MW-on-K8s, serviceops: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858#11462618 (Dzahn) Open→Resolved file transfers to and between releases servers are now encrypted
[01:11:27] SRE, Wikimedia-Mailing-lists: Request for mailing list - Wiki Debates - https://phabricator.wikimedia.org/T412017#11462622 (Dzahn) Hi @Gnangarra what do you think? Do you just want to take over the existing Wikidebate list?
[01:13:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 35357912 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:14:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3199168 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[01:31:13] PROBLEM - dump of matomo in eqiad on backupmon1001 is CRITICAL: Last dump for matomo at eqiad (db1208) taken on 2025-12-16 01:07:57 is 436 MiB, but the previous one was 537 MiB, a change of -18.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:34:40] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1218390 (owner: TrainBranchBot)
[01:55:36] (CR) Ladsgroup: [C:+2] SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:06:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86634 and previous config saved to /var/cache/conftool/dbconfig/20251216-020611-marostegui.json
[02:06:17] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:06:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:07:29] (Merged) jenkins-bot: SpecialLinkSearch: Add a message when domains are being ignored [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218295 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:07:58] (CR) Ladsgroup: [C:+2] SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:09:33] (PS1) TrainBranchBot: Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277)
[02:09:35] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[02:09:35] (CR) Ladsgroup: "With the cherry-pick, it doesn't move the message, it adds to to bottom too :/" [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:09:45] (Abandoned) Ladsgroup: SpecialLinkSearch: Move ignored-domains msg to bottom [core] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218314 (https://phabricator.wikimedia.org/T405005) (owner: Ladsgroup)
[02:11:20] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]]
[02:11:24] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:21:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P86635 and previous config saved to /var/cache/conftool/dbconfig/20251216-022119-marostegui.json
[02:21:28] (PS1) Clare Ming: Update references to `product_metrics` to `test_kitchen` [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906)
[02:22:16] (CR) CI reject: [V:-1] Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: TrainBranchBot)
[02:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[02:36:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P86636 and previous config saved to /var/cache/conftool/dbconfig/20251216-023627-marostegui.json
[02:36:39] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[02:36:44] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:37:31] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[02:50:07] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218295|SpecialLinkSearch: Add a message when domains are being ignored (T405005)]] (duration: 38m 47s)
[02:50:11] T405005: Implement mechanism to exclude a domain from externallinks database (LinkSearch) - https://phabricator.wikimedia.org/T405005
[02:51:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86637 and previous config saved to /var/cache/conftool/dbconfig/20251216-025136-marostegui.json
[02:51:42] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[02:51:44] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[02:51:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2168.codfw.wmnet with reason: Maintenance
[02:52:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86638 and previous config saved to /var/cache/conftool/dbconfig/20251216-025200-marostegui.json
[02:53:07] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 319283440 and 31 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:54:07] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 23752 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[02:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0300)
[03:02:03] (CR) Clare Ming: [C:-2] "need to wait until https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/226 propagates everywhere" [mediawiki-config] - https://gerrit.wikimedia.org/r/1216847 (https://phabricator.wikimedia.org/T407806) (owner: Clare Ming)
[03:18:15] PROBLEM - Host an-druid1005 is DOWN: PING CRITICAL - Packet loss = 20%, RTA = 2746.82 ms
[03:18:55] RECOVERY - Host an-druid1005 is UP: PING OK - Packet loss = 0%, RTA = 21.67 ms
[03:39:31] (CR) Clare Ming: "not sure if we want to update stream names with `product_metrics` in them or not" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218395 (https://phabricator.wikimedia.org/T407906) (owner: Clare Ming)
[03:50:46] FIRING: [2x] Outbound discards: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous deployment/Train deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0400)
[04:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[04:15:46] RESOLVED: Outbound discards: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards
[04:47:10] FIRING: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[04:48:32] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:10] RESOLVED: BFDdown: BFD session down between cr1-codfw and fe80::5e5e:abff:fe3d:8198 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr1-codfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown
[05:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0500)
[05:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:15:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1174.eqiad.wmnet with reason: Maintenance
[05:16:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86639 and previous config saved to /var/cache/conftool/dbconfig/20251216-051607-marostegui.json
[05:16:13] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[05:16:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[05:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn
[06:35:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86640 and previous config saved to /var/cache/conftool/dbconfig/20251216-063525-marostegui.json
[06:35:31] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[06:35:32] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[06:50:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P86641 and previous config saved to /var/cache/conftool/dbconfig/20251216-065033-marostegui.json
[06:55:15] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[06:55:39] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[06:58:33] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0700)
[07:00:04] marostegui, Amir1, and federico3: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0700).
[07:05:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174', diff saved to https://phabricator.wikimedia.org/P86642 and previous config saved to /var/cache/conftool/dbconfig/20251216-070542-marostegui.json
[07:10:13] SRE, SRE-Access-Requests, Patch-For-Review: Yubikey-SSH-FIDO for cdobbins - https://phabricator.wikimedia.org/T412755#11462838 (Marostegui) p:Triage→Medium a:CDobbins I assume you'd take care of this yourself? If you need help from Clinic Duty person let me know!
[07:18:24] (PS1) Marostegui: isntallserver: Do not format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1218652 (https://phabricator.wikimedia.org/T407472)
[07:20:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1174 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86643 and previous config saved to /var/cache/conftool/dbconfig/20251216-072049-marostegui.json
[07:20:55] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[07:20:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[07:21:05] (CR) Marostegui: [C:+2] isntallserver: Do not format /srv on es2028 [puppet] - https://gerrit.wikimedia.org/r/1218652 (https://phabricator.wikimedia.org/T407472) (owner: Marostegui)
[07:21:06] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[07:21:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86644 and previous config saved to /var/cache/conftool/dbconfig/20251216-072114-marostegui.json
[07:22:36] SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11462865 (ABran-WMF) a:ABran-WMF
[07:27:55] (Abandoned) Ayounsi: interface::alias: update define to get prefix len from netmask [puppet] - https://gerrit.wikimedia.org/r/931237 (https://phabricator.wikimedia.org/T336864) (owner: Jbond)
[07:30:27] (CR) Ayounsi: [C:+1] interface::alias: add optional is_service_ip param [puppet] - https://gerrit.wikimedia.org/r/618766 (owner: Volans)
[07:33:08] (CR) Ayounsi: [C:+2] Netbox scripts: remove the scheduling UI [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218210 (owner: Ayounsi)
[07:34:58] (Merged) jenkins-bot: Netbox scripts: remove the scheduling UI [software/netbox-extras] - https://gerrit.wikimedia.org/r/1218210 (owner: Ayounsi)
[07:36:54] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:37:24] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:47:28] (CR) Itamar Givon: [C:+1] Use relative path for "latest" symlinks [dumps] - https://gerrit.wikimedia.org/r/1218317 (https://phabricator.wikimedia.org/T412726) (owner: Jakob)
[07:59:43] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1215549 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:00:05] Amir1, Urbanecm, and awight: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T0800).
[08:00:05] hamishcz and akosiaris: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:02:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86645 and previous config saved to /var/cache/conftool/dbconfig/20251216-080227-marostegui.json
[08:02:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:02:34] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:09:53] (PS1) Muehlenhoff: Remove puppetmaster::backend role from puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1218704 (https://phabricator.wikimedia.org/T365798)
[08:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[08:11:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS bookworm
[08:17:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P86646 and previous config saved to /var/cache/conftool/dbconfig/20251216-081735-marostegui.json
[08:22:41] (CR) Muehlenhoff: [C:+2] Remove puppetmaster::backend role from puppetmaster2002 [puppet] - https://gerrit.wikimedia.org/r/1218704 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:27:56] (CR) Ayounsi: [C:+2] interface::alias: add optional is_service_ip param [puppet] - https://gerrit.wikimedia.org/r/618766 (owner: Volans)
[08:29:37] (PS2) Muehlenhoff: Add Hugh as approver for mw-log-readers and logstash-roots [puppet] - https://gerrit.wikimedia.org/r/1214078 (https://phabricator.wikimedia.org/T276465)
[08:32:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P86647 and previous config saved to /var/cache/conftool/dbconfig/20251216-083243-marostegui.json
[08:33:36] (PS8) Dpogorzelski: ml-build: add docker-pkg [puppet] - https://gerrit.wikimedia.org/r/1218211
[08:33:44] (CR) Dpogorzelski: ml-build: add docker-pkg (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1218211 (owner: Dpogorzelski)
[08:37:08] (PS1) Muehlenhoff: Remove puppetmaster2002 from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/1218706 (https://phabricator.wikimedia.org/T365798)
[08:39:32] (PS1) Dpogorzelski: docker_registry: allow ml-build1001 [puppet] - https://gerrit.wikimedia.org/r/1218707 (https://phabricator.wikimedia.org/T412524)
[08:40:55] (PS1) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[08:41:24] (CR) CI reject: [V:-1] gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[08:42:59] SRE, SRE-Access-Requests: Yubikey-SSH-FIDO for ryankemper - https://phabricator.wikimedia.org/T412126#11462957 (MoritzMuehlenhoff)
[08:43:11] (PS2) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[08:45:04] (CR) CI reject: [V:-1] gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[08:45:25] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 from allowed hosts [puppet] - https://gerrit.wikimedia.org/r/1218706 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[08:46:34] (CR) Dpogorzelski: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1218211 (owner: Dpogorzelski)
[08:47:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86648 and previous config saved to /var/cache/conftool/dbconfig/20251216-084752-marostegui.json
[08:47:58] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[08:47:59] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[08:48:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2182.codfw.wmnet with reason: Maintenance
[08:48:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86649 and previous config saved to /var/cache/conftool/dbconfig/20251216-084817-marostegui.json
[08:48:32] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:50:37] SRE, collaboration-services, Infrastructure-Foundations, Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11462983 (ayounsi)
[08:51:56] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86650 and previous config saved to /var/cache/conftool/dbconfig/20251216-085155-ladsgroup.json
[08:52:00] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[08:53:28] (PS1) Aqu: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990)
[08:55:16] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster2002.codfw.wmnet
[08:58:05] (PS1) Jelto: interface::alias: use wmflib::mask2cidr instead of netmask_to_cidr [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864)
[08:58:46] (PS3) Jelto: gitlab: use real netmask in interface::alias [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[09:00:03] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463009 (MoritzMuehlenhoff)
[09:01:30] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[09:02:02] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7825/co" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:04:52] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:06:03] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7826/console" [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:06:03] (PS1) Muehlenhoff: Remove puppetmaster2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/1218711 (https://phabricator.wikimedia.org/T412783)
[09:07:04] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86651 and previous config saved to /var/cache/conftool/dbconfig/20251216-090704-ladsgroup.json
[09:07:57] jmm@cumin2002 decommission (PID 2673345) is awaiting input
[09:12:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002"
[09:12:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[09:12:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster2002.codfw.wmnet
[09:12:43] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463027 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster2002.codfw.wmnet` - puppetmaster2002....
[09:12:53] (CR) Muehlenhoff: [C:+2] Remove puppetmaster2002 from Puppet [puppet] - https://gerrit.wikimedia.org/r/1218711 (https://phabricator.wikimedia.org/T412783) (owner: Muehlenhoff)
[09:13:51] ops-codfw, DC-Ops, decommission-hardware: decommission puppetmaster2002 - https://phabricator.wikimedia.org/T412783#11463029 (MoritzMuehlenhoff)
[09:19:11] (CR) Elukey: [C:+2] team-sre: avoid cert-expiry alerts for staging endpoints [alerts] - https://gerrit.wikimedia.org/r/1217107 (owner: Elukey)
[09:20:37] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "add primary IP to ps1-e10-eqiad - ayounsi@cumin1003"
[09:21:17] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "add primary IP to ps1-e10-eqiad - ayounsi@cumin1003"
[09:22:00] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie
[09:22:13] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P86652 and previous config saved to /var/cache/conftool/dbconfig/20251216-092212-ladsgroup.json
[09:27:18] (PS2) Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892)
[09:27:28] (CR) Elukey: Pyrra: add the MWH completeness SLO under Data Platform (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: Elukey)
[09:28:18] (PS4) Jelto: gitlab: use real netmask in interface::alias on replicas [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018)
[09:32:05] (CR) Jelto: [V:+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:32:42] (PS2) Muehlenhoff: Remove puppetmaster1003 from active Puppet 5 servers [puppet] - https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798)
[09:34:13] (CR) Ayounsi: [C:+1] "nicely done!" [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:34:37] (CR) Ayounsi: [C:+1] "lgtm! especially as it's a NOOP for now." [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:35:10] (PS1) Fabfur: P::cache::haproxy: enable QOS for video files [puppet] - https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785)
[09:37:12] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: Fabfur)
[09:37:17] (CR) Muehlenhoff: [C:+2] Remove puppetmaster1003 from active Puppet 5 servers [puppet] - https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) (owner: Muehlenhoff)
[09:37:21] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T410589)', diff saved to https://phabricator.wikimedia.org/P86653 and previous config saved to /var/cache/conftool/dbconfig/20251216-093720-ladsgroup.json
[09:37:25] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[09:37:38] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2209.codfw.wmnet with reason: Maintenance
[09:37:46] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2209 (T410589)', diff saved to https://phabricator.wikimedia.org/P86654 and previous config saved to /var/cache/conftool/dbconfig/20251216-093745-ladsgroup.json
[09:39:25] (CR) Jelto: [V:+1 C:+2] interface::alias: use wmflib::mask2cidr instead of netmask_to_cidr [puppet] - https://gerrit.wikimedia.org/r/1218710 (https://phabricator.wikimedia.org/T336864) (owner: Jelto)
[09:39:31] (CR) Jelto: [V:+1 C:+2] gitlab: use real netmask in interface::alias on replicas [puppet] - https://gerrit.wikimedia.org/r/1218708 (https://phabricator.wikimedia.org/T370018) (owner: Jelto)
[09:40:05] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage
[09:41:02] (PS2) Fabfur:
P::cache::haproxy: enable QOS for video files [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) [09:42:01] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [09:43:58] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [09:46:17] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [09:46:48] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463122 (10ops-monitoring-bot) Host gitlab2002.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new... [09:47:42] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463124 (10ayounsi) @Jhancock.wm I'll leave it to you and @RobH to procure the needed equipment. If you prefer a fiber run between the two devi... 
[09:50:03] RESOLVED: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:52:07] (03PS1) 10Muehlenhoff: Remove puppetmaster::backend role from puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) [09:53:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [09:54:09] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [09:55:20] (03CR) 10Hashar: [C:03+2] "The API tests job failed with:" [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [09:55:21] (03PS2) 10Elukey: sre.hosts.provision: fix retry logic for the Supermicro BMC password [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) [09:55:50] (03CR) 10Elukey: "Simplified even more the code, I think that now it looks way better." [cookbooks] - 10https://gerrit.wikimedia.org/r/1218315 (https://phabricator.wikimedia.org/T412458) (owner: 10Elukey) [09:58:20] !log jelto@cumin1003 START - Cookbook sre.hosts.reboot-single for host gitlab1003.wikimedia.org [09:58:53] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463177 (10ops-monitoring-bot) Host gitlab1003.wikimedia.org rebooted by jelto@cumin1003 with reason: maintenance reboot for new... 
[09:59:37] (03Merged) 10jenkins-bot: Branch commit for wmf/1.46.0-wmf.7 [core] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218393 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [10:01:43] (03PS1) 10Tchanders: Add Special:GlobalContributions to no-IP reveal pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) [10:03:05] (03CR) 10Slyngshede: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:03:08] !log Started MediaWiki train task `train-presync`. It did not run overnight due to a CI failure | T408277 [10:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:03:12] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [10:03:45] (03PS1) 10TrainBranchBot: testwikis to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) [10:03:47] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by mwpresync@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [10:04:38] (03Merged) 10jenkins-bot: testwikis to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218720 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [10:04:45] !log jelto@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab1003.wikimedia.org [10:05:03] FIRING: [2x] JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:05:07] !log mwpresync@deploy2002 Started scap sync-world: testwikis to 1.46.0-wmf.7 refs T408277 [10:05:17] !log 
elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [10:08:26] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11463212 (10Jelto) `gitlab2002` and `gitlab1003` have been fixed using the changes above. Before merging the change I manually de... [10:10:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job wmf_gitlab_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:12:06] (03PS1) 10Jelto: gitlab: use real netmask in interface::alias on all hosts [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) [10:15:03] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto) [10:15:47] (03CR) 10Jelto: [V:03+1 C:04-1] "merge after end of year break" [puppet] - 10https://gerrit.wikimedia.org/r/1218723 (https://phabricator.wikimedia.org/T370018) (owner: 10Jelto) [10:21:50] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetmaster::backend role from puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218716 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:24:32] (03PS4) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) [10:25:02] (03CR) 10Cathal Mooney: [C:03+1] "Cool, LGTM! If we roll it out for those hosts we can take a look and see the matches on the network. Thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: 10Fabfur) [10:25:05] (03PS1) 10Hashar: admin: hashar: disable screen startup message [puppet] - 10https://gerrit.wikimedia.org/r/1218725 [10:25:05] (03PS1) 10Hashar: admin: hashar: prepend screen name in PS1 [puppet] - 10https://gerrit.wikimedia.org/r/1218726 [10:25:06] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: PuppetPendingCertificateRequest (instance puppetserver1001:9100) - https://phabricator.wikimedia.org/T412789 (10LSobanski) 03NEW [10:26:40] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:31:56] (03PS1) 10Muehlenhoff: puppetdb: Remove access for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218729 (https://phabricator.wikimedia.org/T365798) [10:32:11] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: mr1-codfw: add second uplink to lsw1-a2-codfw - https://phabricator.wikimedia.org/T410717#11463290 (10cmooney) >>! In T410717#11463123, @ayounsi wrote: > If a copper run is fine, then it's an SFP-T (that you probably have in stock) on... 
[10:32:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es2028.codfw.wmnet with reason: reimage [10:34:18] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [10:35:36] (03CR) 10Arnaudb: [C:03+2] admin: hashar: disable screen startup message [puppet] - 10https://gerrit.wikimedia.org/r/1218725 (owner: 10Hashar) [10:35:47] (03CR) 10Arnaudb: [C:03+2] admin: hashar: prepend screen name in PS1 [puppet] - 10https://gerrit.wikimedia.org/r/1218726 (owner: 10Hashar) [10:37:38] (03CR) 10Muehlenhoff: [C:03+2] puppetdb: Remove access for puppetmaster1003 [puppet] - 10https://gerrit.wikimedia.org/r/1218729 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [10:39:34] (03PS5) 10Tiziano Fogli: icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) [10:40:03] (03CR) 10Tiziano Fogli: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:42:39] (03PS1) 10Elukey: DNM - Reimage: manual stop before reboot [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731 [10:44:20] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [10:44:48] jouncebot: nowandnext [10:44:48] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [10:44:48] In 0 hour(s) and 15 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100) [10:45:38] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [10:45:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) 
(owner: 10Dreamy Jazz) [10:45:42] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) (owner: 10Dreamy Jazz) [10:46:15] (03CR) 10Tiziano Fogli: [C:03+2] icinga/metamonitoring: disable sync_check_icinga_contacts [puppet] - 10https://gerrit.wikimedia.org/r/1218335 (https://phabricator.wikimedia.org/T393625) (owner: 10Tiziano Fogli) [10:46:17] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [10:46:37] (03Merged) 10jenkins-bot: Remove definition of wgGlobalBlockingEnableAutoblocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218342 (https://phabricator.wikimedia.org/T379086) (owner: 10Dreamy Jazz) [10:46:39] (03Merged) 10jenkins-bot: Show global autoblocks in the globalblocks list API response [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218343 (https://phabricator.wikimedia.org/T379087) (owner: 10Dreamy Jazz) [10:49:44] Scap is currently being held by "concurrent prep is locked by mwpresync on Tue Dec 16 10:05:07 2025; reason is "testwikis to 1.46.0-wmf.7 refs T408277"" [10:49:45] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [10:50:19] My understanding is that it normally doesn't take more than a few minutes to move testwikis to the new wiki version, so is there something delaying it? [10:51:29] Dreamy_Jazz: https://sal.toolforge.org/log/VL-dJpsBffdvpiTrGlEr [10:51:49] I am rerunning it yes [10:51:50] concurrent prep is locked by mwpresync (pid 1347261) on Tue Dec 16 10:05:07 2025; reason is "testwikis to 1.46.0-wmf.7 refs T408277". 
[10:51:50] Will wait up to 10 minute(s) for the lock(s) to be released [10:52:01] I had presumed it finished [10:52:23] (or at least it wasn't actively happening because the window seemed free) [10:52:25] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11463359 (10MoritzMuehlenhoff) [10:52:32] it takes a couple hours to run iirc [10:52:39] err [10:52:42] at least an hour [10:53:18] I have started it with `sudo /bin/systemctl start train-presync` [10:53:34] Okay. My config patches were already merged as it seems that the command above doesn't block off scap entirely [10:53:54] I presume the spiderpig job will exit and then at some point later I'll try syncing again [10:54:02] the last entry I had in the log was images being built with output being logged to /srv/mwpresync/scap-image-build-and-push-log [10:54:26] I have been tailing that file and it is at: [10:54:26] 10:09:23 [mediawiki-publish-83] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/.. 
[10:55:20] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=000000026&var-cluster=misc&from=now-3h&to=now&timezone=utc&viewPanel=panel-8 [10:55:37] it is pushing stuff oscillating between 3MB/s and 5MB/s [10:56:04] Yeah, thanks for the graph [10:56:35] the image was created 46 minutes ago and is 9.23GB [10:57:16] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [10:57:19] so there is some network bottleneck either out of the deployment box or on ingress traffic to the image registry [10:57:59] Yeah, at the slower speed it seems about an hour using some back-of-the-envelope math [10:58:05] !log mwpresync@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/m [10:58:05] ediawiki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/media [10:58:06] wiki-staging/scap/image-build' returned non-zero exit status 1. 
(scap version: 4.229.0) (duration: 52m 58s) [10:58:11] But that's presuming the file needs to be copied once [10:58:17] 10:58:05 [mediawiki-publish-83] received unexpected HTTP status: 500 Internal Server Error [10:58:17] :-( [10:58:22] :( [10:58:25] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [10:58:53] DOCKER [10:59:06] (03PS1) 10Gmodena: alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) [10:59:08] I presume you are going to retry? [10:59:23] go ahead and backport your patch :] [10:59:25] (03CR) 10MVernon: "This looks plausible to me; when it comes to deployment, do we want to merge this on a depooled proxy first to check all is good, or are y" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [10:59:35] (03CR) 10CI reject: [V:04-1] alertmanager: onboard wikidata platform. [puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena) [10:59:36] I am going to brew a coffee and will resume the train sync once you are done [10:59:45] Okay. Backporting now. Thanks [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100) [11:00:44] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]] [11:00:49] T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086 [11:00:49] T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087 [11:01:17] (03PS2) 10Gmodena: alertmanager: onboard wikidata platform. 
[puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) [11:04:11] !log urbanecm@deploy2002 mwscript-k8s job started: GrowthExperiments:fixLinkRecommendationData --wiki=itwiki --dry-run --search-index --db-table # T412040-fix-dryrun-02 [11:04:15] T412040: Add a Link: repopulate "Add a Link" suggestions for itwiki - https://phabricator.wikimedia.org/T412040 [11:06:46] (03PS3) 10Elukey: Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) [11:07:36] * hashar grabs a coffee [11:10:23] k8s image build and push is taking longer than normal which is unexpected because my config patches did not affect i18n. I expect this is because the last push as part of the mwpresync failed? [11:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:11:41] I wonder if the same speed restriction is being seen for this build? 
[11:15:29] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [11:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:18:00] (03CR) 10Marco Fossati: [C:03+1] Decommission Article Summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: 10Kimberly Sarabia) [11:18:19] Dreamy_Jazz: oh yeah my bad sorry [11:18:29] I imagine scap might indeed attempt to push the images :/ [11:18:33] (03CR) 10Btullis: postgresql-airflow-main: Increase pgbouncer pool size (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu) [11:18:42] I am dumb I forgot :/ [11:19:01] Yeah the build-and-push-log last has an entry at 11:02 [11:19:18] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=deploy2002&var-datasource=000000026&var-cluster=misc&from=now-1h&to=now&timezone=utc&viewPanel=panel-8 [11:19:32] so yeah sorry I have passed to you the hot potato of pushing stuff [11:19:34] :-\ [11:19:35] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [11:19:39] Yeah, been watching that graph and seeing it do the same thing :D [11:20:01] and I could not manage to find out how to reach the logs for that `docker push` [11:20:45] It kind of feels like the maximum speed is lower than previous attempts to push [11:21:35] (03Abandoned) 10Effie Mouzeli: sre.k8s.pool-depool-node: additional check for control nodes [cookbooks] - 
10https://gerrit.wikimedia.org/r/1211016 (owner: 10Effie Mouzeli) [11:22:57] https://grafana.wikimedia.org/goto/cGn-4EGDR?orgId=1 shows to me that last week's presync went much faster (assuming that is what the activity at 04:30 is) [11:22:58] !log marostegui@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host es2028.codfw.wmnet with OS trixie [11:29:56] (03PS1) 10Elukey: admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737 [11:32:03] the slow times may be related to pushing the image layers to swift, we should really start trying the ceph-based backend for /restricted [11:32:04] Dreamy_Jazz: it usually takes 45 minutes based on https://sal.toolforge.org/production?p=0&q=%22Finished+scap+sync-world%3A+testwikis%22&d= [11:32:31] but it will need more tests, so something not immediate :( [11:33:11] Thanks for the context. I have time to wait and monitor this proceed [11:35:04] elukey@cumin1003 reimage (PID 1159643) is awaiting input [11:39:43] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1029.eqiad.wmnet with OS trixie [11:40:38] !log marostegui@cumin1003 START - Cookbook sre.hosts.provision for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [11:41:53] Flurry of activity in `/var/lib/spiderpig/scap-image-build-and-push-log` [11:42:22] The push-and-build completed successfully, it's now on to the sync-masters step [11:43:07] jouncebot: nowandnext [11:43:07] For the next 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1100) [11:43:07] In 1 hour(s) and 16 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300) [11:43:36] (03PS1) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 
10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [11:43:39] sync-master is going slower than normal, likely because it needs to copy more data like an i18n backport [11:44:31] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [11:44:44] (03CR) 10Cathal Mooney: "Thanks Matthew. I'm 99% sure it'll "Just Work Fine"TM. But similarly if it's easy to depool a host and apply it there first I'd say let'" [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [11:46:28] (03PS2) 10Ayounsi: [WIP] Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) [11:50:52] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[11:50:58] T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086 [11:50:58] T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087 [11:51:00] (03PS1) 10Gmodena: wdqs: register blazegraph with wikidata platform [alerts] - 10https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782) [11:51:20] (03CR) 10Ayounsi: "Tested in Netbox-next" [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [11:51:50] (03PS3) 10Ayounsi: Capirca: only show diff when running in "non-commit" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1218209 (https://phabricator.wikimedia.org/T361549) [11:54:01] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [11:54:31] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1003 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [11:55:00] hashar: It seems my scap backport has also synced testwikis to wmf.7 based on https://versions.toolforge.org/? [11:56:02] Yeah https://test.wikipedia.org/wiki/Special:Version on the debug servers says wmf.7 and not on the debug servers says wmf.5 [11:56:55] So I guess the train should be synced to the testwikis by this change and nothing else would be needed. I can ping you when I'm done if you want to check? 
[11:59:51] (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [12:03:40] (03CR) 10A-pizzata: [C:03+1] Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey) [12:04:01] (03PS1) 10Btullis: Update the spark config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218741 (https://phabricator.wikimedia.org/T410017) [12:05:29] marostegui@cumin1003 provision (PID 1202859) is awaiting input [12:06:09] 06SRE, 10SRE-Access-Requests: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796 (10MatthewVernon) 03NEW [12:07:02] (03PS1) 10Bunnypranav: core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) [12:07:33] (03CR) 10Btullis: [C:03+2] Update the spark config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218741 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:08:23] (03PS1) 10MVernon: admin: add fido-backed key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218744 (https://phabricator.wikimedia.org/T412796) [12:08:39] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218342|Remove definition of wgGlobalBlockingEnableAutoblocks (T379086)]], [[gerrit:1218343|Show global autoblocks in the globalblocks list API response (T379087)]] (duration: 67m 55s) [12:08:45] T379086: Remove wgGlobalBlockingEnableAutoblocks - https://phabricator.wikimedia.org/T379086 [12:08:45] T379087: Remove wgGlobalBlockingHideAutoblocksInGlobalBlocksAPIResponse - https://phabricator.wikimedia.org/T379087 [12:08:54] Proceeding the train made an issue appear for one of PSI teams tools, so will want to backport shortly 
again :D [12:09:39] (03Merged) 10jenkins-bot: Update the spark config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218741 (https://phabricator.wikimedia.org/T410017) (owner: 10Btullis) [12:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:10:59] (03PS1) 10MVernon: admin: add fido-based ssh key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218745 (https://phabricator.wikimedia.org/T412796) [12:11:51] (03CR) 10JMeybohm: [C:03+1] wikikube: Add wikikube-ctrl200[4-5] [puppet] - 10https://gerrit.wikimedia.org/r/1195350 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [12:11:52] (03Abandoned) 10MVernon: admin: add fido-backed key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218744 (https://phabricator.wikimedia.org/T412796) (owner: 10MVernon) [12:12:05] (03CR) 10JMeybohm: [C:03+1] Add wikikube-ctrl2004 and wikikube-ctrl2005 to the etcd-server SRV record [dns] - 10https://gerrit.wikimedia.org/r/1218351 (https://phabricator.wikimedia.org/T390861) (owner: 10Jasmine) [12:12:44] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:12:53] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:14:17] (03PS1) 10Dreamy Jazz: Follow-up: SI: Add "past checks" link next to accounts in table pager [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218747 (https://phabricator.wikimedia.org/T411268) [12:14:29] jouncebot: nowandnext [12:14:29] No deployments scheduled for the next 0 hour(s) and 45 minute(s) [12:14:29] In 0 hour(s) and 45 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300) [12:14:43] (03CR) 
10TrainBranchBot: [C:03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218747 (https://phabricator.wikimedia.org/T411268) (owner: 10Dreamy Jazz) [12:14:47] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) (owner: 10Bunnypranav) [12:15:59] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:17:39] (03CR) 10Marostegui: [C:03+1] "@fceratto@wikimedia.org this is not yet submitted right?" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [12:18:15] (03CR) 10Marostegui: "As agreed during the meeting, let's make this a separate cookbook for now, so we don't alter the existing one." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215575 (https://phabricator.wikimedia.org/T411573) (owner: 10Federico Ceratto) [12:19:08] (03CR) 10Marostegui: [C:03+1] prometheus-mariadb-replication-lag.py: mysql_heartbeat_lag_seconds metric [puppet] - 10https://gerrit.wikimedia.org/r/1217492 (https://phabricator.wikimedia.org/T384810) (owner: 10Federico Ceratto) [12:19:50] (03PS1) 10Btullis: Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) [12:19:58] (03PS2) 10Btullis: Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) [12:19:59] (03CR) 10CI reject: [V:04-1] Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:22:08] (03PS2) 10Aqu: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) [12:23:01] (03CR) 10Aqu: "I've removed the duplicated declaration of the value of the number of instances (3)." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu) [12:24:17] (03CR) 10Btullis: [C:03+2] Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:25:58] (03Merged) 10jenkins-bot: Allow the spark serviceaccount to manage secrets within the namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218748 (https://phabricator.wikimedia.org/T406833) (owner: 10Btullis) [12:27:04] !log marostegui@cumin1003 START - Cookbook sre.hosts.provision for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:27:14] (03Merged) 10jenkins-bot: Follow-up: SI: Add "past checks" link next to accounts in table pager [extensions/CheckUser] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218747 (https://phabricator.wikimedia.org/T411268) (owner: 10Dreamy Jazz) [12:27:47] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218747|Follow-up: SI: Add "past checks" link next to accounts in table pager (T411268)]] [12:27:51] T411268: Suggested Investigations: Show link to checkuser log if target has been checked before - https://phabricator.wikimedia.org/T411268 [12:28:18] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [12:28:26] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [12:29:18] (03CR) 10Dpogorzelski: [C:03+2] docker_registry: allow ml-build1001 [puppet] - 10https://gerrit.wikimedia.org/r/1218707 (https://phabricator.wikimedia.org/T412524) (owner: 10Dpogorzelski) [12:31:49] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1218747|Follow-up: SI: Add "past checks" link next to accounts in table pager (T411268)]] synced to the testservers (see 
https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [12:32:27] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [12:32:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11463708 (10Marostegui) p:05Triage→03Medium @MatthewVernon I guess you'll handle this yourself? I can verify the ssh key out of band if you need help from clinic duty. [12:34:08] !log marostegui@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es2028.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [12:34:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11463714 (10MatthewVernon) @Marostegui I think @MoritzMuehlenhoff wanted to verify the new pubkey, so I'll tag him as reviewer on the CR. [12:35:15] !log marostegui@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [12:35:28] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [12:35:42] 06SRE, 10decommission-hardware: decommission puppetmaster1003 - https://phabricator.wikimedia.org/T412800#11463729 (10MoritzMuehlenhoff) [12:36:12] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11463731 (10Marostegui) Sounds good! Let me know if you need any help from me as I am on clinic duty this week. 
[12:38:35] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218747|Follow-up: SI: Add "past checks" link next to accounts in table pager (T411268)]] (duration: 10m 47s) [12:38:35] !log jmm@cumin2002 START - Cookbook sre.hosts.decommission for hosts puppetmaster1003.eqiad.wmnet [12:38:39] T411268: Suggested Investigations: Show link to checkuser log if target has been checked before - https://phabricator.wikimedia.org/T411268 [12:40:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1218745 (https://phabricator.wikimedia.org/T412796) (owner: 10MVernon) [12:41:42] (03CR) 10MVernon: [C:03+2] admin: add fido-based ssh key for mvernon [puppet] - 10https://gerrit.wikimedia.org/r/1218745 (https://phabricator.wikimedia.org/T412796) (owner: 10MVernon) [12:43:01] (03CR) 10Btullis: [C:03+2] postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu) [12:45:00] (03Merged) 10jenkins-bot: postgresql-airflow-main: Increase pgbouncer pool size [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218709 (https://phabricator.wikimedia.org/T411990) (owner: 10Aqu) [12:45:52] !log jmm@cumin2002 START - Cookbook sre.dns.netbox [12:48:32] FIRING: [8x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:51:11] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [12:51:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [12:51:33] jmm@cumin2002 decommission (PID 2783760) is awaiting input [12:52:13] !log jmm@cumin2002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:53:07] 06SRE: Migrate ipblocks from fetch_external_clouds_vendors_nets.py to HIDDENPARMA - https://phabricator.wikimedia.org/T412805 (10JMeybohm) 03NEW [12:55:19] jmm@cumin2002 decommission (PID 2783760) is awaiting input [12:55:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetmaster1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin2002" [12:55:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [12:55:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetmaster1003.eqiad.wmnet [12:55:59] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#11463871 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin2002 for hosts: `puppetmaster1003.eqiad.wmnet` - puppetmaster1003.... [12:57:11] (03PS1) 10Muehlenhoff: remove puppetmaster1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1218754 (https://phabricator.wikimedia.org/T412800) [12:58:13] (03PS1) 10Clément Goubert: team-sre/mw-cron: Improve dashboard and description [alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) [12:58:48] (03CR) 10Urbanecm: "Thank you for the changes! I just have one last question about this, otherwise, this looks good to me." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1300) [13:01:30] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11463897 (10JMeybohm) 05Open→03Resolved a:03JMeybohm With {T352245} resolved, this has now been completed. [13:01:39] !log fix network configuration and reboot cloudcephosd1052 - T399180 [13:01:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:01:43] T399180: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180 [13:02:24] (03PS2) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [13:03:09] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807 (10cmooney) 03NEW p:05Triage→03Medium [13:03:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:03:22] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11463921 (10cmooney) [13:05:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish) [13:05:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the 
[Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish) [13:06:23] !log disable puppet on O:swift::proxy [13:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:05] 06SRE, 10MinT, 10Prod-Kubernetes, 06serviceops, and 3 others: machinetranslation eqiad pods in state ContainerStatusUnknown - https://phabricator.wikimedia.org/T411058#11463940 (10Nikerabbit) See also {T386371} which mentions that one pod uses more memory than others. [13:07:40] (03CR) 10Dpogorzelski: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [13:08:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:09:23] (03CR) 10Cathal Mooney: [C:03+2] Swift-proxy: set DSCP on outbound packets to AF41 for network QoS [puppet] - 10https://gerrit.wikimedia.org/r/1218347 (owner: 10Cathal Mooney) [13:09:52] (03CR) 10Dpogorzelski: [C:03+2] ml-build: add docker-pkg [puppet] - 10https://gerrit.wikimedia.org/r/1218211 (owner: 10Dpogorzelski) [13:14:25] !log depool ms-fe1010 for testing [13:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:20] (03PS1) 10Sbisson: CX3 Build 1.0.0+20251215 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) [13:15:55] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 13Patch-For-Review: gitlab2002: wrong network for public IPV4 and IPV6 - https://phabricator.wikimedia.org/T370018#11464044 (10ayounsi) a:05cmooney→03ayounsi [13:15:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 
UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) (owner: 10Sbisson) [13:16:48] (03PS1) 10Dreamy Jazz: Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) [13:17:30] (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [13:17:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deplo" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) (owner: 10Dreamy Jazz) [13:19:02] (03CR) 10Muehlenhoff: [C:03+2] remove puppetmaster1003 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1218754 (https://phabricator.wikimedia.org/T412800) (owner: 10Muehlenhoff) [13:20:13] (03PS3) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [13:20:23] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11464054 (10fgiunchedi) >>! In T399180#11432250, @cmooney wrote: >>>! In T399180#11432052, @fgiunchedi wrote: >> I think the easiest would be to: >> >> * Remove the spuri... 
[13:20:26] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission puppetmaster1003 - https://phabricator.wikimedia.org/T412800#11464055 (10MoritzMuehlenhoff) [13:21:21] 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11464058 (10fgiunchedi) JFYI we can now proceed with cloudcephosd1052 too [13:23:29] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11464079 (10MoritzMuehlenhoff) 05Resolved→03Open The various certs still need to be cleaned out, reopening [13:24:37] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet'] [13:25:16] (03CR) 10Elukey: [C:03+2] Pyrra: add the MWH completeness SLO under Data Platform [puppet] - 10https://gerrit.wikimedia.org/r/1218226 (https://phabricator.wikimedia.org/T401892) (owner: 10Elukey) [13:29:45] !log repool ms-fe1010 [13:29:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:28] !log enable puppet on O:swift::proxy [13:30:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:34:32] (03PS4) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [13:34:36] (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [13:34:47] jmm@cumin2002 upgrade-firmware (PID 2805516) is awaiting input [13:35:54] !log jmm@cumin2002 END (FAIL) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es2028.codfw.wmnet'] [13:36:05] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet'] [13:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:41:05] (03PS2) 10Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731 [13:43:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2028.codfw.wmnet'] [13:43:35] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet'] [13:44:15] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS trixie [13:48:51] (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [13:49:59] (03CR) 10Mszwarc: "Funny thing... This patch causes temp. accounts on GC lose their background (but not outline): https://phabricator.wikimedia.org/F71089735" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders) [13:50:19] 06SRE, 06Data-Persistence: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11464220 (10MatthewVernon) 05Open→03Stalled a:03MatthewVernon [13:50:36] 06SRE, 06Infrastructure-Foundations, 10netops: InboundInterfaceErrors alerts firing for Nokia switches on v25.10.1 - https://phabricator.wikimedia.org/T412733#11464225 (10ayounsi) a:03Papaul @Papaul would you be ok to work with Nokia's support to figure out what those inbound errors mean ? 
Thanks [13:50:39] 06SRE, 06Data-Persistence: Add FIDO ssh key for mvernon - https://phabricator.wikimedia.org/T412796#11464228 (10MatthewVernon) Reassigning to myself to do the clearup of the software-key in due course. [13:50:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['es2028.codfw.wmnet'] [13:51:49] (03PS5) 10Ayounsi: Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [13:52:45] (03CR) 10Krinkle: [C:03+1] Remove LoggedOut cookie logic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza) [13:53:00] (03CR) 10Krinkle: [C:03+1] Remove LoggedOut cookie handling [puppet] - 10https://gerrit.wikimedia.org/r/1217774 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza) [13:54:20] Dreamy_Jazz: thanks for the update, sorry I went out for lunch! I'll check the train status [13:55:35] !log marostegui@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [13:56:15] (03CR) 10Mszwarc: "This also happens in the current situation when you visit GC, but have no permissions to do IP Reveal – e.g., going to https://meta.wikime" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders) [13:56:47] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet'] [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1400) [14:00:04] Bunnypranav, hamishcz, stephanebisson, and Dreamy_Jazz: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86657 and previous config saved to /var/cache/conftool/dbconfig/20251216-140008-marostegui.json [14:00:11] o/ [14:00:13] i'm here :) [14:00:14] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:00:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:00:52] o/ [14:02:00] 06SRE, 06serviceops, 07Epic, 13Patch-For-Review: Phase out cergen for ServiceOps services - https://phabricator.wikimedia.org/T360636#11464299 (10JMeybohm) a:05JMeybohm→03MoritzMuehlenhoff Thanks for volunteering to remove the remaining certs and cergen config during your January cleanup [14:02:04] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [14:02:08] bunnypranav can you deploy your change or do you need a deployer to do it? [14:02:27] I will need a deployer [14:03:14] bunnypranav are you able to test it during the deployment? 
[14:03:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:03:33] stephanebisson yes I can test it [14:03:51] bunnypranav ok, I'll deploy it for you [14:04:07] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['es2028.codfw.wmnet'] [14:04:10] !log ayounsi@cumin1003 START - Cookbook sre.hosts.provision for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:04:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) (owner: 10Bunnypranav) [14:04:17] \o [14:04:29] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es2028.codfw.wmnet'] [14:04:59] (03Merged) 10jenkins-bot: core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218743 (https://phabricator.wikimedia.org/T412791) (owner: 10Bunnypranav) [14:05:20] (03CR) 10Muehlenhoff: [C:03+2] proton: Bump image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218290 (owner: 10Muehlenhoff) [14:05:22] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2003.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:05:32] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218743|core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki (T412791)]] [14:05:36] T412791: jawiki: Add abusefilter-access-protected-vars to temporary-account-viewer - https://phabricator.wikimedia.org/T412791 [14:05:51] 
stephanebisson: Thanks for the help! [14:06:35] !log jmm@deploy2002 helmfile [staging] START helmfile.d/services/proton: apply [14:06:52] (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [14:07:46] !log sbisson@deploy2002 bunnypranav, sbisson: Backport for [[gerrit:1218743|core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki (T412791)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:07:52] !log ayounsi@cumin1003 START - Cookbook sre.hosts.reboot-single for host sretest2003.codfw.wmnet [14:07:53] testing [14:08:04] !log jmm@deploy2002 helmfile [staging] DONE helmfile.d/services/proton: apply [14:08:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:08:34] stephanebisson: All good, works as intended! 
[14:08:44] !log sbisson@deploy2002 bunnypranav, sbisson: Continuing with sync [14:09:04] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [14:10:01] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['es2028.codfw.wmnet'] [14:10:29] (03CR) 10Ayounsi: "recheck" [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [14:10:48] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest2003.codfw.wmnet [14:10:55] (03CR) 10Mszwarc: "Reported as: https://phabricator.wikimedia.org/T412823" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218719 (https://phabricator.wikimedia.org/T412530) (owner: 10Tchanders) [14:11:33] (03PS1) 10Dpogorzelski: ml-build: add missing step [puppet] - 10https://gerrit.wikimedia.org/r/1218771 [14:11:47] (03CR) 10Dpogorzelski: [C:03+2] ml-build: add missing step [puppet] - 10https://gerrit.wikimedia.org/r/1218771 (owner: 10Dpogorzelski) [14:11:49] (03CR) 10Dpogorzelski: [V:03+2 C:03+2] ml-build: add missing step [puppet] - 10https://gerrit.wikimedia.org/r/1218771 (owner: 10Dpogorzelski) [14:11:52] !log jmm@deploy2002 helmfile [codfw] START helmfile.d/services/proton: apply [14:12:47] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218743|core-Permission: Add abusefilter-access-protected-vars to temporary-account-viewer in jawiki (T412791)]] (duration: 07m 15s) [14:12:51] T412791: jawiki: Add abusefilter-access-protected-vars to temporary-account-viewer - https://phabricator.wikimedia.org/T412791 [14:13:09] over to you hamishcz [14:13:11] !log jmm@deploy2002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:13:17] :) [14:13:36] awaiting for testing/ [14:13:37] stephanebisson: Thank you for the quick assistance! 
[14:14:57] hamishcz are you deploying it yourself? [14:15:09] nah i cant do that [14:15:12] need your help [14:15:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P86658 and previous config saved to /var/cache/conftool/dbconfig/20251216-141517-marostegui.json [14:15:26] hamishcz ok, I'll help you [14:15:27] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464375 (10MoritzMuehlenhoff) JFTR, I upgraded firmware and IDRAC in the mean time to the latest releases. [14:15:48] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish) [14:16:38] (03Merged) 10jenkins-bot: zhwiki: enable protection indicators [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218298 (https://phabricator.wikimedia.org/T412710) (owner: 10Hamish) [14:17:08] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218298|zhwiki: enable protection indicators (T412710)]] [14:17:12] T412710: Enable protection indicators for zhwiki - https://phabricator.wikimedia.org/T412710 [14:19:27] !log sbisson@deploy2002 sbisson, hamishz: Backport for [[gerrit:1218298|zhwiki: enable protection indicators (T412710)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:19:38] hamishcz ^ [14:20:27] tested and work as intended [14:21:17] !log sbisson@deploy2002 sbisson, hamishz: Continuing with sync [14:24:24] !log jmm@deploy2002 helmfile [eqiad] START helmfile.d/services/proton: apply [14:24:40] (03CR) 10CI reject: [V:04-1] Capirca: look for the most recent completed run [software/homer] - 10https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: 10Ayounsi) [14:24:53] elukey@cumin1003 reimage (PID 1324133) is awaiting input [14:24:58] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11464421 (10Papaul) @ayounsi what else needs to be done here? [14:25:13] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218298|zhwiki: enable protection indicators (T412710)]] (duration: 08m 05s) [14:25:16] T412710: Enable protection indicators for zhwiki - https://phabricator.wikimedia.org/T412710 [14:25:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by sbisson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish) [14:25:42] hamishcz ^ your other patch [14:26:06] !log jmm@deploy2002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [14:26:08] (03PS3) 10Elukey: DNM - Reimage: dup-uefi after the first puppet run [cookbooks] - 10https://gerrit.wikimedia.org/r/1218731 [14:26:39] (03Merged) 10jenkins-bot: svwiki: lift autoconfirmed setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218302 (https://phabricator.wikimedia.org/T412713) (owner: 10Hamish) [14:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [14:27:12] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218302|svwiki: lift 
autoconfirmed setting (T412713)]] [14:27:15] T412713: Set $wgAutoConfirmCount to 10 for sv.wikipedia - https://phabricator.wikimedia.org/T412713 [14:27:37] (03PS2) 10Elukey: admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737 [14:28:46] (03CR) 10JMeybohm: [C:03+1] admin_ng: bump kartotherian's cpu quotas to have smoother deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218737 (owner: 10Elukey) [14:28:49] this one is not active yet? [14:29:12] !log installing glibc security updates [14:29:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:29:24] hamishcz soon [14:29:30] !log sbisson@deploy2002 sbisson, hamishz: Backport for [[gerrit:1218302|svwiki: lift autoconfirmed setting (T412713)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:30:24] 06SRE, 10CAS-SSO, 06Infrastructure-Foundations, 06Security-Team: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#11464449 (10Arendpieter) [14:30:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P86659 and previous config saved to /var/cache/conftool/dbconfig/20251216-143025-marostegui.json [14:31:31] hamishcz you can test now [14:31:53] 10ops-eqiad, 06SRE, 06DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11464460 (10Eevans) 05Open→03Resolved [14:32:08] gimme a sec [14:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:32:38] (03PS12) 10Daniel Kinzler: rest gateway: add smoke tests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215605 [14:32:39] ah yes good to continue [14:32:45] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86660 and previous config saved to /var/cache/conftool/dbconfig/20251216-143244-marostegui.json [14:32:50] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:32:50] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:33:01] !log sbisson@deploy2002 sbisson, hamishz: Continuing with sync [14:33:52] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wdqs1029.eqiad.wmnet with OS trixie [14:36:51] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Servers exposing incorrect LLDP info - https://phabricator.wikimedia.org/T250367#11464499 (ayounsi) I was working on that as we speak. As sretest2003 was reclaimed to test hosts I was able to run some more tests. Running the still not m... [14:37:01] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218302|svwiki: lift autoconfirmed setting (T412713)]] (duration: 09m 49s) [14:37:05] T412713: Set $wgAutoConfirmCount to 10 for sv.wikipedia - https://phabricator.wikimedia.org/T412713 [14:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:37:33] (CR) TrainBranchBot: [C:+2] "Approved by sbisson@deploy2002 using scap backport" [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) (owner: Sbisson) [14:37:54] (CR) Tiziano Fogli: [C:-1] "Tested on Pontoon: the config file does not pass validation due to the trailing “:” highlighted."
[puppet] - https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: Gmodena) [14:39:23] (Merged) jenkins-bot: CX3 Build 1.0.0+20251215 [extensions/ContentTranslation] (wmf/1.46.0-wmf.5) - https://gerrit.wikimedia.org/r/1218762 (https://phabricator.wikimedia.org/T408842) (owner: Sbisson) [14:39:56] !log sbisson@deploy2002 Started scap sync-world: Backport for [[gerrit:1218762|CX3 Build 1.0.0+20251215 (T408842 T411779)]] [14:40:02] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [14:40:02] T411779: Handle invalid featured collection name - https://phabricator.wikimedia.org/T411779 [14:40:40] (PS6) Daniel Kinzler: rest gateway: split anon class [deployment-charts] - https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [14:40:54] (PS3) Gmodena: alertmanager: onboard wikidata platform. [puppet] - https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) [14:41:40] (CR) Gmodena: alertmanager: onboard wikidata platform. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: Gmodena) [14:41:55] (PS1) Gmodena: wdqs: register blazegraph with wikidata platform [alerts] - https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782) [14:42:15] !log sbisson@deploy2002 sbisson: Backport for [[gerrit:1218762|CX3 Build 1.0.0+20251215 (T408842 T411779)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[14:42:25] (PS1) LorenMora: [Legal Footer] Deploy Legal Footer for Phase 1 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218775 (https://phabricator.wikimedia.org/T412455) [14:43:23] !log sbisson@deploy2002 sbisson: Continuing with sync [14:44:28] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464567 (cmooney) It seems the interface can be set through the [[ https://www.debian.org/releases/trixie/example-preseed.txt | preseed ]] file... [14:45:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86661 and previous config saved to /var/cache/conftool/dbconfig/20251216-144533-marostegui.json [14:45:39] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:45:40] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:45:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2198.codfw.wmnet with reason: Maintenance [14:46:09] (CR) Tiziano Fogli: [C:+1] alertmanager: onboard wikidata platform. [puppet] - https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: Gmodena) [14:46:21] stephanebisson: thanks!
[14:47:24] !log sbisson@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218762|CX3 Build 1.0.0+20251215 (T408842 T411779)]] (duration: 07m 27s) [14:47:29] T408842: Surface nominated collections in Search view - https://phabricator.wikimedia.org/T408842 [14:47:29] T411779: Handle invalid featured collection name - https://phabricator.wikimedia.org/T411779 [14:47:49] over to you Dreamy_Jazz [14:47:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P86662 and previous config saved to /var/cache/conftool/dbconfig/20251216-144752-marostegui.json [14:47:57] Thanks [14:48:08] (PS1) Jsn.sherman: [Moderator tools] Add data-mw-interface in addition to data-mw="interface" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) [14:48:28] (CR) TrainBranchBot: [C:+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) (owner: Dreamy Jazz) [14:48:51] (CR) Fabfur: [C:+2] P::cache::haproxy: enable QOS for video files [puppet] - https://gerrit.wikimedia.org/r/1218712 (https://phabricator.wikimedia.org/T412785) (owner: Fabfur) [14:49:15] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) (owner: Jsn.sherman) [14:49:23] (Merged) jenkins-bot: Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD [mediawiki-config] - https://gerrit.wikimedia.org/r/1218763 (https://phabricator.wikimedia.org/T361173) (owner: Dreamy Jazz) [14:49:52] !log dreamyjazz@deploy2002 Started scap sync-world: Backport for [[gerrit:1218763|Pin
$wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD (T361173)]] [14:49:57] T361173: Add schema migration config for cu_useragent table - https://phabricator.wikimedia.org/T361173 [14:52:07] !log dreamyjazz@deploy2002 dreamyjazz: Backport for [[gerrit:1218763|Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD (T361173)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:52:48] !log dreamyjazz@deploy2002 dreamyjazz: Continuing with sync [14:56:48] !log dreamyjazz@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218763|Pin $wgCheckUserUserAgentTableMigrationStage as SCHEMA_COMPAT_OLD (T361173)]] (duration: 06m 55s) [14:56:52] T361173: Add schema migration config for cu_useragent table - https://phabricator.wikimedia.org/T361173 [14:57:12] !log Afternoon UTC backport window done [14:57:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] (PS1) Btullis: Update spark/hadoop mountpoints and environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833) [14:59:27] (PS2) Btullis: Update spark/hadoop mountpoints and environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833) [15:00:05] Deploy window Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1500) [15:02:36] (CR) Btullis: [C:+2] Update spark/hadoop mountpoints and environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833) (owner: Btullis) [15:03:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181', diff saved to https://phabricator.wikimedia.org/P86663 and previous config saved to /var/cache/conftool/dbconfig/20251216-150301-marostegui.json [15:04:15] (Merged)
jenkins-bot: Update spark/hadoop mountpoints and environment variables [deployment-charts] - https://gerrit.wikimedia.org/r/1218778 (https://phabricator.wikimedia.org/T406833) (owner: Btullis) [15:05:57] jouncebot: nowandnext [15:05:57] For the next 0 hour(s) and 24 minute(s): Test Kitchen UI Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1500) [15:05:57] In 0 hour(s) and 24 minute(s): Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1530) [15:06:03] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/analytics-test: apply [15:06:13] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/analytics-test: apply [15:06:56] Anyone using scap in this window? Want to deploy a private code change [15:08:30] (PS1) Kosta Harlan: hCaptcha: end frwiki A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) [15:10:03] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:09] (PS2) Kosta Harlan: hCaptcha: end frwiki A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) [15:10:12] (CR) Clément Goubert: [C:+1] rest-gateway: support REST sandbox requests [deployment-charts] - https://gerrit.wikimedia.org/r/1207267 (https://phabricator.wikimedia.org/T396807) (owner: Aaron Schulz) [15:13:15] (PS3) Kosta Harlan: hCaptcha: end frwiki A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) [15:13:37] (CR) Dreamy Jazz: [C:+1] hCaptcha: end frwiki A/B test [mediawiki-config] -
https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan) [15:15:06] I'm deploying a config patch [15:15:12] (CR) TrainBranchBot: [C:+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan) [15:16:02] (Merged) jenkins-bot: hCaptcha: end frwiki A/B test [mediawiki-config] - https://gerrit.wikimedia.org/r/1218780 (https://phabricator.wikimedia.org/T405239) (owner: Kosta Harlan) [15:16:34] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1218780|hCaptcha: end frwiki A/B test (T405239)]] [15:16:38] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [15:18:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1181 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86664 and previous config saved to /var/cache/conftool/dbconfig/20251216-151809-marostegui.json [15:18:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:18:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:18:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1191.eqiad.wmnet with reason: Maintenance [15:18:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1191 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86665 and previous config saved to /var/cache/conftool/dbconfig/20251216-151834-marostegui.json [15:18:54] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1218780|hCaptcha: end frwiki A/B test (T405239)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[15:22:37] (PS6) Ayounsi: Capirca: look for the most recent completed run [software/homer] - https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [15:22:50] SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11464876 (ABran-WMF) I've read through the backlog of this task and followed {T411895} to try and figure out how I could move mailman's web interface behi... [15:23:16] ops-eqiad, SRE, DC-Ops: aqs1012 is down - https://phabricator.wikimedia.org/T407414#11464880 (Jclark-ctr) a:Eevans→Jclark-ctr [15:26:00] !log kharlan@deploy2002 kharlan: Continuing with sync [15:26:23] !log cleanup temp files on archiva1002 [15:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:25] (PS1) Bking: bking: Add FIDO-backed SSH key [puppet] - https://gerrit.wikimedia.org/r/1218782 [15:30:00] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218780|hCaptcha: end frwiki A/B test (T405239)]] (duration: 13m 26s) [15:30:04] Deploy window Test Kitchen Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1530) [15:30:05] T405239: hCaptcha: Enable A/B test for frwiki - https://phabricator.wikimedia.org/T405239 [15:31:28] (CR) Jelto: unlink wikipedia25.org from ncredir, point to k8s-ingress (1 comment) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [15:33:35] (CR) Btullis: [C:+1] "Looks good to me. I also asked the user to supply the key for checking via Slack, for an out-of-band identity check."
[puppet] - https://gerrit.wikimedia.org/r/1218782 (owner: Bking) [15:35:03] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:39] (CR) CI reject: [V:-1] Capirca: look for the most recent completed run [software/homer] - https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) (owner: Ayounsi) [15:41:11] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464928 (Jhancock.wm) @cmooney There is nothing plugged into any of the ports on this server except the expected. idrac and the first 1G port.... [15:41:21] (CR) Muehlenhoff: [C:+1] "Looks good and verified out of band" [puppet] - https://gerrit.wikimedia.org/r/1218782 (owner: Bking) [15:41:48] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:42:11] (CR) Bking: [C:+2] bking: Add FIDO-backed SSH key [puppet] - https://gerrit.wikimedia.org/r/1218782 (owner: Bking) [15:42:44] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11464931 (cmooney) >>! In T412807#11464928, @Jhancock.wm wrote: > @cmooney There is nothing plugged into any of the ports on this server except... [15:45:02] !log jmm@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[15:45:20] (PS1) Cathal Mooney: DNS discovery: split responses to magru servers based on rack [puppet] - https://gerrit.wikimedia.org/r/1218784 (https://phabricator.wikimedia.org/T411617) [15:45:52] (PS1) Elukey: setup.py: avoid Sphinx >= 9.x [software/homer] - https://gerrit.wikimedia.org/r/1218785 [15:46:06] !log jmm@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:47:24] (PS7) Ayounsi: Capirca: look for the most recent completed run [software/homer] - https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [15:47:37] !log Restarting CI Jenkins [15:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:58] (PS8) Ayounsi: Capirca: look for the most recent completed run [software/homer] - https://gerrit.wikimedia.org/r/1218739 (https://phabricator.wikimedia.org/T361549) [15:51:07] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission puppetmaster2002 - https://phabricator.wikimedia.org/T412783#11464950 (Jhancock.wm) Open→Resolved a:Jhancock.wm [15:53:54] ops-eqiad, SRE, DC-Ops, decommission-hardware: decommission puppetmaster1003 - https://phabricator.wikimedia.org/T412800#11464981 (Jclark-ctr) Open→Resolved a:Jclark-ctr [15:55:57] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:00:05] jelto, arnoldokoth, mutante, and arnaudb: It is that lovely time of the day again! You are hereby commanded to deploy SRE Collaboration Services office hours. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1600).
[16:01:51] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: Phabricator deploy [16:02:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:02:33] !log jelto@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: Phabricator deploy [16:03:03] !log brennen@deploy2002 Started deploy [phabricator/deployment@3a23687]: deploy phab2002 for T412825 [16:03:07] T412825: Deploy Phab/Phorge 2025-12-16 - https://phabricator.wikimedia.org/T412825 [16:03:34] !log brennen@deploy2002 Finished deploy [phabricator/deployment@3a23687]: deploy phab2002 for T412825 (duration: 00m 31s) [16:03:50] !log brennen@deploy2002 Started deploy [phabricator/deployment@3a23687]: deploy phab1004 for T412825 [16:04:48] !log brennen@deploy2002 Finished deploy [phabricator/deployment@3a23687]: deploy phab1004 for T412825 (duration: 00m 58s) [16:05:57] (PS1) Fabfur: hiera: remove set-tos experiment [puppet] - https://gerrit.wikimedia.org/r/1218788 (https://phabricator.wikimedia.org/T412785) [16:05:57] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:07:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:08:07]
(PS2) Fabfur: hiera: remove set-tos experiment [puppet] - https://gerrit.wikimedia.org/r/1218788 (https://phabricator.wikimedia.org/T412785) [16:10:15] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:14:48] (CR) BCornwall: [C:+1] wikimediafoundation.org: Add AAAA for non-apex records as well [dns] - https://gerrit.wikimedia.org/r/1217582 (https://phabricator.wikimedia.org/T403269) (owner: Majavah) [16:15:47] (CR) BCornwall: [C:+2] wikimediafoundation.org: Add AAAA for non-apex records as well [dns] - https://gerrit.wikimedia.org/r/1217582 (https://phabricator.wikimedia.org/T403269) (owner: Majavah) [16:15:59] !log brett@dns1006 START - running authdns-update [16:17:35] (CR) Fabfur: [C:+2] hiera: remove set-tos experiment [puppet] - https://gerrit.wikimedia.org/r/1218788 (https://phabricator.wikimedia.org/T412785) (owner: Fabfur) [16:18:15] !log brett@dns1006 END - running authdns-update [16:18:25] ops-eqiad, SRE, DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11465060 (RKemper) >>! In T411919#11454698, @Jclark-ctr wrote: > @RKemper I am usually here most mornings early. what day would work best for you next week to down time is...
[16:20:55] SRE, Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11465064 (MoritzMuehlenhoff) [16:25:09] (CR) Dzahn: unlink wikipedia25.org from ncredir, point to k8s-ingress (3 comments) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [16:25:54] (CR) Daniel Kinzler: rest gateway: add smoke tests (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1215605 (owner: Daniel Kinzler) [16:28:16] SRE, LDAP-Access-Requests, Data-Platform-SRE (2025.11.07 - 2025.11.28): Access Admin menu in Airflow - https://phabricator.wikimedia.org/T412119#11465084 (APizzata-WMF) Thanks @BTullis, I can now see the menu! [16:28:52] (CR) Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [16:30:05] (CR) Jelto: unlink wikipedia25.org from ncredir, point to k8s-ingress (1 comment) [dns] - https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [16:31:54] (CR) Jelto: [C:+1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/1216855 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [16:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:32:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86667 and previous config saved to /var/cache/conftool/dbconfig/20251216-163252-marostegui.json [16:32:58] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:32:58] T411164: Drop
rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [16:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:40:53] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:41:25] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: Gergő Tisza) [16:42:43] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:43:07] (CR) Jelto: miscweb: add wikipedia25.org to extra SANs (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn) [16:45:01] RECOVERY - Postfix SMTP on crm2001 is OK: OK - Certificate crm2001.codfw.wmnet will expire on Tue 13 Jan 2026 04:10:00 PM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [16:45:18] (PS1) STran: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1218791 (https://phabricator.wikimedia.org/T412816) [16:47:00] !log installing unbound security updates [16:47:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P86668 and previous config saved to /var/cache/conftool/dbconfig/20251216-164800-marostegui.json [16:48:25] (CR) STran: [C:+2] "self-merging, as ipoid is actively broken" [deployment-charts] - https://gerrit.wikimedia.org/r/1218791 (https://phabricator.wikimedia.org/T412816) (owner: STran) [16:50:33] (Merged) jenkins-bot: ipoid: Bump version [deployment-charts] - https://gerrit.wikimedia.org/r/1218791 (https://phabricator.wikimedia.org/T412816) (owner: STran) [16:51:35] (PS1) Isabelle Hurbain-Palatin: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) [16:52:32] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [16:52:59] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [16:53:31] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [16:53:44] (CR) Elukey: [C:+2] setup.py: avoid Sphinx >= 9.x [software/homer] - https://gerrit.wikimedia.org/r/1218785 (owner: Elukey) [16:54:01] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [16:54:24] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [16:54:46] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other.
Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1700). [17:00:05] tgr: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [17:00:30] o/ [17:00:54] tgr: o/ this looks reasonable to me but because it's a VCL change I'd like to get the traffic team to deploy it [17:01:01] er, tgr_: sorry [17:01:16] !log derick@deploy2002 mwscript-k8s job started: extensions/CentralAuth/maintenance/fixStuckGlobalRename.php --wiki=enwikibooks --logwiki=metawiki Magiuser 'Renamed user f3a49d320a6984a0d6b403d313476916' # T412784 [17:01:20] T412784: Unblock stuck global rename of Renamed user f3a49d320a6984a0d6b403d313476916 - https://phabricator.wikimedia.org/T412784 [17:01:36] sure [17:01:38] will you want to be around to test that live when it goes out? or does it just need a deployer, and we can ship it whenever? [17:01:57] the cookie has not been emitted for years, it seems [17:02:09] and no one seems to be sure what it did in the past [17:02:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:02:18] haha got it [17:02:21] so, nothing to test, I take it :) [17:02:21] so I wouldn't have any idea what to test [17:02:24] cool [17:02:50] thx! 
[17:03:00] in that case let me follow up and get it handled async -- sorry for the extra delay, but you can consider it taken care of [17:03:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191', diff saved to https://phabricator.wikimedia.org/P86669 and previous config saved to /var/cache/conftool/dbconfig/20251216-170308-marostegui.json [17:03:12] if you don't hear anything and it doesn't get done, feel free to follow up with me or with traffic [17:03:22] no worries, it's just cleanup in any case, not time sensitive at all [17:03:25] 👍 [17:07:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:13:04] SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11465357 (Dzahn) In the change merged back in 2024: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1072247/9/hieradata/common/profile/trafficserver/...
[17:14:12] !log cmooney@cumin1003 START - Cookbook sre.hosts.reimage for host es2028.codfw.wmnet with OS trixie [17:14:22] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465381 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie [17:14:23] !log cmooney@cumin1003 START - Cookbook sre.hosts.move-vlan for host es2028 [17:14:47] !log cmooney@cumin1003 START - Cookbook sre.dns.netbox [17:18:05] !log cmooney@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host es2028 - cmooney@cumin1003" [17:18:09] !log cmooney@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host es2028 - cmooney@cumin1003" [17:18:09] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:18:09] !log cmooney@cumin1003 START - Cookbook sre.dns.wipe-cache es2028.codfw.wmnet 140.0.192.10.in-addr.arpa 0.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:18:12] !log cmooney@cumin1003 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) es2028.codfw.wmnet 140.0.192.10.in-addr.arpa 0.4.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [17:18:13] !log cmooney@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host es2028 [17:18:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1191 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86670 and previous config saved to /var/cache/conftool/dbconfig/20251216-171816-marostegui.json [17:18:24] T411163: Drop ar_sha1 from archive table in wmf production -
https://phabricator.wikimedia.org/T411163 [17:18:24] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:18:30] !log cmooney@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es2028 [17:18:30] !log cmooney@cumin1003 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host es2028 [17:18:33] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1194.eqiad.wmnet with reason: Maintenance [17:18:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86671 and previous config saved to /var/cache/conftool/dbconfig/20251216-171841-marostegui.json [17:20:18] (PS1) RLazarus: kubernetes: Set default Envoy version to 1.35.7 [puppet] - https://gerrit.wikimedia.org/r/1218799 (https://phabricator.wikimedia.org/T410975) [17:20:37] SRE, collaboration-services, Wikimedia-Mailing-lists: Put lists.wikimedia.org web interface behind LVS - https://phabricator.wikimedia.org/T286066#11465434 (Dzahn) You can remove the "Prepare tcpproxy VMs for accepting traffic on the new public IPs" and general tcpproxy part from the list above. That...
[17:23:53] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11465441 (10Jclark-ctr) @rkemper I do not have access to run down time [17:24:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:25:53] (03CR) 10Clément Goubert: [C:03+1] kubernetes: Set default Envoy version to 1.35.7 [puppet] - 10https://gerrit.wikimedia.org/r/1218799 (https://phabricator.wikimedia.org/T410975) (owner: 10RLazarus) [17:29:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:30:12] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11465461 (10Dzahn) >>! In T408592#11452756, @ATitkov wrote: > If anything is still not clear, please ask Hi @ATitkov thanks for the answer... [17:32:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:35:20] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465495 (10cmooney) >>! 
In T412807#11464931, @cmooney wrote: > Anyway that could also be the culprit, I'll kick off another reimage and see if it... [17:37:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:39:14] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1218801 [17:39:32] Hmm what's going on with the errors? Is someone checking? [17:39:48] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1218801 (owner: 10Ahmon Dancy) [17:40:32] claime: I'm noticing a lot of "Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded" errors today. [17:40:41] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1218801 (owner: 10Ahmon Dancy) [17:40:42] dancy: looks like circuit breaking [17:40:51] (looking at the spike in logstash) [17:42:19] dancy: https://grafana.wikimedia.org/goto/4lRK9PMvR?orgId=1 uhhh [17:42:29] that's a lot of w-o-w increase in connections [17:44:26] I don't have time to debug this unfortunately :/ It's already almost 7PM [17:45:01] Looking back on that graph over 30 days, there seems to be a steady upward trajectory for the codfw connections. [17:45:46] With a big spike around today. 
[17:45:48] dancy: yes but even just looking at the last 2 days, we have 3x'd the max rps in codfw [17:46:06] 2.91k last week, 7.7k this week [17:46:50] Started during the night of the 11th [17:51:52] (03PS7) 10Daniel Kinzler: rest gateway: split anon class [deployment-charts] - 10https://gerrit.wikimedia.org/r/1217516 (https://phabricator.wikimedia.org/T410379) [17:53:49] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465617 (10cmooney) I see these lines in `/var/log/syslog` in the busybox shell: ` Dec 16 17:31:55 netcfg[1167]: INFO: Activating interface eno1n... [17:54:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:59:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:59:45] !log Cleaned up old files (not deleted by logrotate) on centrallog1002; removed the rsyslog-debug file on centrallog1002. 
[17:59:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1800) [18:02:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:07:15] RESOLVED: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [18:17:28] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465779 (10elukey) @cmooney I am +1 on testing something like `d-i netcfg/link_wait_timeout string 10`, it seems an easy one to see if anything c... 
[18:18:41] (03CR) 10Eric Gardner: [C:03+2] Decommission Article Summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: 10Kimberly Sarabia) [18:20:36] (03Merged) 10jenkins-bot: Decommission Article Summaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217799 (https://phabricator.wikimedia.org/T411558) (owner: 10Kimberly Sarabia) [18:20:51] FIRING: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/3/3 (Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:21:06] (03PS1) 10CDanis: Revert^2 "zramswap: notify service on config change" [puppet] - 10https://gerrit.wikimedia.org/r/1218805 [18:21:06] !incidents [18:21:06] 7195 (UNACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [18:21:14] !ack 7195 [18:21:14] 7195 (ACKED) TransitPeeringTransportOutSaturation network sre (cr2-eqiad:9804 Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649} xe-3/3/3 gnmi eqiad) [18:23:16] (03PS2) 10Isabelle Hurbain-Palatin: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) [18:23:16] (03PS1) 10Isabelle Hurbain-Palatin: Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) [18:24:11] (03CR) 10CI reject: [V:04-1] Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) 
(owner: 10Isabelle Hurbain-Palatin) [18:25:51] RESOLVED: TransitPeeringTransportOutSaturation: Transit, peering or transport OUT traffic above 90% capacity - cr2-eqiad:xe-3/3/3 (Peering: Equinix (3198427-A Wikimedia-DC5-IX-01, MAC filter) {#2649}) #page - https://w.wiki/Gbyf - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutSaturation [18:27:02] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [18:27:42] (03PS1) 10Majavah: P:mail::smarthost: Include Exim queue Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1218807 [18:27:43] (03PS1) 10Majavah: P:mail::smarthost: Remove NRPE monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1218808 [18:30:55] (03CR) 10Bking: [C:03+2] alertmanager: onboard wikidata platform. 
[puppet] - 10https://gerrit.wikimedia.org/r/1218735 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena) [18:32:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86672 and previous config saved to /var/cache/conftool/dbconfig/20251216-183208-marostegui.json [18:32:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [18:32:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [18:33:36] (03CR) 10Majavah: [C:03+2] P:mail::smarthost: Include Exim queue Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/1218807 (owner: 10Majavah) [18:33:47] RESOLVED: [4x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:17] (03CR) 10Bking: [C:03+1] wdqs: register blazegraph with wikidata platform [alerts] - 10https://gerrit.wikimedia.org/r/1218740 (https://phabricator.wikimedia.org/T412782) (owner: 10Gmodena) [18:35:02] (03PS2) 10Isabelle Hurbain-Palatin: Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) [18:35:03] (03PS3) 10Isabelle Hurbain-Palatin: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) [18:35:54] (03CR) 10CI reject: [V:04-1] Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [18:37:39] (03CR) 10Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes 
to avoid $wgThumbnailSteps (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: 10Pppery) [18:38:46] !log cmooney@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es2028.codfw.wmnet with OS trixie [18:38:59] 06SRE, 06Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11465914 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1003 for host es2028.codfw.wmnet with OS trixie execu... [18:47:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P86673 and previous config saved to /var/cache/conftool/dbconfig/20251216-184717-marostegui.json [18:47:41] (03PS3) 10Isabelle Hurbain-Palatin: Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) [19:00:04] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T1900) [19:02:11] o/ [19:02:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P86674 and previous config saved to /var/cache/conftool/dbconfig/20251216-190225-marostegui.json [19:04:28] (03PS1) 10Dzahn: admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) [19:04:37] (03CR) 10CI reject: [V:04-1] admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:04:52] (03PS2) 10Dzahn: admin_ng/miscweb: remove donate.wikipedia25.org from 
tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) [19:05:36] (03CR) 10Dzahn: [C:03+2] miscweb: add wikipedia25.org to extra SANs (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:05:45] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1218813" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:06:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 343554512 and 34 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:08:39] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 42704 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [19:10:06] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466031 (10ATitkov) > Would it be ok with you if we do that next week, on December 22nd? Yes, I think also Friday 19 Dec is possible, sinc... 
[19:11:35] (03PS1) 10TrainBranchBot: group0 to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218814 (https://phabricator.wikimedia.org/T408277) [19:11:37] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by dancy@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218814 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [19:12:25] (03Merged) 10jenkins-bot: group0 to 1.46.0-wmf.7 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218814 (https://phabricator.wikimedia.org/T408277) (owner: 10TrainBranchBot) [19:13:39] (03CR) 10Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:14:20] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466051 (10ATitkov) In regards to the request that the site should be published at 8:30 UTC on Jan 15th 2026, I am wondering if we can use a... 
[19:16:43] (03PS3) 10Dzahn: ncredir: remove wikipedia25.org, keep wikipedia25.com to www.wikipedia25.org [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) [19:17:27] (03CR) 10Dzahn: "rebased and answered inline question" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:17:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86675 and previous config saved to /var/cache/conftool/dbconfig/20251216-191733-marostegui.json [19:17:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:17:40] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:17:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1202.eqiad.wmnet with reason: Maintenance [19:17:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1202 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86676 and previous config saved to /var/cache/conftool/dbconfig/20251216-191759-marostegui.json [19:18:43] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 to 1.46.0-wmf.7 refs T408277 [19:18:47] T408277: 1.46.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T408277 [19:19:35] (03CR) 10Gergő Tisza: [C:03+1] "Tested the command hint and the dashboard link with a recent task and they both work as expected." 
[alerts] - 10https://gerrit.wikimedia.org/r/1218756 (https://phabricator.wikimedia.org/T412799) (owner: 10Clément Goubert) [19:23:18] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on an-worker1148.eqiad.wmnet with reason: T411919 [19:23:22] T411919: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919 [19:24:38] 10ops-eqiad, 06SRE, 06DC-Ops: hw troubleshooting: PERC1 battery failure for an-worker1148 - https://phabricator.wikimedia.org/T411919#11466095 (10RKemper) >>! In T411919#11465441, @Jclark-ctr wrote: > @rkemper I do not have access to run down time Ah, didn't realize. Okay, I put a downtime on `an-worker114... [19:25:11] (03PS1) 10Milimetric: trafficserver: Send /evt-502b/v2/events to intake-analytics [puppet] - 10https://gerrit.wikimedia.org/r/1218817 (https://phabricator.wikimedia.org/T412863) [19:25:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2200.codfw.wmnet with reason: Maintenance [19:25:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2208.codfw.wmnet with reason: Maintenance [19:26:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2208 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86678 and previous config saved to /var/cache/conftool/dbconfig/20251216-192603-marostegui.json [19:26:09] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:26:09] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:26:48] (03PS11) 10Dzahn: unlink wikipedia25.org from ncredir, point to k8s-ingress [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) [19:29:05] (03CR) 10Dzahn: unlink wikipedia25.org from ncredir, point to k8s-ingress (034 comments) [dns] - 
10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:31:53] (03PS1) 10Ahmon Dancy: Update wmf-config hacks for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1218818 [19:33:05] (03CR) 10Ahmon Dancy: [C:03+2] Update wmf-config hacks for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1218818 (owner: 10Ahmon Dancy) [19:33:57] (03Merged) 10jenkins-bot: Update wmf-config hacks for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1218818 (owner: 10Ahmon Dancy) [19:41:54] (03CR) 10Dzahn: [C:03+2] admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:44:11] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:48:30] (03PS12) 10Dzahn: unlink wikipedia25.org from ncredir, point to geoip text-addrs [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) [19:48:38] (03CR) 10Dzahn: unlink wikipedia25.org from ncredir, point to geoip text-addrs (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:49:24] (03Merged) 10jenkins-bot: admin_ng/miscweb: remove donate.wikipedia25.org from tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218813 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [19:55:15] (03CR) 10Dzahn: "realizing now this is just like a cleanup that can happen any time later.. 
on or after Jan 15 - the DNS change is the only thing that matt" [puppet] - 10https://gerrit.wikimedia.org/r/1216856 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [20:02:06] (03PS1) 10Herron: arclamp: reduce compress days [puppet] - 10https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) [20:02:34] (03CR) 10CI reject: [V:04-1] arclamp: reduce compress days [puppet] - 10https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) (owner: 10Herron) [20:03:14] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466178 (10Dzahn) >>! In T408592#11466031, @ATitkov wrote: >> Would it be ok with you if we do that next week, on December 22nd? > > Yes,... [20:03:19] (03PS2) 10Herron: arclamp: reduce compress days [puppet] - 10https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) [20:05:52] (03CR) 10Dzahn: "It might be considered nicer to just change the 2 relevant lines in an existing zone file.. but since this is currently on ncredir.. it is" [dns] - 10https://gerrit.wikimedia.org/r/1216843 (https://phabricator.wikimedia.org/T408592) (owner: 10Dzahn) [20:08:37] 06SRE, 06collaboration-services, 13Patch-For-Review, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466187 (10Dzahn) Once we have moved the repo to the new location, and with the config for CI to build the docker images that Jelto has alre... 
[20:10:24] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate config-master.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:10:40] (03CR) 10Dzahn: [C:03+1] arclamp: reduce compress days [puppet] - 10https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) (owner: 10Herron) [20:12:02] (03CR) 10Herron: [C:03+2] "Thanks @dzahn@wikimedia.org!" [puppet] - 10https://gerrit.wikimedia.org/r/1218821 (https://phabricator.wikimedia.org/T412842) (owner: 10Herron) [20:16:14] !log dzahn@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [20:17:04] !log dzahn@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [20:17:24] I was going to deploy something to admin_ng on k8s but I said NO to the diff. [20:17:34] reason: unrelated changes in my diff. undeployed. [20:18:12] thinking about fully reverting mine or leaving it as it is [20:18:28] afaict there are even 2 different undeployed but merged changes [20:18:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 537000344 and 56 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:19:39] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 2164432 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [20:31:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86680 and previous config saved to /var/cache/conftool/dbconfig/20251216-203153-marostegui.json [20:31:59] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:32:00] T411164: Drop rev_sha1 from revision table in wmf production - 
https://phabricator.wikimedia.org/T411164 [20:33:24] (03PS1) 10Eric Gardner: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) [20:36:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - 10https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) (owner: 10Eric Gardner) [20:39:26] (03PS1) 10Kosta Harlan: product_metrics.special_create_account: Collect mediawiki_database [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) [20:40:00] (03CR) 10Clare Ming: [C:03+1] product_metrics.special_create_account: Collect mediawiki_database [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: 10Kosta Harlan) [20:40:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 16 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: 10Kosta Harlan) [20:40:19] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 60%, RTA = 4073.50 ms [20:40:41] RECOVERY - Host wikikube-worker1275 is UP: PING WARNING - Packet loss = 33%, RTA = 620.39 ms [20:46:55] 06SRE, 06Fundraising-Backlog, 06Fundraising-Tech-Roadmap, 10MediaWiki-extensions-CentralNotice, 06Traffic: Set expiry time for GeoIP cookies - https://phabricator.wikimedia.org/T122097#11466380 (10AKanji-WMF) [20:47:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P86681 and previous config saved to 
/var/cache/conftool/dbconfig/20251216-204701-marostegui.json [20:58:09] (03CR) 10Phuedx: [C:03+1] "Confirming that this should work. `mediawiki.database AIUI `$wgDBname` is e" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: 10Kosta Harlan) [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T2100). [21:00:05] JSherman, tgr, EricGardner, and kostajh: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:15] \o [21:00:17] o/ [21:00:24] (03CR) 10Phuedx: [C:03+1] "Sorry." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: 10Kosta Harlan) [21:00:38] my patch is a noop, feel free to bundle it with something else [21:00:40] i'm here [21:00:44] same for mine [21:01:20] ah, very good [21:01:56] Those are also config, so they should go relatively fast [21:02:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202', diff saved to https://phabricator.wikimedia.org/P86682 and previous config saved to /var/cache/conftool/dbconfig/20251216-210210-marostegui.json [21:02:16] I can wait until other patches are done to deploy mine (which is just an instrumentation change) [21:02:29] I'm happy to deploy if we don't have another deployer on hand [21:02:51] EricGardner: mine is really low risk, maybe we could bundle ours together to save time?
[21:02:56] Sure, sounds good [21:03:21] okay, I'll start with tgr_: and kostajh: together [21:03:32] thanks [21:04:51] kostajh: it didn't want to let me bundle yours; I'll do tgr_ and then you [21:06:07] oh, actually it was yours, tgr_: [21:06:07] > Error for Change '1217790', project: 'operations/mediawiki-config', branch: 'master': [21:06:07] Change '1217790' has dependency '1203252' targeting the master branch [21:06:07] of MediaWiki code project 'mediawiki/core', but the dependency is not [21:06:07] present in live train branch: wmf/1.46.0-wmf.5 [21:06:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: 10Kosta Harlan) [21:07:29] (03Merged) 10jenkins-bot: product_metrics.special_create_account: Collect mediawiki_database [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218828 (https://phabricator.wikimedia.org/T412866) (owner: 10Kosta Harlan) [21:07:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 42927688 and 4 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:08:02] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]] [21:08:06] T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866 [21:08:12] hm I suppose scap is correct on that [21:08:39] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3253376 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:08:41] it's not really dependent on the other patch, I just wanted to link to it [21:08:52] in any case, I can just wait until Thursday [21:09:31] kostajh: is there any testing to do for yours,
or just move on if it deploys happily? [21:09:55] tgr_: ack [21:10:22] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 18 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1217790 (https://phabricator.wikimedia.org/T142542) (owner: 10Gergő Tisza) [21:11:34] JSherman: you can just sync it [21:12:05] kostajh: ack [21:13:06] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1218830 [21:13:39] (03CR) 10BBlack: [C:03+1] "I would just go ahead and remove all 3 in one patch really, but perhaps check turnilo to see if we have any recent samples of matching tra" [puppet] - 10https://gerrit.wikimedia.org/r/1215329 (owner: 10Dzahn) [21:13:57] (03PS4) 10C. Scott Ananian: Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [21:13:57] (03PS4) 10C. 
Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [21:15:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 232649800 and 59 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:16:39] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 0 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:17:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1202 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86683 and previous config saved to /var/cache/conftool/dbconfig/20251216-211718-marostegui.json [21:17:25] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:17:25] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:17:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1227.eqiad.wmnet with reason: Maintenance [21:17:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1227 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86684 and previous config saved to /var/cache/conftool/dbconfig/20251216-211743-marostegui.json [21:19:33] (03PS5) 10C. Scott Ananian: Activate post-processing cache on some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [21:19:33] (03PS5) 10C. 
Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:27:07] (CR) C. Scott Ananian: [C:+1] Activate post-processing cache on some wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218806 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:30:39] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 1280712320 and 90 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:31:39] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 3731512 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [21:33:19] we're waiting still on the "building container images" step; no errors at this time. [21:33:59] (PS6) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:34:07] (CR) C. Scott Ananian: Enable post-processing cache for all Parsoid-rendered wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:35:02] (CR) C.
Scott Ananian: [C:+1] Enable post-processing cache for all Parsoid-rendered wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1218793 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin) [21:38:28] (PS1) Bking: opensearch-cluster: Replace reload certificates API call with hot reload setting [deployment-charts] - https://gerrit.wikimedia.org/r/1218834 (https://phabricator.wikimedia.org/T412447) [21:39:25] JSherman: It will take a long time due to the localisation rebuild. [21:40:04] dancy: ack [21:40:21] And pushing the image to the registry might be sketchy (T412265). Fingers crossed! [21:40:22] T412265: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265 [21:41:27] yeah, I saw that happen in a window last week; hoping we're just taking our time on that i18n cache build! [21:43:58] brb [21:45:42] EricGardner: ack; we might be able to overrun into the next window as it's noted as often skipped. I won't be able to stay for that whole window though. [21:45:52] SRE, collaboration-services, Patch-For-Review, PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11466529 (Dzahn) [21:46:58] I can stay for that window too [21:47:26] (theoretically that window belongs to my team and the new reader experiences team anyway, since web team is no more) [21:48:03] I didn't know who inherited it!
[21:48:26] Yeah I suppose we should update that on the deployments page at some point [21:58:21] !log mwscript-k8s --follow -- findBadBlobs.php --wiki elwiki --mark "Corrupted UTF-8 (T351953)" --revisions 26381,30551 (T351953) [21:58:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:25] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251216T2200) [22:02:55] !log jsn@deploy2002 sync-world failed: Command 'sudo -u mwbuilder /srv/mwbuilder/release/make-container-image/build-images.py --http-proxy http://webproxy:8080 --https-proxy http://webproxy:8080 /srv/mediawiki-staging/scap/image-build --staging-dir /srv/mediawiki-staging --mediawiki-versions 1.46.0-wmf.5,1.46.0-wmf.7,next --multiversion-image-basename docker-registry.discovery.wmnet/restricted/mediawi [22:02:55] ki-multiversion --singleversion-image-basename docker-registry.discovery.wmnet/restricted/mediawiki-singleversion --webserver-image-name docker-registry.discovery.wmnet/restricted/mediawiki-webserver --latest-tag latest --label vnd.wikimedia.builder.name=scap --label vnd.wikimedia.builder.version=4.229.0 --label vnd.wikimedia.scap.stage_dir=/srv/mediawiki-staging --label vnd.wikimedia.scap.build_state_dir=/srv/mediawiki-s [22:02:55] taging/scap/image-build' returned non-zero exit status 1. (scap version: 4.229.0) (duration: 54m 53s) [22:04:09] Well, that failed [22:06:03] JSherman: if it failed when trying to upload the docker image, possibly https://phabricator.wikimedia.org/T412265 [22:07:19] Yeah, I guess the question is, what to do now; revert? 
[22:07:39] SRE, Infrastructure-Foundations: Dell R740xd reimage fails in debian-installer, configures IP on incorrect interface - https://phabricator.wikimedia.org/T412807#11466610 (Jhancock.wm) time delayed reply. @cmooney BIOS > integrated devices > (pick appropriate interface) > NIC configuration > Legacy Boot... [22:10:03] JSherman: dont have a good answer but maybe "try it one more time" and if you can repeat it.. THEN revert [22:12:42] I'll give it a shot [22:13:41] !log jsn@deploy2002 Started scap sync-world: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]] [22:13:45] T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866 [22:19:55] PROBLEM - Host titan1002 is DOWN: PING CRITICAL - Packet loss = 100% [22:20:39] RECOVERY - Host titan1002 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [22:22:21] PROBLEM - Host wikikube-worker1275 is DOWN: PING CRITICAL - Packet loss = 100% [22:23:29] RECOVERY - Host wikikube-worker1275 is UP: PING OK - Packet loss = 0%, RTA = 0.30 ms [22:28:46] JSherman: would you also mind noting that on the task what happened? even if it passes on the second try, it makes sense to have it recorded. [22:29:11] FIRING: [2x] ErrorBudgetBurn: xlab-standalone-event-system-success-rate-v1 - https://slo.wikimedia.org/?search=xlab-standalone-event-system-success-rate-v1 - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [22:29:13] urbanecm: ack [22:29:18] ty [22:29:37] (CR) Urbanecm: [C:+1] "LGTM, thanks for your work on this!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery) [22:33:43] SRE, MW-on-K8s, serviceops: Pushing to the docker registry fails with 500 Internal Server Error - https://phabricator.wikimedia.org/T412265#11466660 (jsn.sherman) This happened again in the UTC late backport window: https://sal.toolforge.org/log/5jwwKZsBvg159pQrFeSI https://spiderpig.wikimedia.org/j... [22:40:49] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-test: apply [22:41:09] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/opensearch-test: apply [22:41:42] (PS1) Robertsky: lift throttle limits for Sing Lit 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) [22:42:35] (CR) CI reject: [V:-1] lift throttle limits for Sing Lit 2025 [mediawiki-config] - https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: Robertsky) [22:46:47] (PS8) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [22:47:35] (CR) CI reject: [V:-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery) [22:47:41] (CR) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery) [22:47:54] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, December 17 UTC afternoon backport
window](https://wikitech.wikimedia.org/wiki/Deployments#dep" [mediawiki-config] - https://gerrit.wikimedia.org/r/1218853 (https://phabricator.wikimedia.org/T412820) (owner: Robertsky) [22:50:04] (PS9) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [22:50:45] !log jsn@deploy2002 kharlan, jsn: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [22:50:49] T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866 [22:50:52] (CR) CI reject: [V:-1] Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) (owner: Pppery) [22:51:15] !log jsn@deploy2002 kharlan, jsn: Continuing with sync [22:51:41] (PS10) Pppery: Logos: Handle missing responsive URLs, manually modify thumbnail sizes to avoid $wgThumbnailSteps [mediawiki-config] - https://gerrit.wikimedia.org/r/1217282 (https://phabricator.wikimedia.org/T405169) [22:57:34] JSherman: are you still waiting on this task to complete? [22:59:20] EricGardner: The last phase of the deployment is still in progress. 38% done [23:00:36] dancy: thanks! I will stay tuned I guess [23:00:50] EricGardner: yep, didn't expect the one to take 2 hrs! [23:01:51] I will absolutely have to drop after this completes [23:02:24] JSherman: yeah, last time i run into this, i spent ~4 hrs in total (two attempts and a revert) :/. i hope that's not the case here.
[23:02:52] but it has built, which is good [23:03:15] Crossing my fingers here at ~90% [23:04:26] !log jsn@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218828|product_metrics.special_create_account: Collect mediawiki_database (T412866)]] (duration: 50m 45s) [23:04:30] T412866: product_metrics.special_create_account: Record the wiki used on action=submit - https://phabricator.wikimedia.org/T412866 [23:04:43] Finished! [23:04:57] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for dr0ptp4kt - https://phabricator.wikimedia.org/T412875 (dr0ptp4kt) NEW [23:05:20] EricGardner: I'm really sorry we got bumped [23:06:31] JSherman: No prob – are you sticking around to backport your patch now? I may be able to do both of ours if you have to go [23:07:50] I have to drop, so that would be great [23:08:38] Mine is just adding an extra data attribute for a future change, so it shouldn't have any impact [23:17:54] Ok. If no one here objects, I will proceed with deploying JSherman's patch (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1218776) as well as my own (https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1218825) since we got bumped out of our window earlier [23:19:30] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) (owner: Jsn.sherman) [23:19:31] (CR) TrainBranchBot: [C:+2] "Approved by egardner@deploy2002 using scap backport" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner) [23:23:31] (Merged) jenkins-bot: [Moderator tools] Add data-mw-interface in addition to data-mw="interface" [core] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218776 (https://phabricator.wikimedia.org/T409187) (owner: Jsn.sherman) [23:27:39]
(Merged) jenkins-bot: Delay StickyHeaders section click instrumentation for slow loads [extensions/WikimediaEvents] (wmf/1.46.0-wmf.7) - https://gerrit.wikimedia.org/r/1218825 (https://phabricator.wikimedia.org/T412857) (owner: Eric Gardner) [23:28:18] !log egardner@deploy2002 Started scap sync-world: Backport for [[gerrit:1218776|[Moderator tools] Add data-mw-interface in addition to data-mw="interface" (T409187)]], [[gerrit:1218825|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] [23:28:23] T409187: The `data-mw` attribute should be reserved for Parsoid use; rename data-mw="interface" to data-mw-interface - https://phabricator.wikimedia.org/T409187 [23:28:24] T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857 [23:32:30] !log egardner@deploy2002 jsn, egardner: Backport for [[gerrit:1218776|[Moderator tools] Add data-mw-interface in addition to data-mw="interface" (T409187)]], [[gerrit:1218825|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:34:08] !log egardner@deploy2002 jsn, egardner: Continuing with sync [23:40:04] !log egardner@deploy2002 Finished scap sync-world: Backport for [[gerrit:1218776|[Moderator tools] Add data-mw-interface in addition to data-mw="interface" (T409187)]], [[gerrit:1218825|Delay StickyHeaders section click instrumentation for slow loads (T412857)]] (duration: 11m 47s) [23:40:10] T409187: The `data-mw` attribute should be reserved for Parsoid use; rename data-mw="interface" to data-mw-interface - https://phabricator.wikimedia.org/T409187 [23:40:10] T412857: Sticky Headers: Distinguish automatic vs user-initiated section toggles - https://phabricator.wikimedia.org/T412857 [23:40:48] JSherman: your patch is deployed [23:45:24] FIRING: PuppetFailure: Puppet has failed on ml-build1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:53:29] (CR) RLazarus: [C:+2] kubernetes: Set default Envoy version to 1.35.7 [puppet] - https://gerrit.wikimedia.org/r/1218799 (https://phabricator.wikimedia.org/T410975) (owner: RLazarus)