[00:06:36] (03CR) 10Cwhite: [C:03+1] service::catalog: add 'team' attribute [puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [00:21:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:24:44] (03CR) 10Cwhite: Followup I81a2c4de77: Verify stats label values are not empty (031 comment) [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: 10Jforrester) [00:30:39] (03CR) 10Catrope: [C:03+1] OATHAuth: Remove wmgOATHAuthDisableRight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [00:31:12] PROBLEM - dump of s5 in codfw on backupmon1001 is CRITICAL: dump for s5 at codfw (db2201) taken more than a week ago: Most recent backup 2025-11-25 00:00:05 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:31:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [00:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:40:20] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] 
(wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1214682 [00:40:20] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1214682 (owner: 10TrainBranchBot) [00:49:02] (03CR) 10Cwhite: [C:03+1] Blackbox/check: strengthen suffix matching regex in generated rules [puppet] - 10https://gerrit.wikimedia.org/r/1208365 (https://phabricator.wikimedia.org/T410745) (owner: 10Tiziano Fogli) [00:49:38] (03CR) 10Cwhite: [C:03+1] sre: multi-team ProbeDown [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [00:51:27] (03CR) 10Dzahn: [C:03+1] Add astein to analytics-privatedata-users. [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [00:51:42] (03CR) 10Andrea Denisse: [C:03+2] Add astein to analytics-privatedata-users. [puppet] - 10https://gerrit.wikimedia.org/r/1214658 (https://phabricator.wikimedia.org/T411679) (owner: 10Andrea Denisse) [00:53:19] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1214682 (owner: 10TrainBranchBot) [00:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:00:40] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:06:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:10:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1214690 [01:10:52] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1214690 (owner: 10TrainBranchBot) [01:18:28] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 17m 47s) [01:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [01:23:14] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1157.eqiad.wmnet with reason: Maintenance [01:23:22] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T410589)', diff saved to https://phabricator.wikimedia.org/P86394 and previous config saved to /var/cache/conftool/dbconfig/20251204-012321-ladsgroup.json [01:23:25] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [01:32:06] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1214690 (owner: 10TrainBranchBot) [01:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [01:41:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:46:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [01:49:20] (03PS1) 10Pppery: Remove old list of translated languages [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214701 [01:56:46] (03PS1) 10Pppery: Add .gitreview [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214702 [01:57:04] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [01:57:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, 445419311 80ms 10Gbps wave) {#2013}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [01:58:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [01:58:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr1-esams and 
cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [01:58:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:02:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-esams:xe-0/0/7 (Transport: cr2-eqiad:xe-3/2/1 (Colt, 445419311 80ms 10Gbps wave) {#30385}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [02:03:10] RESOLVED: [2x] BFDdown: BFD session down between cr2-eqiad and 185.15.59.149 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [02:03:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr1-esams and cr2-eqiad (185.15.59.148) - group Confed_eqiad - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:10:22] (03PS1) 10Pppery: Replace "libphutil" with "Arcanist" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1214708 [02:29:42] PROBLEM - OSPF status on cr2-eqdfw is 
CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:29:42] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:31:39] FIRING: CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (2a02:ec80:700:fe0b::2) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:31:42] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:31:42] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [02:36:39] RESOLVED: [2x] CoreBGPDown: Core BGP session down between cr2-eqdfw and cr2-magru (195.200.68.153) - group Confed_magru - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=cr2-eqdfw:9804&var-bgp_group=Confed_magru&var-bgp_neighbor=cr2-magru - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [02:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:25:53] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [03:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [04:33:10] FIRING: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:38:10] RESOLVED: BFDdown: BFD session down between cr2-magru and fe80::ee38:73ff:fee8:9c58 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-magru:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [04:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:56:12] PROBLEM - snapshot of s3 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s3 at eqiad (db1150) taken on 2025-12-04 04:06:37 is 872 GiB, but the previous one was 1145 GiB, a change of -23.8 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:18:37] (03PS1) 10Jdlrobson: Filter another client adding noise [puppet] - 10https://gerrit.wikimedia.org/r/1214759 [05:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [05:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is 
consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [05:51:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:56:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [05:58:20] (03PS1) 10Marostegui: installserver: Add UEFI recipe to future clouddb* [puppet] - 10https://gerrit.wikimedia.org/r/1214777 [06:00:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06cloud-services-team (Hardware): Q2:rack/setup/install clouddb1026-1033 - https://phabricator.wikimedia.org/T409162#11431559 (10Marostegui) [06:08:10] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11431564 (10Marostegui) p:05Triage→03Medium [06:12:17] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:25:38] (03CR) 10Tiziano Fogli: [C:03+1] service::catalog: add 'team' attribute [puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [06:27:19] 06SRE, 06Data-Platform-SRE (2025.11.07 - 
2025.11.28), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11431570 (10RKemper) an-worker* partially done. made https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1214664 to al... [06:28:25] 06SRE, 06Data-Platform-SRE (2025.11.07 - 2025.11.28), 13Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11431572 (10RKemper) Oh, with respect to the patch, we should also get https://gerrit.wikimedia.org/r/c/operations/cookboo... [06:30:08] (03CR) 10Tiziano Fogli: [C:03+1] sre: multi-team ProbeDown [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [06:53:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:57:17] FIRING: [5x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:59:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:00:05] Deploy 
window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T0700) [07:00:05] marostegui, Amir1, and federico3: #bothumor I � Unicode. All rise for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T0700). [07:02:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:03:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:08:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:09:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:13:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [07:14:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:19:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [07:25:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [07:27:21] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/1214556 (https://phabricator.wikimedia.org/T311407) (owner: 10Muehlenhoff) [07:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:35:56] (03PS1) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015) [07:36:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1214665 (https://phabricator.wikimedia.org/T411730) (owner: 10Brennen Bearnes) [07:36:43] (03CR) 10Muehlenhoff: [C:03+2] admin: add fido backed ssh key for brennen [puppet] - 10https://gerrit.wikimedia.org/r/1214665 (https://phabricator.wikimedia.org/T411730) (owner: 10Brennen Bearnes) [07:40:32] (03PS2) 10Arthur taylor: Enable the MEX / wbui2025 beta feature on wikidata [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015)
[07:43:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2012:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:46:15] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Setup a maps staging DB - https://phabricator.wikimedia.org/T409528#11431623 (10MoritzMuehlenhoff) The initial imposm catchup sync after the PBF import has just completed.
[07:48:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2012:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[07:51:48] (03PS2) 10Giuseppe Lavagetto: kubernetes::deployment_server: add files for configuring conftool [puppet] - 10https://gerrit.wikimedia.org/r/1214524
[07:55:58] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:00:05] Amir1, Urbanecm, and awight: OwO what's this, a deployment window?? UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T0800). nyaa~
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:02:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:07:17] FIRING: [16x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:09:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:09:14] (03PS1) 10Muehlenhoff: Remove platform-engineering POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215058 [08:10:00] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [08:10:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:10:54] (03PS1) 10Gehel: Druid: open firewall access to Druid from the FRTech network [puppet] - 10https://gerrit.wikimedia.org/r/1215059 (https://phabricator.wikimedia.org/T411740) [08:10:57] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [08:12:17] FIRING: 
[20x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:13:21] (03PS1) 10Zoranzoki21: Add Serbian Latin draft namespace and talk namespace aliases [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215060 (https://phabricator.wikimedia.org/T411750)
[08:14:53] (03PS1) 10Muehlenhoff: Remove piwik-roots POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215061
[08:14:54] Hi, is someone here to deploy one mediawiki-config patch, as the backport window is ongoing right now?
[08:14:59] Or I should add it for the next one?
[08:16:20] urbanecm?
[08:16:24] urbanecm: ?
[08:16:37] (03PS2) 10Gehel: Druid: open firewall access to Druid from the FRTech network [puppet] - 10https://gerrit.wikimedia.org/r/1215059 (https://phabricator.wikimedia.org/T411740)
[08:16:58] Tag is not working for some reason.. Nevermind, I'll add it for the next deployment window..
[08:17:17] FIRING: [20x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:17:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215060 (https://phabricator.wikimedia.org/T411750) (owner: 10Zoranzoki21) [08:17:54] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215059 (https://phabricator.wikimedia.org/T411740) (owner: 10Gehel) [08:18:28] (03PS1) 10Slyngshede: data.yaml: Offboarding arinaigum [puppet] - 10https://gerrit.wikimedia.org/r/1215063 [08:19:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:19:13] (03CR) 10CI reject: [V:04-1] data.yaml: Offboarding arinaigum [puppet] - 10https://gerrit.wikimedia.org/r/1215063 (owner: 10Slyngshede) [08:21:38] (03CR) 10Jelto: "I was not aware of Ie50e2f89b0dddd62e7206dff185545e0242fa6a5, we can use your patch and add the ssh service later on." 
[puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [08:21:44] (03Abandoned) 10Jelto: service::catalog: add gerrit-https and gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [08:22:17] FIRING: [20x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:22:55] (03PS2) 10Slyngshede: data.yaml: Offboarding arinaigum [puppet] - 10https://gerrit.wikimedia.org/r/1215063 [08:23:29] (03PS1) 10Muehlenhoff: Remove notebook-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215065 [08:23:39] (03CR) 10CI reject: [V:04-1] data.yaml: Offboarding arinaigum [puppet] - 10https://gerrit.wikimedia.org/r/1215063 (owner: 10Slyngshede) [08:23:53] (03PS3) 10Gehel: Druid: open firewall access to Druid from the FRTech network [puppet] - 10https://gerrit.wikimedia.org/r/1215059 (https://phabricator.wikimedia.org/T411740) [08:24:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:24:11] (03CR) 10Gehel: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215059 (https://phabricator.wikimedia.org/T411740) (owner: 10Gehel) [08:25:36] (03PS3) 10Slyngshede: data.yaml: Offboarding arinaigum [puppet] - 10https://gerrit.wikimedia.org/r/1215063 [08:27:02] (03PS1) 10Muehlenhoff: Remove labnet-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215067 [08:27:16] (03CR) 10Slyngshede: [C:03+2] Phabricator: 
Allow users to link Phabricator and developer accounts [software/bitu] - 10https://gerrit.wikimedia.org/r/1196919 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [08:27:17] (03CR) 10Jelto: service: add gerrit-https service to service catalog (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [08:27:49] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2021:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:28:09] (03CR) 10Slyngshede: [C:03+1] Remove platform-engineering POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215058 (owner: 10Muehlenhoff) [08:28:34] (03CR) 10Slyngshede: [C:03+1] Remove piwik-roots POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215061 (owner: 10Muehlenhoff) [08:30:18] (03Merged) 10jenkins-bot: Phabricator: Allow users to link Phabricator and developer accounts [software/bitu] - 10https://gerrit.wikimedia.org/r/1196919 (https://phabricator.wikimedia.org/T406495) (owner: 10Slyngshede) [08:30:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:30:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:31:10] (03PS1) 10Muehlenhoff: Remove 
eventbus-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215068 [08:32:17] FIRING: [18x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:32:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2021:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:33:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:34:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:34:54] PROBLEM - Check unit status of statograph_post on alert1002 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:35:00] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:35:16] ^^ looking at thanos [08:35:38] thanks [08:35:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1215063 (owner: 10Slyngshede) [08:35:48] (03PS1) 10Muehlenhoff: Remove gpu-testers POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215069 [08:36:20] hnowlan: o/ sup [08:36:52] yoyo [08:37:10] hmm, looks like my bouncer is broken not replaying messages :/ [08:37:25] was the p.age just a fart? [08:38:06] (03CR) 10Muehlenhoff: [C:03+2] Remove platform-engineering POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215058 (owner: 10Muehlenhoff) [08:38:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:38:59] not sure, still looking - definitely see a big increase in most metrics on the thanos hosts [08:39:15] so kinda looks like a big query? 
[08:39:43] (03CR) 10Slyngshede: [C:03+1] Remove labnet-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215067 (owner: 10Muehlenhoff) [08:39:57] I've depooled titan2001 [08:40:00] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:40:11] (03CR) 10Slyngshede: [C:03+1] Remove eventbus-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215068 (owner: 10Muehlenhoff) [08:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:40:36] (03CR) 10Slyngshede: [C:03+2] data.yaml: Offboarding arinaigum [puppet] - 10https://gerrit.wikimedia.org/r/1215063 (owner: 10Slyngshede) [08:40:50] The disk is probably full due to the compactor. [08:41:01] tappof: would that cause extra load on thanos hosts? 
[08:41:06] tappof: fwiw it looks like the issue was in eqiad [08:41:29] https://grafana.wikimedia.org/goto/XDjaRsWvR?orgId=1 [08:41:34] yeah, right [08:41:47] (03CR) 10Slyngshede: [C:03+1] Remove gpu-testers POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215069 (owner: 10Muehlenhoff) [08:42:02] (03PS1) 10Muehlenhoff: Stop applying the os-installers group on cumin* and cloudcumin* nodes [puppet] - 10https://gerrit.wikimedia.org/r/1215073 (https://phabricator.wikimedia.org/T358361) [08:42:17] FIRING: [14x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:43:54] Well, I need to check, but IIRC, the Query Frontend is spreading queries across all the instances, including titan2001, which has its /srv partition 100% full. I'm checking.. [08:44:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:44:13] tappof: ah, okay [08:44:27] but cross-dc? 
[08:44:54] RECOVERY - Check unit status of statograph_post on alert1002 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:46:10] either way that disk free is worrying yeah [08:47:04] given that the page has resolved I'm going afk for a little bit, but I'm nearby so message if needed [08:47:12] ack [08:54:29] (03CR) 10Arnaudb: "yes both are useful, I will not merge 1211551 until we are successfully switched over and this one will be merged right before" [puppet] - 10https://gerrit.wikimedia.org/r/1196792 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [08:55:44] (03CR) 10Arnaudb: "indeed! I'll add that to our team meeting agenda. thanks for raising that concern!" [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [08:58:29] I am looking at the backend error logs before running the train [08:58:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:00:04] hashar and jnuche: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T0900) [09:02:43] (03CR) 10Hashar: Followup I81a2c4de77: Verify stats label values are not empty (031 comment) [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: 10Jforrester) [09:04:05] (03PS1) 10Hashar: REST: add explicit cast to sitemapSize calcuation to avoid warning [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215078 (https://phabricator.wikimedia.org/T411580) [09:04:49] (03CR) 10Hashar: [C:03+2] "That is trivial enough for a deployment and will cut log spam 😊" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215078 
(https://phabricator.wikimedia.org/T411580) (owner: 10Hashar) [09:04:58] (03CR) 10Hashar: [C:03+2] Followup I81a2c4de77: Verify stats label values are not empty [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: 10Jforrester) [09:05:19] some backports to cut on the log spam [09:06:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215078 (https://phabricator.wikimedia.org/T411580) (owner: 10Hashar) [09:06:05] (03CR) 10TrainBranchBot: [C:03+2] "Approved by hashar@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: 10Jforrester) [09:06:13] hmm [09:06:25] I guess it does not matter to have multiple CR+2 [09:06:43] (03CR) 10Slyngshede: [C:03+1] Remove notebook-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215065 (owner: 10Muehlenhoff) [09:08:56] (03CR) 10Filippo Giunchedi: [C:03+1] Remove labnet-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215067 (owner: 10Muehlenhoff) [09:09:35] (03CR) 10Majavah: [C:03+2] openstack: puppet: Remove support for X-Enc-Edit-Git [puppet] - 10https://gerrit.wikimedia.org/r/1214490 (owner: 10Majavah) [09:10:51] (03PS1) 10Brouberol: growthbook-next: add stub OIDC client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1215081 (https://phabricator.wikimedia.org/T411752) [09:12:08] (03PS1) 10Brouberol: growthbook: setup OIDC for both the production and next instance [puppet] - 10https://gerrit.wikimedia.org/r/1215082 (https://phabricator.wikimedia.org/T411752) [09:13:46] (03CR) 10Jelto: [C:03+1] gerrit: re-enable backups on gerrit2003 [puppet] - 10https://gerrit.wikimedia.org/r/1211551 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [09:15:12] (03PS1) 10Brouberol: growthbook: grant frontend access to the IDP servers 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1215083 (https://phabricator.wikimedia.org/T411752) [09:15:50] (03Abandoned) 10Slyngshede: C:tomcat10 hide stacktrace and server info [puppet] - 10https://gerrit.wikimedia.org/r/1207874 (owner: 10Slyngshede) [09:16:05] > but cross-dc? [09:16:25] Yes jayme hnowlan, for the Ruler component. Anyway, I found a "query of death" that started at 8:15, requesting 45 days of data for 4,400 series. [09:17:08] tappof: ok, thanks! [09:17:49] (03PS2) 10Brouberol: growthbook: grant frontend access to the IDP servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215083 (https://phabricator.wikimedia.org/T411752) [09:19:06] (03Merged) 10jenkins-bot: REST: add explicit cast to sitemapSize calcuation to avoid warning [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215078 (https://phabricator.wikimedia.org/T411580) (owner: 10Hashar) [09:19:12] (03Merged) 10jenkins-bot: Followup I81a2c4de77: Verify stats label values are not empty [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1214647 (https://phabricator.wikimedia.org/T411585) (owner: 10Jforrester) [09:19:39] !log slyngshede@cumin1003 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Arinaigum out of all services on: 2419 hosts [09:20:05] !log upgrade envoyproxy on vrts T405808 [09:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:20:08] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [09:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:20:30] !log hashar@deploy2002 Started scap sync-world: Backport for [[gerrit:1215078|REST: add explicit cast to 
sitemapSize calcuation to avoid warning (T411580)]], [[gerrit:1214647|Followup I81a2c4de77: Verify stats label values are not empty (T411585)]] [09:20:34] T411580: SitemapFileHandler: PHP Deprecated: Implicit conversion from float 33333.333333333336 to int loses precision - https://phabricator.wikimedia.org/T411580 [09:20:35] T411585: PHP Warning: Stats: (action_api_modules_hit_total): Stats: (action_api_modules_hit_total) Cannot associate label keys with label values - Not all initialized labels have an assigned value. - https://phabricator.wikimedia.org/T411585 [09:21:29] jayme: hnowlan The overlapping alert for disk saturation was just a matter of unlucky timing: I tried depooling titan2001 because I was blind—neither Grafana nor Thanos were working. A few seconds later, it started replying again, so I put the blame on titan2001… but the outage was definitely due to a “query of death” on eqiad. [09:21:51] tappof: good to know, thanks! [09:22:16] !log upgrade envoyproxy on lists T405808 [09:22:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:00] !log hashar@deploy2002 jforrester, hashar: Backport for [[gerrit:1215078|REST: add explicit cast to sitemapSize calcuation to avoid warning (T411580)]], [[gerrit:1214647|Followup I81a2c4de77: Verify stats label values are not empty (T411585)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[09:26:19] !log hashar@deploy2002 jforrester, hashar: Continuing with sync [09:26:37] (03PS11) 10AOkoth: vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) [09:27:06] (03CR) 10AOkoth: vrts: add high inode usage alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [09:29:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:30:19] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11431822 (10ayounsi) From that comment : T410989#11429115 cloudcephosd1052 still needs to be migrated. Both interfaces are still doing significant traffic : https://libre... [09:30:29] !log hashar@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215078|REST: add explicit cast to sitemapSize calcuation to avoid warning (T411580)]], [[gerrit:1214647|Followup I81a2c4de77: Verify stats label values are not empty (T411585)]] (duration: 09m 59s) [09:30:33] T411580: SitemapFileHandler: PHP Deprecated: Implicit conversion from float 33333.333333333336 to int loses precision - https://phabricator.wikimedia.org/T411580 [09:30:34] T411585: PHP Warning: Stats: (action_api_modules_hit_total): Stats: (action_api_modules_hit_total) Cannot associate label keys with label values - Not all initialized labels have an assigned value. 
- https://phabricator.wikimedia.org/T411585 [09:30:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [09:31:22] (03CR) 10AOkoth: "Yeah, I think we can." [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [09:31:44] I am doing the train now [09:32:07] (03PS1) 10TrainBranchBot: group2 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215085 (https://phabricator.wikimedia.org/T408275) [09:32:10] (03CR) 10TrainBranchBot: [C:03+2] "Initiated by hashar@deploy2002" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215085 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:32:58] (03Merged) 10jenkins-bot: group2 to 1.46.0-wmf.5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215085 (https://phabricator.wikimedia.org/T408275) (owner: 10TrainBranchBot) [09:33:30] (03CR) 10Ayounsi: [C:03+2] Tox: remove old python support [cookbooks] - 10https://gerrit.wikimedia.org/r/1214532 (owner: 10Ayounsi) [09:33:38] (03CR) 10Ayounsi: [C:03+2] sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [09:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [09:35:34] !log cleanup lingering sessions of offboarded user T389324 [09:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:37] T389324: /etc/wikimedia/logout.d/50-systemdlogoutd sometimes fails to 
terminate user session on stat hosts - https://phabricator.wikimedia.org/T389324 [09:37:04] (03CR) 10Jelto: [C:03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [09:37:45] (03PS1) 10Elukey: Move ml-serve1013 to a ML k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1215088 (https://phabricator.wikimedia.org/T403697) [09:38:43] (03Merged) 10jenkins-bot: Tox: remove old python support [cookbooks] - 10https://gerrit.wikimedia.org/r/1214532 (owner: 10Ayounsi) [09:38:54] (03CR) 10Elukey: [C:03+1] Remove gpu-testers POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215069 (owner: 10Muehlenhoff) [09:39:00] (03Merged) 10jenkins-bot: sre.network.tls: add timeout to get_server_certificate [cookbooks] - 10https://gerrit.wikimedia.org/r/1161337 (owner: 10Ayounsi) [09:39:21] (03CR) 10Elukey: [C:03+1] Remove piwik-roots POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215061 (owner: 10Muehlenhoff) [09:39:33] (03CR) 10Elukey: [C:03+1] Remove notebook-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215065 (owner: 10Muehlenhoff) [09:39:47] (03CR) 10Elukey: [C:03+1] Remove labnet-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215067 (owner: 10Muehlenhoff) [09:40:00] (03CR) 10Elukey: [C:03+1] Remove eventbus-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215068 (owner: 10Muehlenhoff) [09:43:01] !log hashar@deploy2002 rebuilt and synchronized wikiversions files: group2 to 1.46.0-wmf.5 refs T408275 [09:43:05] T408275: 1.46.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T408275 [09:43:22] (03CR) 10AOkoth: [C:03+2] vrts: re-enable cache cleanup timer [puppet] - 10https://gerrit.wikimedia.org/r/1214129 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [09:43:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:43:54] (03PS1) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215089 (https://phabricator.wikimedia.org/T409528) [09:44:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:46:33] 06SRE, 06Infrastructure-Foundations: /etc/wikimedia/logout.d/50-systemdlogoutd sometimes fails to terminate user session on stat hosts - https://phabricator.wikimedia.org/T389324#11431872 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [09:48:56] (03CR) 10Filippo Giunchedi: [C:03+2] service::catalog: add 'team' attribute [puppet] - 10https://gerrit.wikimedia.org/r/1214473 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [09:48:58] !log upgrade Envoy on an-launcher T405808 [09:49:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:01] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [09:50:09] it looks quiet [09:50:33] there is some warning raised but that is more or less the same as T411585 [09:50:34] T411585: PHP Warning: Stats: (action_api_modules_hit_total): Stats: (action_api_modules_hit_total) Cannot associate label keys with label values - Not all initialized labels have an assigned value. 
- https://phabricator.wikimedia.org/T411585 [09:50:43] PHP Warning: Stats: (action_api_modules_latency) Cannot add labels to a metric containing samples [09:50:52] I'll update the task after a coffee break [09:50:55] (03PS2) 10Muehlenhoff: Remove piwik-roots POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215061 [09:53:33] (03CR) 10Muehlenhoff: [C:03+2] Remove piwik-roots POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215061 (owner: 10Muehlenhoff) [09:53:58] (03PS2) 10Muehlenhoff: Remove notebook-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215065 [09:55:33] (03PS1) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) [09:57:20] (03Abandoned) 10Elukey: services: enable ingress for Kartotherian [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133389 (https://phabricator.wikimedia.org/T391457) (owner: 10Elukey) [09:57:47] (03Abandoned) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214526 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [10:00:21] (03CR) 10Klausman: [C:03+1] Remove gpu-testers POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215069 (owner: 10Muehlenhoff) [10:01:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:04:25] (03CR) 10Btullis: [C:03+1] growthbook-next: add stub OIDC client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1215081 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:04:52] (03CR) 10Filippo Giunchedi: "Chatted with Tobias, below my 
recommendation:" [puppet] - 10https://gerrit.wikimedia.org/r/1214530 (https://phabricator.wikimedia.org/T394778) (owner: 10Klausman) [10:05:08] (03CR) 10Btullis: [C:03+1] growthbook: setup OIDC for both the production and next instance [puppet] - 10https://gerrit.wikimedia.org/r/1215082 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:05:33] (03CR) 10Btullis: [C:03+1] growthbook: grant frontend access to the IDP servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215083 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:05:39] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [homer/public] - 10https://gerrit.wikimedia.org/r/1214537 (https://phabricator.wikimedia.org/T407959) (owner: 10Ayounsi) [10:08:47] (03CR) 10JMeybohm: services: add maps-next.w.o as FQDN for kartotherian staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [10:09:25] (03CR) 10Brouberol: [C:03+2] growthbook-next: add stub OIDC client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1215081 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:09:27] (03CR) 10Brouberol: [V:03+2 C:03+2] growthbook-next: add stub OIDC client secret [labs/private] - 10https://gerrit.wikimedia.org/r/1215081 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:09:44] (03CR) 10Brouberol: [C:03+2] growthbook: grant frontend access to the IDP servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215083 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:09:52] (03CR) 10Brouberol: [C:03+2] growthbook: setup OIDC for both the production and next instance [puppet] - 10https://gerrit.wikimedia.org/r/1215082 (https://phabricator.wikimedia.org/T411752) (owner: 10Brouberol) [10:12:54] (03CR) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [10:14:41] (03CR) 10Filippo Giunchedi: [C:03+2] sre: multi-team ProbeDown [alerts] - 10https://gerrit.wikimedia.org/r/1214478 (https://phabricator.wikimedia.org/T399807) (owner: 10Filippo Giunchedi) [10:19:28] (03PS1) 10Jelto: sre.gitlab.upgrade: mask ldap group sync during upgrades [cookbooks] - 10https://gerrit.wikimedia.org/r/1215111 (https://phabricator.wikimedia.org/T411240) [10:20:49] (03CR) 10Gehel: [C:04-1] "Probably a bad idea given that this would open all of druid." [puppet] - 10https://gerrit.wikimedia.org/r/1215059 (https://phabricator.wikimedia.org/T411740) (owner: 10Gehel) [10:21:35] (03PS1) 10Bartosz Wójtowicz: ml-services: Deploy experimental CPU-only revise-tone-task-generator. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215112 (https://phabricator.wikimedia.org/T411758) [10:22:38] (03CR) 10Muehlenhoff: installserver: Add UEFI recipe to future clouddb* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214777 (owner: 10Marostegui) [10:29:28] (03CR) 10Marostegui: installserver: Add UEFI recipe to future clouddb* (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214777 (owner: 10Marostegui) [10:30:45] (03PS1) 10Filippo Giunchedi: hieradata: enable paging for labweb-ssl service and route to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) [10:33:44] (03PS2) 10Marostegui: installserver: Add UEFI recipe to future clouddb* [puppet] - 10https://gerrit.wikimedia.org/r/1214777 [10:34:28] (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1214777 (owner: 10Marostegui) [10:35:31] (03CR) 10Muehlenhoff: [C:03+2] Remove notebook-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215065 (owner: 10Muehlenhoff) [10:35:38] (03CR) 10Marostegui: [C:03+2] installserver: Add UEFI recipe to future clouddb* [puppet] - 10https://gerrit.wikimedia.org/r/1214777 (owner: 10Marostegui) [10:36:21] (03PS1) 10Isabelle Hurbain-Palatin: Activate postprocessing cache on testwiki, test2wiki, officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) [10:37:21] (03CR) 10Btullis: [C:03+1] "Late, but thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1215065 (owner: 10Muehlenhoff) [10:39:47] (03PS2) 10Muehlenhoff: Remove labnet-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215067 [10:41:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:42:13] (03PS3) 10Tchanders: Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618) [10:43:30] (03CR) 10Tiziano Fogli: [C:03+2] Blackbox/check: strengthen suffix matching regex in generated rules [puppet] - 10https://gerrit.wikimedia.org/r/1208365 (https://phabricator.wikimedia.org/T410745) (owner: 10Tiziano Fogli) [10:44:13] (03CR) 10Elukey: [C:03+1] Stop applying the os-installers group on cumin* and cloudcumin* nodes [puppet] - 10https://gerrit.wikimedia.org/r/1215073 (https://phabricator.wikimedia.org/T358361) (owner: 10Muehlenhoff) [10:45:20] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC afternoon backport 
window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618) (owner: 10Tchanders) [10:47:04] (03CR) 10FNegri: "I have never seen this file before, where is it parsed? I see that the "team:" annotation is not used for any other service, is it going t" [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [10:49:16] (03CR) 10STran: [C:03+1] Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618) (owner: 10Tchanders) [10:51:13] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11432052 (10fgiunchedi) I took a look at why cloudcephosd1052 still has second nic up, currently: ` 4: ens1f1np1: mtu 9000 qdisc mq stat... [10:51:52] (03PS1) 10Federico Ceratto: clone.py: Accept both hostname and FQDN [cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) [10:54:51] (03CR) 10Marostegui: "Can you test it with some hosts?" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1215116 (https://phabricator.wikimedia.org/T391581) (owner: 10Federico Ceratto) [10:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:56:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:59:13] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook-next: apply [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1100) [11:00:26] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook-next: apply [11:01:37] (03PS1) 10Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning [puppet] - 10https://gerrit.wikimedia.org/r/1215119 [11:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:02:11] (03CR) 10Muehlenhoff: [C:03+2] Remove labnet-users POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215067 (owner: 10Muehlenhoff) [11:02:31] (03PS2) 10Muehlenhoff: Remove eventbus-admins POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215068 [11:04:28] (03CR) 10Muehlenhoff: [C:03+2] Remove eventbus-admins POSIX group [puppet] - 
10https://gerrit.wikimedia.org/r/1215068 (owner: 10Muehlenhoff) [11:04:46] (03PS2) 10Muehlenhoff: Remove gpu-testers POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215069 [11:05:59] (03CR) 10Muehlenhoff: [C:03+2] Remove gpu-testers POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1215069 (owner: 10Muehlenhoff) [11:06:09] (03CR) 10Arnaudb: [C:03+1] "lgtm, small question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1215111 (https://phabricator.wikimedia.org/T411240) (owner: 10Jelto) [11:07:02] (03CR) 10Muehlenhoff: [C:03+2] Stop applying the os-installers group on cumin* and cloudcumin* nodes [puppet] - 10https://gerrit.wikimedia.org/r/1215073 (https://phabricator.wikimedia.org/T358361) (owner: 10Muehlenhoff) [11:10:00] (03CR) 10Arnaudb: [C:03+1] "lgtm" [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [11:10:50] (03CR) 10Jelto: sre.gitlab.upgrade: mask ldap group sync during upgrades (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1215111 (https://phabricator.wikimedia.org/T411240) (owner: 10Jelto) [11:14:09] (03CR) 10Filippo Giunchedi: "service::catalog is used in various bits of the infra to configure e.g. the load balancers and alerting. It is going to work as per https:" [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [11:14:49] (03CR) 10Isabelle Hurbain-Palatin: [C:04-2] "OR: let's not do that just yet, I think there's a bug in the previous patch." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [11:17:41] (03PS1) 10Hashar: Add banner for the 2025 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215120 [11:20:25] (03CR) 10Majavah: [C:03+1] "lgtm, but see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [11:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:21:44] !log rebuild software RAIDs on T410743 [11:21:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:49] T410743: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743 [11:24:58] (03CR) 10Elukey: [C:03+1] Drop use of MW_APPSERVER_NETWORKS for ircstream now that mw* servers are gone [puppet] - 10https://gerrit.wikimedia.org/r/1214094 (https://phabricator.wikimedia.org/T411508) (owner: 10Muehlenhoff) [11:26:26] (03CR) 10Filippo Giunchedi: hieradata: enable paging for labweb-ssl service and route to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [11:26:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:27:15] (03PS1) 10FNegri: P:toolforge:prometheus: scrape mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) [11:27:17] 
FIRING: [21x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:27:31] 06SRE, 10observability: thanos-store OOMing on titan eqiad - https://phabricator.wikimedia.org/T411343#11432122 (10hnowlan) I think the worst of this trend has been reversed by the revert of setting cutoff days to 1: https://grafana.wikimedia.org/goto/rwrkdsWDg?orgId=1 {F70853481} [11:29:30] (03PS2) 10Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning [puppet] - 10https://gerrit.wikimedia.org/r/1215119 [11:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:30:47] (03CR) 10Clément Goubert: "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214633 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler) [11:30:55] (03CR) 10Majavah: [C:03+1] hieradata: enable paging for labweb-ssl service and route to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [11:31:31] !log installing net-snmp security updates [11:31:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:32:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:32:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:33:35] (03Abandoned) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215089 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [11:37:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:37:17] FIRING: [23x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:37:22] (03CR) 10Brouberol: [C:03+2] Update documentation for rdf_functions.sh path in dumpwikibaserdf.sh [dumps] - 10https://gerrit.wikimedia.org/r/1204598 (https://phabricator.wikimedia.org/T408800) (owner: 10Itamar Givon) [11:42:02] FIRING: [13x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:42:17] FIRING: [26x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:45:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [11:47:17] FIRING: [26x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:50:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [11:52:02] FIRING: [13x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:52:41] (03CR) 10Filippo Giunchedi: hieradata: enable paging for labweb-ssl service and route to wmcs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [11:53:44] (03CR) 10Filippo Giunchedi: [C:03+2] hieradata: enable paging for labweb-ssl service and route to wmcs [puppet] - 10https://gerrit.wikimedia.org/r/1215114 (https://phabricator.wikimedia.org/T411470) (owner: 10Filippo Giunchedi) [11:56:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [11:57:02] FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [11:57:17] FIRING: [24x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:56] 06SRE, 06DC-Ops: Remove second network connection for cloudcephosd hosts with single uplink - https://phabricator.wikimedia.org/T410989#11432247 (10cmooney) 05Resolved→03Open Thanks @VRiley-WMF. I'm gonna re-open this as we still have to deal with cloudcephosd1052. [11:58:56] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#11432253 (10cmooney) >>! In T399180#11432052, @fgiunchedi wrote: > I think the easiest would be to: > > * Remove the spurious `enp13s0f1np1` config, run puppet to verify... 
[12:00:01] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:01:38] (03CR) 10Jelto: [C:03+2] sre.gitlab.upgrade: mask ldap group sync during upgrades [cookbooks] - 10https://gerrit.wikimedia.org/r/1215111 (https://phabricator.wikimedia.org/T411240) (owner: 10Jelto) [12:01:43] RESOLVED: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:02:02] FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:02:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:02:13] 06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan) - https://phabricator.wikimedia.org/T411365#11432264 (10hnowlan) This is resolved, thank you! 
[12:02:17] FIRING: [23x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:02:25] 06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan) - https://phabricator.wikimedia.org/T411365#11432265 (10hnowlan) 05Open→03Resolved a:03andrea.denisse [12:07:26] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: mask ldap group sync during upgrades [cookbooks] - 10https://gerrit.wikimedia.org/r/1215111 (https://phabricator.wikimedia.org/T411240) (owner: 10Jelto) [12:09:12] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:02] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.073 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.165 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:12:02] FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:15:01] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:16:13] (03CR) 10Lucas Werkmeister (WMDE): "The config change looks good to me, but IIUC Product should confirm that we’re ready for deployment before this is 
deployed." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214986 (https://phabricator.wikimedia.org/T403015) (owner: 10Arthur taylor) [12:17:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:56] (03PS1) 10Vgutierrez: admin: Add backup FIDO key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/1215134 [12:22:17] FIRING: [23x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:22:29] (03CR) 10Muehlenhoff: [C:03+1] releases: delete now pointless classes for deprecated user groups [puppet] - 10https://gerrit.wikimedia.org/r/1214612 (owner: 10Dzahn) [12:23:31] (03PS2) 10Vgutierrez: admin: Add backup FIDO key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/1215134 [12:30:46] (03PS6) 10Slyngshede: C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) [12:30:51] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [12:31:56] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1215134 (owner: 10Vgutierrez) [12:39:20] (03PS1) 10Gehel: query_service: only alert when individual servers are down for > 2h [puppet] - 10https://gerrit.wikimedia.org/r/1215144 (https://phabricator.wikimedia.org/T411772) [12:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - 
https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:40:19] (03CR) 10Klausman: [C:03+1] Move ml-serve1013 to a ML k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1215088 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [12:42:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:43:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:45:36] !log installing postgresql-15 security updates [12:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:54] (03PS7) 10Slyngshede: C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) [12:48:06] (03CR) 10CI reject: [V:04-1] C:mtail update trafficserver_backend_requests_seconds [puppet] - 10https://gerrit.wikimedia.org/r/1214531 (https://phabricator.wikimedia.org/T411584) (owner: 10Slyngshede) [12:50:01] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [12:51:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [12:52:14] (03PS7) 10Dpogorzelski: ml-build: define new machine name/type [puppet] - 10https://gerrit.wikimedia.org/r/1213972 (https://phabricator.wikimedia.org/T394778) [12:53:43] 
RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:57:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:42] 06SRE, 10SRE-Access-Requests: Requesting access to ops-limited for JavierMonton - https://phabricator.wikimedia.org/T411774 (10JMonton-WMF) 03NEW [13:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1300) [13:00:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:02:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:02:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:02:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:05:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [13:07:42] !log installing waitress 
security updates [13:07:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:28] (03CR) 10Daniel Kinzler: [C:03+2] rest gateway: use the new x-trusted-request header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214633 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler) [13:12:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:13:10] (03Merged) 10jenkins-bot: rest gateway: use the new x-trusted-request header [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214633 (https://phabricator.wikimedia.org/T410379) (owner: 10Daniel Kinzler) [13:13:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:14:30] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:14:30] (03PS1) 10JavierMonton: topic: ops-limited access [puppet] - 10https://gerrit.wikimedia.org/r/1215152 (https://phabricator.wikimedia.org/T411774) [13:15:12] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:15:19] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:15:54] !log daniel@deploy2002 helmfile [staging] START helmfile.d/services/rest-gateway: apply [13:15:55] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops-limited for JavierMonton - https://phabricator.wikimedia.org/T411774#11432492 (10JMonton-WMF) In case this is approved, I created the patch I believe is needed, to help with the process. https://gerrit.wikimedia.org/r/c/operations/pu... [13:16:34] !log daniel@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [13:17:02] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.082 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:17:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.195 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:18:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:18:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:19:01] !log daniel@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [13:19:25] !log daniel@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [13:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:21:45] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11432505 (10Jclark-ctr) Memory should be delivered today. Is this server still in service or can it be replaced any time? [13:21:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:22:13] !log daniel@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [13:22:36] !log daniel@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [13:24:00] (03CR) 10Vgutierrez: [C:03+2] admin: Add backup FIDO key for vgutierrez [puppet] - 10https://gerrit.wikimedia.org/r/1215134 (owner: 10Vgutierrez) [13:25:39] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T410589)', diff saved to https://phabricator.wikimedia.org/P86402 and previous config saved to /var/cache/conftool/dbconfig/20251204-132539-ladsgroup.json [13:25:43] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [13:27:02] FIRING: [11x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:28:54] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops-limited for JavierMonton - 
https://phabricator.wikimedia.org/T411774#11432534 (10MoritzMuehlenhoff) ops-limited is very broad access, it grants access to any of our 2400 server, including some very sensitive ones. But if this access i... [13:29:08] (03CR) 10Majavah: [C:03+1] P:toolforge:prometheus: scrape mariadb metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) (owner: 10FNegri) [13:31:49] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11432550 (10MoritzMuehlenhoff) It's depooled and monitoring disabled, you can replace any time [13:32:02] FIRING: [13x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:32:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:33:47] (03PS1) 10Clément Goubert: mediawiki: Keep cronjobs for a week after completion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215155 [13:35:11] FIRING: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [13:35:36] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to ops-limited for JavierMonton - https://phabricator.wikimedia.org/T411774#11432568 (10BTullis) >>! 
In T411774#11432534, @MoritzMuehlenhoff wrote: > ops-limited is very broad access, it grants access to any of our 2400 server, including som... [13:37:02] FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:37:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:39:04] (03PS1) 10Btullis: Add a growthbook system user and grant it access to private data [puppet] - 10https://gerrit.wikimedia.org/r/1215156 (https://phabricator.wikimedia.org/T406593) [13:40:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P86403 and previous config saved to /var/cache/conftool/dbconfig/20251204-134046-ladsgroup.json [13:41:17] (03PS1) 10Muehlenhoff: Create a new access group for access to Jumbo Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/1215157 (https://phabricator.wikimedia.org/T411774) [13:41:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:42:02] FIRING: [14x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:44:39] (03PS2) 10Muehlenhoff: Create a new access group for access to Jumbo Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/1215157 (https://phabricator.wikimedia.org/T411774) [13:45:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215157 (https://phabricator.wikimedia.org/T411774) (owner: 10Muehlenhoff) [13:48:38] (03CR) 10Bearloga: [C:03+1] Add a growthbook system user and grant it access to private data [puppet] - 10https://gerrit.wikimedia.org/r/1215156 (https://phabricator.wikimedia.org/T406593) (owner: 10Btullis) [13:48:45] (03PS2) 10FNegri: P:toolforge:prometheus: scrape mariadb metrics [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) [13:49:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:51:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [13:52:02] FIRING: [13x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:52:32] (03CR) 10FNegri: P:toolforge:prometheus: scrape mariadb metrics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215121 (https://phabricator.wikimedia.org/T410505) (owner: 10FNegri) [13:53:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:55:28] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack link to asw2-c2-eqiad xe-2/0/13 - https://phabricator.wikimedia.org/T411781 (10cmooney) 03NEW p:05Triage→03Medium [13:55:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P86404 and previous config saved to /var/cache/conftool/dbconfig/20251204-135554-ladsgroup.json [13:56:05] (03PS1) 10Elukey: service.py: add the team field in the Service's definition [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) [13:57:47] (03CR) 10Elukey: [C:03+2] Move ml-serve1013 to a ML k8s worker [puppet] - 10https://gerrit.wikimedia.org/r/1215088 (https://phabricator.wikimedia.org/T403697) (owner: 10Elukey) [14:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1400). [14:00:05] Kizule and Tchanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:14] o/ [14:00:16] o/ [14:00:30] I need to run afk, can someone else deploy? 
otherwise I should be able to later in the window [14:01:02] (03PS1) 10Jforrester: RevisionStore: Catch ParameterAssertionException too [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215164 (https://phabricator.wikimedia.org/T351953) [14:01:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:02:03] I'll get started on mine... [14:02:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:02:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by tchanders@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618) (owner: 10Tchanders) [14:03:13] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack link to asw2-c2-eqiad xe-2/0/13 - https://phabricator.wikimedia.org/T411781#11432684 (10Vgutierrez) the assessment is OK and the link can be removed safely [14:03:22] (03Merged) 10jenkins-bot: Enable temporary accounts on enwikinews and ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214489 (https://phabricator.wikimedia.org/T411618) (owner: 10Tchanders) [14:03:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:03:42] !log tchanders@deploy2002 Started scap sync-world: 
Backport for [[gerrit:1214489|Enable temporary accounts on enwikinews and ptwikibooks (T411618)]] [14:03:45] T411618: Deploy Temporary accounts to the two remaining former LQT wikis - https://phabricator.wikimedia.org/T411618 [14:04:06] (03CR) 10CI reject: [V:04-1] service.py: add the team field in the Service's definition [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [14:05:28] (03CR) 10Bking: [C:03+1] query_service: only alert when individual servers are down for > 2h [puppet] - 10https://gerrit.wikimedia.org/r/1215144 (https://phabricator.wikimedia.org/T411772) (owner: 10Gehel) [14:05:53] (03PS1) 10D3r1ck01: Revert "User: Log where the data was loaded when CAS update failed" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215165 (https://phabricator.wikimedia.org/T410652) [14:06:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11432701 (10Jclark-ctr) [14:06:10] (03PS1) 10D3r1ck01: Revert "User: Log where the data was loaded when CAS update failed" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1215166 (https://phabricator.wikimedia.org/T410652) [14:06:13] !log tchanders@deploy2002 tchanders: Backport for [[gerrit:1214489|Enable temporary accounts on enwikinews and ptwikibooks (T411618)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:06:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215165 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:06:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#depl" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1215166 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:07:02] testing... [14:07:10] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 8.261 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:10] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 8.369 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:14] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:07:29] o/ [14:07:53] Looks good, continuing.. [14:08:03] Tchanders, please poke me when you're done. I have some backports to deploy. [14:08:06] !log tchanders@deploy2002 tchanders: Continuing with sync [14:08:19] OK, will do! 
[14:08:28] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:09:19] (03PS1) 10D3r1ck01: Fetch user object from primary DB (for writes) not replica DB [extensions/EmailAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215167 (https://phabricator.wikimedia.org/T410652) [14:09:34] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661#11432726 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate to T411781 [14:09:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: lvs1018: decom links to asw2-c2-eqiad and asw2-d7-eqiad - https://phabricator.wikimedia.org/T410661#11432732 (10cmooney) 05Resolved→03Declined Duplicate task made in error, will use T411781 [14:10:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack link to asw2-c2-eqiad xe-2/0/13 - https://phabricator.wikimedia.org/T411781#11432739 (10Jclark-ctr) https://netbox.wikimedia.org/dcim/interfaces/29150/trace/ https://netbox.wikimedia.org/dcim/interfaces/29151... 
[14:11:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T410589)', diff saved to https://phabricator.wikimedia.org/P86405 and previous config saved to /var/cache/conftool/dbconfig/20251204-141101-ladsgroup.json [14:11:05] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [14:11:13] I can take over once you're done [14:11:17] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton - https://phabricator.wikimedia.org/T411774#11432740 (10BTullis) [14:11:17] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:11:25] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T410589)', diff saved to https://phabricator.wikimedia.org/P86406 and previous config saved to /var/cache/conftool/dbconfig/20251204-141124-ladsgroup.json [14:11:50] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:13:04] Amir1, ack! Will ping you. 
[14:13:28] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:14:08] * Lucas_WMDE back [14:14:18] !log tchanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1214489|Enable temporary accounts on enwikinews and ptwikibooks (T411618)]] (duration: 10m 36s) [14:14:21] T411618: Deploy Temporary accounts to the two remaining former LQT wikis - https://phabricator.wikimedia.org/T411618 [14:14:36] I'm finished - over to you @Amir1 [14:15:13] Oh, seems Amir1 wants to go first? [14:15:26] Lucas_WMDE o/ [14:16:12] IIUC Amir1 was volunteering to do the backport for Kizule? [14:16:37] who isn’t around yet, so I’d say xSavitar go ahead with your deployments if nobody objects [14:17:07] * Lucas_WMDE watches logspam-watch be unusually slow to load o_O [14:17:16] !log gehel@cumin2002 conftool action : set/pooled=yes; selector: service=druid-public-coordinator [14:17:30] !log gehel@cumin2002 conftool action : set/weight=10; selector: service=druid-public-coordinator [14:17:31] Okay, I'll go ahead, thanks! 
[14:17:47] oh dear, 90k errors in logstash [14:17:48] twice [14:17:50] in the past 15 minutes [14:18:09] and logspam-watch watches 1h by default, so it would see… 720k messages [14:18:13] yeah makes sense that that’s slow [14:18:14] * Lucas_WMDE searches phab [14:18:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215165 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:18:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1215166 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:18:57] (03CR) 10TrainBranchBot: [C:03+2] "Approved by derick@deploy2002 using scap backport" [extensions/EmailAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215167 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:19:50] seems like T411585? [14:19:50] T411585: PHP Warning: Stats: (action_api_modules_hit_total): Stats: (action_api_modules_hit_total) Cannot associate label keys with label values - Not all initialized labels have an assigned value. 
- https://phabricator.wikimedia.org/T411585 [14:20:01] RESOLVED: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [14:20:31] taavi: yup, left a comment there [14:20:47] oh, now logspam-watch loaded [14:21:00] not that it’ll be very useful if it only refreshes once every five minutes or so [14:21:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:23:20] (03Merged) 10jenkins-bot: Revert "User: Log where the data was loaded when CAS update failed" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215165 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:23:25] (03Merged) 10jenkins-bot: Revert "User: Log where the data was loaded when CAS update failed" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1215166 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:23:28] (03Merged) 10jenkins-bot: Fetch user object from primary DB (for writes) not replica DB [extensions/EmailAuth] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215167 (https://phabricator.wikimedia.org/T410652) (owner: 10D3r1ck01) [14:23:50] !log derick@deploy2002 Started scap sync-world: Backport for [[gerrit:1215165|Revert "User: Log where the data was loaded when CAS update failed" (T410652)]], [[gerrit:1215166|Revert "User: Log where the data was loaded when CAS update failed" (T410652)]], [[gerrit:1215167|Fetch user object from primary DB (for writes) not replica DB (T410652)]] [14:23:54] T410652: RuntimeException: CAS update failed on user_touched. 
The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652 [14:26:07] !log derick@deploy2002 d3r1ck01, derick: Backport for [[gerrit:1215165|Revert "User: Log where the data was loaded when CAS update failed" (T410652)]], [[gerrit:1215166|Revert "User: Log where the data was loaded when CAS update failed" (T410652)]], [[gerrit:1215167|Fetch user object from primary DB (for writes) not replica DB (T410652)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes [14:26:07] can now be verified there. [14:27:16] Nothing to test for now. Verifying it works means errors no longer show up in Logstash. [14:27:27] !log derick@deploy2002 d3r1ck01, derick: Continuing with sync [14:27:58] good luck looking for errors in logstash at the moment :S [14:28:21] :( [14:29:12] https://logstash.wikimedia.org/goto/71373cec585746d63094e22d911053b4 [14:29:20] That's what I'm eyeing [14:32:11] Facing an issue [14:32:16] * Lucas_WMDE nods [14:32:18] oh? [14:32:20] 14:29:10 Waiting 20 seconds for canary traffic... [14:32:20] 14:29:31 Logstash checker Counted 162 error(s) in the last 20 seconds. The threshold is 10. [14:32:20] 14:29:31 Top 3 errors: [14:32:20] [81 hits] PHP Warning: Stats: (action_api_modules_latency): Stats: (action_api_modules_latency) Cannot associate label keys with label values - Not all initialized labels have an assigned value. [14:32:20] [80 hits] PHP Warning: Stats: (action_api_modules_hit_total): Stats: (action_api_modules_hit_total) Cannot associate label keys with label values - Not all initialized labels have an assigned value. 
[14:32:20] [1 hits] Wikimedia\RequestTimeout\RequestTimeoutException: The maximum execution time of {limit} seconds was exceeded [14:32:34] hm [14:32:40] I would suggest retrying once [14:32:41] (03CR) 10Gehel: [C:03+1] Add Guillaume as appprover for analytics-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/1212061 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [14:32:51] Okay [14:32:54] the two errors with 80/81 hits are T411585, okay to ignore [14:32:55] T411585: PHP Warning: Stats: (action_api_modules_hit_total): Stats: (action_api_modules_hit_total) Cannot associate label keys with label values - Not all initialized labels have an assigned value. - https://phabricator.wikimedia.org/T411585 [14:33:00] the 1 hit timeout is a bit more concerning [14:33:23] (03CR) 10Gehel: [C:03+1] Add Guillaume as approver for two more analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/1212168 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [14:33:24] I think now it’s okay to proceed [14:33:33] but I’ll leave a comment on the task that it’s soft-blocking deployments [14:33:41] Okay, thanks! 
[14:33:46] (03CR) 10Gehel: [C:03+2] Hive: alert when query rate is too high (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1207790 (https://phabricator.wikimedia.org/T410528) (owner: 10Gehel) [14:34:04] (03PS2) 10Gehel: query_service: only alert when individual servers are down for > 2h [puppet] - 10https://gerrit.wikimedia.org/r/1215144 (https://phabricator.wikimedia.org/T411772) [14:35:56] (03CR) 10Gehel: [C:03+2] query_service: only alert when individual servers are down for > 2h [puppet] - 10https://gerrit.wikimedia.org/r/1215144 (https://phabricator.wikimedia.org/T411772) (owner: 10Gehel) [14:36:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:37:14] !log derick@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215165|Revert "User: Log where the data was loaded when CAS update failed" (T410652)]], [[gerrit:1215166|Revert "User: Log where the data was loaded when CAS update failed" (T410652)]], [[gerrit:1215167|Fetch user object from primary DB (for writes) not replica DB (T410652)]] (duration: 13m 24s) [14:37:18] T410652: RuntimeException: CAS update failed on user_touched. The version of the user to be saved is older than the current version. - https://phabricator.wikimedia.org/T410652 [14:38:05] Amir1, over to you if you want to deploy. Not sure if Kizule is around though. [14:38:12] Lucas_WMDE, thanks for the assistance. [14:38:15] np [14:38:19] thanks [14:38:23] yeah they don’t seem to be around AFAICT :/ [14:39:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215164 (https://phabricator.wikimedia.org/T351953) (owner: 10Jforrester) [14:40:24] it’s a SpiderPig! 
\o/ [14:40:40] :D [14:41:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:44:09] (03CR) 10Clément Goubert: [C:03+2] mediawiki: Keep cronjobs for a week after completion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215155 (owner: 10Clément Goubert) [14:45:35] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11432921 (10Jclark-ctr) 05Open→03Resolved a:05cmooney→03Jclark-ctr [14:47:05] (03Merged) 10jenkins-bot: mediawiki: Keep cronjobs for a week after completion [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215155 (owner: 10Clément Goubert) [14:49:36] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-cron: apply [14:50:46] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [14:52:16] (03Merged) 10jenkins-bot: RevisionStore: Catch ParameterAssertionException too [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215164 (https://phabricator.wikimedia.org/T351953) (owner: 10Jforrester) [14:52:35] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1215164|RevisionStore: Catch ParameterAssertionException too (T351953)]] [14:52:39] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [14:53:21] (03CR) 10Clément Goubert: [C:03+2] wikikube-staging: Bump calico memory requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 (owner: 10Clément Goubert) [14:53:41] (03CR) 10Michael Große: [C:03+1] "The list of wikis where this is being enabled matches the list in 
T410469" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214570 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [14:53:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:39] !log ladsgroup@deploy2002 jforrester, ladsgroup: Backport for [[gerrit:1215164|RevisionStore: Catch ParameterAssertionException too (T351953)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:55:04] !log ladsgroup@deploy2002 jforrester, ladsgroup: Continuing with sync [14:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [14:58:40] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'. [14:59:24] !log cgoubert@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:59:35] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] START helmfile.d/admin 'apply'. [14:59:54] !log cgoubert@deploy2002 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'apply'. [15:01:00] (03Merged) 10jenkins-bot: wikikube-staging: Bump calico memory requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/1214125 (owner: 10Clément Goubert) [15:01:11] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. 
[15:01:43] (03CR) 10Michael Große: [C:03+1] [Growth] Sort the list of Add Link wikis alphabetically [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214571 (https://phabricator.wikimedia.org/T410469) (owner: 10Urbanecm) [15:01:57] (03CR) 10Muehlenhoff: [C:03+2] Add Guillaume as appprover for analytics-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/1212061 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [15:02:01] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215164|RevisionStore: Catch ParameterAssertionException too (T351953)]] (duration: 09m 26s) [15:02:05] T351953: Various old revisions are encoded as Windows-1252 rather than UTF-8, causing "RuntimeException: PCRE failure" when viewing them - https://phabricator.wikimedia.org/T351953 [15:02:12] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:02:16] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11433019 (10cmooney) [15:02:32] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:02:42] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11433025 (10cmooney) [15:02:43] 06SRE, 06Infrastructure-Foundations, 10netops, 06Traffic: Eqiad row C/D switch refresh: LVS changes to support migration - https://phabricator.wikimedia.org/T405602#11433026 (10cmooney) [15:03:06] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:03:12] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11433027 (10MoritzMuehlenhoff) [15:03:32] !log cgoubert@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. 
[15:03:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Not deployed today because nobody showed up for the window, but the change looks good to me and should be okay to deploy some other time." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215060 (https://phabricator.wikimedia.org/T411750) (owner: 10Zoranzoki21) [15:03:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:03:59] !log UTC afternoon backport+config window done [15:04:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:56] !log cgoubert@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [15:06:23] !log cgoubert@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [15:06:45] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:06:50] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:08:08] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11433060 (10cmooney) [15:08:14] !log cgoubert@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[15:09:39] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host conf1007.eqiad.wmnet [15:10:00] RESOLVED: [8x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:10:01] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:08] (03CR) 10Muehlenhoff: [C:03+2] Switch conf1007 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214557 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:11:41] (03PS1) 10Slyngshede: P:idm configuration for Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) [15:14:53] (03CR) 10Slyngshede: [C:03+1] Create a new access group for access to Jumbo Kafka brokers [puppet] - 10https://gerrit.wikimedia.org/r/1215157 (https://phabricator.wikimedia.org/T411774) (owner: 10Muehlenhoff) [15:15:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf1007.eqiad.wmnet [15:16:28] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7790/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) (owner: 10Slyngshede) [15:16:56] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: ULSFO: New switch configuration - https://phabricator.wikimedia.org/T408892#11433083 (10cmooney) >>! In T408892#11330727, @cmooney wrote: > Additionally for the rebuild we should aim to: > > # Convert the existing ganeti hosts to rout... 
[15:20:29] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host conf1008.eqiad.wmnet [15:20:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:21:51] (03CR) 10Muehlenhoff: [C:03+2] Switch conf1008 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214558 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:21:52] (03PS26) 10Arnaudb: gerrit: rsync logic extraction from failover [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) [15:21:52] (03CR) 10Arnaudb: "The output of:" [cookbooks] - 10https://gerrit.wikimedia.org/r/1214466 (https://phabricator.wikimedia.org/T387833) (owner: 10Arnaudb) [15:22:10] (03PS2) 10Slyngshede: P:idm configuration for Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) [15:22:56] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7791/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) (owner: 10Slyngshede) [15:24:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2022:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:24:52] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton - https://phabricator.wikimedia.org/T411774#11433108 (10elukey) @JMonton-WMF Hi! 
I have used the kafka tools like topic mapper in the past and if not handle... [15:26:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf1008.eqiad.wmnet [15:27:44] (03PS3) 10Slyngshede: P:idm configuration for Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) [15:28:34] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7792/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) (owner: 10Slyngshede) [15:28:46] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host conf1009.eqiad.wmnet [15:29:11] (03CR) 10Muehlenhoff: [C:03+2] Switch conf1009 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214561 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1530) [15:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:30:55] !log bking@cumin2002 START - Cookbook sre.k8s.pool-depool-node depool for host dse-k8s-worker2003.codfw.wmnet [15:32:34] (03CR) 10Scott French: "Thank you for catching this, Valentin!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1215119 (owner: 10Vgutierrez) [15:33:12] (03PS4) 10Slyngshede: P:idm configuration for Phabricator linking [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) [15:33:29] (03PS2) 10Volans: service.py: add the team field in the Service's definition [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [15:33:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf1009.eqiad.wmnet [15:33:54] (03CR) 10Slyngshede: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7794/co" [puppet] - 10https://gerrit.wikimedia.org/r/1215186 (https://phabricator.wikimedia.org/T411775) (owner: 10Slyngshede) [15:34:27] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:35:01] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:35:58] !log bking@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) depool for host dse-k8s-worker2003.codfw.wmnet [15:36:11] !log jmm@cumin2002 START - Cookbook 
sre.puppet.migrate-host for host conf2004.codfw.wmnet [15:36:44] (03CR) 10Muehlenhoff: [C:03+2] Switch conf2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214553 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:37:28] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:38:11] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:38:16] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [15:39:16] (03CR) 10Thcipriani: [C:04-1] "2024 -> 2025" [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215120 (owner: 10Hashar) [15:41:01] (03PS2) 10Hashar: Add banner for the 2025 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215120 [15:41:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf2004.codfw.wmnet [15:41:28] (03PS3) 10Volans: service.py: add the team field in the Service's definition [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [15:41:30] (03CR) 10Hashar: [C:03+2] Add banner for the 2025 developer survey (031 comment) [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215120 (owner: 10Hashar) [15:41:50] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:42:06] (03PS4) 10Volans: service.py: add the team field in the Service's definition [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 
(https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [15:42:13] (03Merged) 10jenkins-bot: Add banner for the 2025 developer survey [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215120 (owner: 10Hashar) [15:42:20] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting a new group allowing shell access to kafka-jumbo servers - with membership for JavierMonton - https://phabricator.wikimedia.org/T411774#11433192 (10JMonton-WMF) Hi @elukey! We don't need this often to be honest, maybe it's more about being able to... [15:42:27] RESOLVED: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:43:01] (03CR) 10Volans: "@elukey, I took the liberty to mangle a bit the patch, it should pass CI, as for the default value I think empty string is fine to represe" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [15:43:13] !log hashar@deploy2002 Started deploy [gerrit/gerrit@774e2ff]: Ease configuration of the motd banner && Add banner for the 2025 developer survey [15:43:28] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@774e2ff]: Ease configuration of the motd banner && Add banner for the 2025 developer survey (duration: 00m 15s) [15:44:35] (03CR) 10Elukey: "Really nice thanks, I was checking the CI failures at the moment :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [15:44:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host conf2005.codfw.wmnet [15:45:17] !log bking@cumin2002 START - Cookbook sre.k8s.pool-depool-node pool for host dse-k8s-worker2003.codfw.wmnet [15:45:20] !log bking@cumin2002 END (PASS) - Cookbook sre.k8s.pool-depool-node (exit_code=0) pool 
for host dse-k8s-worker2003.codfw.wmnet [15:45:27] FIRING: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:45:34] (03CR) 10Muehlenhoff: [C:03+2] Switch conf2005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1214550 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [15:46:50] FIRING: [2x] KubernetesCalicoDown: dse-k8s-worker2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:47:17] FIRING: [22x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:48:55] (03PS1) 10Hashar: Remove duplicate [DISMISS] button [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215197 [15:49:13] (03CR) 10CI reject: [V:04-1] Remove duplicate [DISMISS] button [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215197 (owner: 10Hashar) [15:50:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host conf2005.codfw.wmnet [15:50:27] FIRING: [2x] SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:50:33] !log dpogorzelski@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on ml-lab1001.eqiad.wmnet with reason: decomission [15:50:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:51:29] !log dpogorzelski@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4 days, 0:00:00 on ml-lab1001.eqiad.wmnet with reason: decomission [15:52:17] RESOLVED: [2x] ProbeDown: Service wdqs2022:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2022:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:46] (03PS2) 10Hashar: Remove duplicate [DISMISS] button [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215197 [15:53:06] (03CR) 10Hashar: [C:03+2] Remove duplicate [DISMISS] button [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215197 (owner: 10Hashar) [15:53:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:55] (03Merged) 10jenkins-bot: Remove duplicate [DISMISS] button [software/gerrit] (deploy/wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1215197 (owner: 10Hashar) [15:54:16] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [15:55:01] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [15:55:25] !log hashar@deploy2002 Started deploy [gerrit/gerrit@121bd1c]: 
Remove duplicate [DISMISS] button [15:55:27] RESOLVED: SystemdUnitCrashLoop: prometheus-blazegraph-exporter-wdqs-blazegraph.service crashloop on wdqs2021:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [15:55:37] !log hashar@deploy2002 Finished deploy [gerrit/gerrit@121bd1c]: Remove duplicate [DISMISS] button (duration: 00m 11s) [16:00:05] hashar and jnuche: Deploy window Train log triage (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1600) [16:02:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:05:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:07:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:10:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:10:57] (03PS1) 10Gehel: query_service: relax alerting on WDQS lag [alerts] - 10https://gerrit.wikimedia.org/r/1215201 
(https://phabricator.wikimedia.org/T411772) [16:12:10] (03CR) 10CI reject: [V:04-1] query_service: relax alerting on WDQS lag [alerts] - 10https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772) (owner: 10Gehel) [16:12:57] (03CR) 10Elukey: [C:03+2] service.py: add the team field in the Service's definition [software/spicerack] - 10https://gerrit.wikimedia.org/r/1215162 (https://phabricator.wikimedia.org/T399807) (owner: 10Elukey) [16:13:37] (03PS1) 10Muehlenhoff: Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) [16:14:02] (03PS2) 10Gehel: query_service: relax alerting on WDQS lag [alerts] - 10https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772) [16:15:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:15:43] (03CR) 10CI reject: [V:04-1] query_service: relax alerting on WDQS lag [alerts] - 10https://gerrit.wikimedia.org/r/1215201 (https://phabricator.wikimedia.org/T411772) (owner: 10Gehel) [16:15:51] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1213588 (owner: 10Jasmine) [16:17:02] RESOLVED: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:18:37] (03CR) 10JHathaway: [C:03+1] Remove puppetmaster1003 from active Puppet 5 servers [puppet] - 10https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) 
(owner: 10Muehlenhoff) [16:31:43] (03CR) 10Elukey: [C:03+1] "Can we also remove puppetmaster2002 ?" [puppet] - 10https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:33:58] (03CR) 10Muehlenhoff: "Sure, but that's for a separate patch, I'll be coupling these patches with running the decom script." [puppet] - 10https://gerrit.wikimedia.org/r/1215202 (https://phabricator.wikimedia.org/T365798) (owner: 10Muehlenhoff) [16:40:46] (03PS3) 10Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning [puppet] - 10https://gerrit.wikimedia.org/r/1215119 [16:41:27] (03CR) 10Vgutierrez: cache::haproxy: Get rid of http-request after use_backend warning (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1215119 (owner: 10Vgutierrez) [16:42:31] (03CR) 10CDanis: [C:03+2] Filter another client adding noise [puppet] - 10https://gerrit.wikimedia.org/r/1214759 (owner: 10Jdlrobson) [16:47:28] (03PS2) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) [16:48:22] (03PS3) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) [16:49:22] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:49:57] (03CR) 10CI reject: [V:04-1] services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [16:51:55] (03PS3) 10Cathal Mooney: lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628) [16:52:21] (03PS4) 10Elukey: services: add 
maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) [16:53:52] (03CR) 10CI reject: [V:04-1] services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [16:57:12] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:57:12] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:58:45] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11433516 (10herron) [16:59:15] 10SRE-SLO: Sloth: onboard subset of existing SLOs to pilot - https://phabricator.wikimedia.org/T409310#11433517 (10herron) onboarded wikifunctions today as well with config: ` # This example shows a simple service level by implementing a single SLO without alerts. # It disables page (critical) and ticket (warni... [17:00:05] jhathaway and rzl: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1700). [17:00:05] No Gerrit patches in the queue for this window AFAICS. [17:03:48] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11433548 (10Jclark-ctr) Replaced the failed DIMM @MoritzMuehlenhoff. I swapped A7 and B7 after replacing B7 so it’s easier to troubleshoot later if the issue comes back. 
[17:05:33] (03PS5) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) [17:05:56] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1229 crashed - Broken memory module at B7 - https://phabricator.wikimedia.org/T411652#11433551 (10Jclark-ctr) 05Open→03Resolved [17:06:00] !log brett@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: move primary uplink from move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - T405628 [17:06:04] T405628: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628 [17:06:05] !log disable BGP to lvs1019 on eqiad coure routers ahead of switch migration T405628 [17:06:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433554 (10BCornwall) [17:07:02] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:02] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9234 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:08] (03CR) 10CI reject: [V:04-1] services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) (owner: 10Elukey) [17:07:12] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:11:04] (03CR) 10BCornwall: [V:03+1 C:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7795/co" [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628) (owner: 10Cathal Mooney) [17:14:48] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433580 (10BCornwall) [17:15:26] (03CR) 10Cathal Mooney: [C:03+2] lvs1019: move row D vlans to primary and add new C/D per-rack vlans [puppet] - 10https://gerrit.wikimedia.org/r/1207891 (https://phabricator.wikimedia.org/T405628) (owner: 10Cathal Mooney) [17:16:40] (03PS6) 10Elukey: services: add maps-next.w.o as FQDN for kartotherian staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215098 (https://phabricator.wikimedia.org/T409528) [17:17:50] !log vriley@cumin1003 START - Cookbook sre.dns.netbox [17:20:12] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:20:28] !log vriley@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:20:53] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) 
on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:21:10] !log vriley@cumin1003 START - Cookbook sre.network.configure-switch-interfaces for host franio1004 [17:21:12] (03PS1) 10Santiago Faci: ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215214 (https://phabricator.wikimedia.org/T407570) [17:21:14] !log vriley@cumin1003 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host franio1004 [17:22:43] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433608 (10cmooney) [17:28:50] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 3 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433621 (10BCornwall) [17:29:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11433626 (10RobH) [17:30:02] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: lvs1018: remove cross-rack links to rows A, C and D - https://phabricator.wikimedia.org/T411781#11433627 (10RobH) [17:30:29] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host lvs1019.eqiad.wmnet with OS bullseye [17:30:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433629 (10ops-monitoring-bot) 
Cookbook cookbooks.sre.hosts.reimage was started by brett@c... [17:31:35] (03CR) 10Clare Ming: [C:03+1] ext.wikimediaEvents: Add xLab impactTest experiment-specific instrument [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215214 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [17:32:03] (03PS2) 10Isabelle Hurbain-Palatin: Activate postprocessing cache on testwiki, test2wiki, officewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) [17:32:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11433638 (10RobH) Day 13 Update: * all hosts in rows C and D migrated ** lvs1018 in row B has links into C and D need removal via T411781 before we can kill... [17:33:07] (03CR) 10Isabelle Hurbain-Palatin: "I'm removing my own -2 because I think it's not CRITICAL to not merge this, BUT I'd really like I806fa84d5d7837b21709ce8997c2b02a8b9548e2 " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [17:38:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, December 09 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [extensions/WikimediaEvents] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215214 (https://phabricator.wikimedia.org/T407570) (owner: 10Santiago Faci) [17:38:51] FIRING: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/32 (Transport: lvs1019:enp94s0f0np0 (Equinix, 21989994) {#20220411}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:43:51] 
RESOLVED: SwitchCoreInterfaceDown: Switch core interface down - ssw1-f1-eqiad:xe-0/0/32 (Transport: lvs1019:enp94s0f0np0 (Equinix, 21989994) {#20220411}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Switch_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=ssw1-f1-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DSwitchCoreInterfaceDown [17:45:06] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs1019.eqiad.wmnet with reason: host reimage [17:45:45] (03CR) 10Brouberol: [C:03+1] Add a growthbook system user and grant it access to private data [puppet] - 10https://gerrit.wikimedia.org/r/1215156 (https://phabricator.wikimedia.org/T406593) (owner: 10Btullis) [17:47:29] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: eqiad: rows C/D Upgrade Tracking - https://phabricator.wikimedia.org/T404609#11433701 (10RobH) [17:47:47] jouncebot: nowandnext [17:47:47] For the next 0 hour(s) and 12 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1700) [17:47:47] In 0 hour(s) and 12 minute(s): Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1800) [17:47:47] In 0 hour(s) and 12 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1800) [17:48:05] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs1019.eqiad.wmnet with reason: host reimage [17:59:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:00:05] bd808: #bothumor My software never has bugs. 
It just develops random features. Rise for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1800). [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T1800) [18:01:54] nothing for my window this week [18:04:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433761 (10Jclark-ctr) [18:05:54] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs1019.eqiad.wmnet with OS bullseye [18:06:01] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433764 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin... 
[18:09:08] !log bking@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:09:12] !log bking@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [18:13:28] (03CR) 10Dzahn: [C:03+2] admin/releases: deprecate shell user group releasers-mwcli [puppet] - 10https://gerrit.wikimedia.org/r/1213587 (owner: 10Dzahn) [18:15:02] (03CR) 10Dzahn: [C:03+2] releases: delete now pointless classes for deprecated user groups [puppet] - 10https://gerrit.wikimedia.org/r/1214612 (owner: 10Dzahn) [18:15:08] (03PS2) 10Dzahn: releases: delete now pointless classes for deprecated user groups [puppet] - 10https://gerrit.wikimedia.org/r/1214612 [18:16:10] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433805 (10cmooney) [18:16:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433810 (10cmooney) [18:17:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433811 (10Jclark-ctr) [18:17:36] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1020: move primary uplink from asw2-d7-eqiad to lsw1-d7-eqiad and remove link to asw2-c2-eqiad - https://phabricator.wikimedia.org/T405609#11433812 (10cmooney) [18:18:46] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433817 (10cmooney) 05Open→03Resolved [18:18:54] 
(03PS1) 10Isabelle Hurbain-Palatin: kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215223 (https://phabricator.wikimedia.org/T383328) [18:18:54] (03CR) 10Dzahn: [C:03+2] releases: delete now pointless classes for deprecated user groups [puppet] - 10https://gerrit.wikimedia.org/r/1214612 (owner: 10Dzahn) [18:21:12] !log brett@cumin2002 START - Cookbook sre.hosts.remove-downtime for lvs1019.eqiad.wmnet [18:21:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for lvs1019.eqiad.wmnet [18:21:59] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: lvs1019: move primary uplink from asw2-c7-eqiad to lsw1-c7-eqiad and remove link to asw2-d2-eqiad - https://phabricator.wikimedia.org/T405628#11433829 (10BCornwall) [18:23:13] (03PS1) 10Jforrester: CdxDialog: use-close-button prop needs to be set to true [extensions/WikiLambda] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215224 (https://phabricator.wikimedia.org/T411655) [18:23:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/WikiLambda] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215224 (https://phabricator.wikimedia.org/T411655) (owner: 10Jforrester) [18:24:20] (03PS6) 10Dzahn: service: add gerrit-https service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) [18:24:24] (03CR) 10Dzahn: service: add gerrit-https service to service catalog (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [18:24:39] (03CR) 10Dzahn: "thank you! 
great" [puppet] - 10https://gerrit.wikimedia.org/r/1211549 (https://phabricator.wikimedia.org/T338470) (owner: 10Arnaudb) [18:25:13] (03CR) 10Dzahn: "cool :)" [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [18:25:30] (03CR) 10Dzahn: [C:03+1] vrts: add high inode usage alert [alerts] - 10https://gerrit.wikimedia.org/r/1214034 (https://phabricator.wikimedia.org/T411452) (owner: 10AOkoth) [18:29:28] (03CR) 10Scott French: [C:03+1] "Thanks, Valentin!" [puppet] - 10https://gerrit.wikimedia.org/r/1215119 (owner: 10Vgutierrez) [18:31:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11433840 (10VRiley-WMF) [18:33:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11433856 (10VRiley-WMF) Set the IP address for iDRAC, enabled IPMI, and user config information. [18:33:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:rack/setup/install franio1004 - https://phabricator.wikimedia.org/T405980#11433857 (10VRiley-WMF) a:05VRiley-WMF→03Jgreen [18:34:32] (03PS1) 10Dzahn: miscweb: add wikipedia25.org to extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) [18:35:26] (03PS2) 10Dzahn: miscweb: add wikipedia25.org to extra SANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) [18:37:09] (03CR) 10Jgiannelos: [C:03+1] kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215223 (https://phabricator.wikimedia.org/T383328) (owner: 10Isabelle Hurbain-Palatin) [18:37:14] (03CR) 10Jgiannelos: [C:03+2] kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215223 (https://phabricator.wikimedia.org/T383328) (owner: 10Isabelle Hurbain-Palatin) 
[18:38:29] (03PS1) 10Clare Ming: Test Kitchen UI: Deploying v1.1.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215226 [18:38:49] (03PS5) 10Aaron Schulz: Remove /data-parsoid/ endpoint per T393557 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) [18:39:02] (03Merged) 10jenkins-bot: kartotherian: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215223 (https://phabricator.wikimedia.org/T383328) (owner: 10Isabelle Hurbain-Palatin) [18:39:24] (03PS1) 10Clare Ming: Test Kitchen UI: Deploying v1.1.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215230 [18:39:31] (03PS1) 10Dzahn: miscweb: add wikipedia25 release (WIP) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215231 (https://phabricator.wikimedia.org/T408592) [18:40:11] (03PS6) 10Aaron Schulz: Remove /data-parsoid/ endpoint from specs per T393557 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) [18:40:32] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [18:41:32] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploying v1.1.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215226 (owner: 10Clare Ming) [18:41:42] (03CR) 10Santiago Faci: [C:03+2] Test Kitchen UI: Deploying v1.1.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215230 (owner: 10Clare Ming) [18:43:22] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying v1.1.3 release to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215226 (owner: 10Clare Ming) [18:43:23] (03Merged) 10jenkins-bot: Test Kitchen UI: Deploying 
v1.1.3 release to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1215230 (owner: 10Clare Ming) [18:45:57] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [18:46:37] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [18:50:15] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [18:50:48] !log cjming@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [18:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:57:01] (03CR) 10Andrew Bogott: [C:03+1] openstack: puppet: Do not commit empty role fiels [puppet] - 10https://gerrit.wikimedia.org/r/1214491 (owner: 10Majavah) [19:02:00] (03PS5) 10CDanis: tcpproxy: include profile::lvs::realserver in role [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:02:29] (03PS1) 10Kosta Harlan: hCaptcha: Persist the captcha consequence in the user session [extensions/ConfirmEdit] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215234 (https://phabricator.wikimedia.org/T410657) [19:03:01] (03PS6) 10Dzahn: tcpproxy: include profile::lvs::realserver in role [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) [19:03:19] (03PS7) 10CDanis: tcpproxy: include profile::lvs::realserver in role [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:03:20] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:04:56] I'm going to backport a patch to wmf.5, unless someone else is deploying 
now [19:05:47] (03CR) 10Dzahn: [C:03+2] conftool-data: add tcp-proxy gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:06:02] (03PS2) 10Jelto: conftool-data: add tcp-proxy gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259) [19:06:27] (03CR) 10Dzahn: [C:03+2] conftool-data: add tcp-proxy gerrit service [puppet] - 10https://gerrit.wikimedia.org/r/1214454 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:06:34] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/ConfirmEdit] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215234 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [19:09:02] (03CR) 10Dzahn: [C:03+2] service: add gerrit-https service to service catalog [puppet] - 10https://gerrit.wikimedia.org/r/1202842 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:10:10] (03Restored) 10Dzahn: service::catalog: add gerrit-https and gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:11:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11434251 (10andrea.denisse) 05Open→03In progress a:03andrea.denisse [19:12:42] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/kartotherian: apply [19:13:22] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/kartotherian: apply [19:18:38] (03Merged) 10jenkins-bot: hCaptcha: Persist the captcha consequence in the user session [extensions/ConfirmEdit] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215234 (https://phabricator.wikimedia.org/T410657) (owner: 10Kosta Harlan) [19:19:00] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1215234|hCaptcha: 
Persist the captcha consequence in the user session (T410657)]] [19:19:03] T410657: hCaptcha: Improve support for SiteKey verification - https://phabricator.wikimedia.org/T410657 [19:19:43] (03PS2) 10Dzahn: service::catalog: add gerrit-https and gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:20:02] (03CR) 10CI reject: [V:04-1] service::catalog: add gerrit-https and gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:21:02] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1215234|hCaptcha: Persist the captcha consequence in the user session (T410657)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [19:21:43] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1215240 [19:21:50] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215240 (owner: 10CDanis) [19:22:23] (03PS3) 10Dzahn: service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:24:29] !log kharlan@deploy2002 kharlan: Continuing with sync [19:26:07] (03PS4) 10Dzahn: service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:27:25] (03PS5) 10Dzahn: service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:27:56] (03CR) 10CI reject: [V:04-1] service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:28:19] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215240 (owner: 10CDanis) [19:29:08] (03PS6) 10Dzahn: service::catalog: add gerrit-ssh [puppet] 
- 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:29:24] (03CR) 10Dzahn: service::catalog: add gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:29:38] (03CR) 10CI reject: [V:04-1] service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:30:15] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215234|hCaptcha: Persist the captcha consequence in the user session (T410657)]] (duration: 11m 16s) [19:30:19] T410657: hCaptcha: Improve support for SiteKey verification - https://phabricator.wikimedia.org/T410657 [19:30:42] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1215240 [19:30:47] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215240 (owner: 10CDanis) [19:32:04] (03PS7) 10Dzahn: service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:32:33] (03CR) 10CI reject: [V:04-1] service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:32:48] (03CR) 10Dzahn: service::catalog: add gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:35:07] (03CR) 10CDanis: service::catalog: add gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) 
[19:35:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:36:21] (03PS1) 10Andrew Bogott: admin data: update yubikey pubkey for Andrew Bogott [puppet] - 10https://gerrit.wikimedia.org/r/1215242 [19:37:02] (03PS8) 10Dzahn: service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:37:18] (03CR) 10Andrew Bogott: [C:03+2] admin data: update yubikey pubkey for Andrew Bogott [puppet] - 10https://gerrit.wikimedia.org/r/1215242 (owner: 10Andrew Bogott) [19:38:18] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1215240 [19:38:24] (03CR) 10Dzahn: service::catalog: add gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:38:26] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1215240 (owner: 10CDanis) [19:39:00] (03CR) 10CDanis: [C:03+1] service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:39:21] (03PS8) 10CDanis: tcpproxy: include profile::lvs::realserver in role [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:41:40] (03CR) 10CDanis: [C:04-1] "Please use instead: Ic8dc08993269f666b1360defd95abd7fb26813fb" [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:44:37] (03CR) 10CDanis: [C:04-1] WIP (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1215240 (owner: 10CDanis) [19:45:00] (03Abandoned) 10Dzahn: tcpproxy: 
include profile::lvs::realserver in role [puppet] - 10https://gerrit.wikimedia.org/r/1203157 (https://phabricator.wikimedia.org/T408532) (owner: 10Dzahn) [19:45:44] (03CR) 10Dzahn: service::catalog: add gerrit-ssh (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:45:45] (03CR) 10Dzahn: [C:03+2] service::catalog: add gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1214453 (https://phabricator.wikimedia.org/T365259) (owner: 10Jelto) [19:46:50] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [19:47:14] (03Abandoned) 10Ryan Kemper: elastic: reboot should check uptime not jvm start time [cookbooks] - 10https://gerrit.wikimedia.org/r/1207280 (https://phabricator.wikimedia.org/T410577) (owner: 10Ryan Kemper) [19:47:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:51:55] (03PS1) 10Superpes15: [tokwiki] Allow sysops to grant/remove confirmed status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215251 (https://phabricator.wikimedia.org/T411683) [19:52:56] (03PS1) 10Dzahn: service::catalog: fix conftool cluster name and disable paging for gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215252 (https://phabricator.wikimedia.org/T365259) [19:53:12] (03PS2) 10Dzahn: service::catalog: fix conftool cluster name and disable paging for gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215252 
(https://phabricator.wikimedia.org/T365259) [19:53:14] (03CR) 10CI reject: [V:04-1] service::catalog: fix conftool cluster name and disable paging for gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215252 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:54:14] 10ops-eqiad, 06SRE, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411684#11434420 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Closing ticket. this look like on the grafna page... [19:54:17] (03CR) 10CDanis: [C:03+1] service::catalog: fix conftool cluster name and disable paging for gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215252 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:54:27] (03CR) 10Dzahn: [C:03+2] service::catalog: fix conftool cluster name and disable paging for gerrit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/1215252 (https://phabricator.wikimedia.org/T365259) (owner: 10Dzahn) [19:59:44] (03PS1) 10Kosta Harlan: Use a separate right for Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215258 (https://phabricator.wikimedia.org/T411557) [20:00:22] and syncing another patch, unless there are any objections [20:00:58] !log import libvmod-netmapper 1.10-1~deb13+wmf1 into trixie-wikimedia - T401832 [20:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:01] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [20:01:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215258 (https://phabricator.wikimedia.org/T411557) (owner: 10Kosta Harlan) [20:01:44] this one will take a while, as it has i18n changes [20:03:18] (03PS1) 10Jforrester: Followup Ie40b9e59a4: Fortify unified metrics 
method [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215259 (https://phabricator.wikimedia.org/T411793) [20:08:57] (03PS1) 10Superpes15: [ukwiki] Limit thanks for newbie to 3 per hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) [20:09:41] FIRING: [7x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:10:12] (03PS2) 10Superpes15: [ukwiki] Limit thanks for newbie to 3 per hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) [20:10:17] (03PS1) 10Ejegg: Shorten 'close' cookie wait period for enwiki banners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215263 (https://phabricator.wikimedia.org/T411800) [20:12:50] (03CR) 10Greg Grossmeier: [C:03+1] "This was discussed in a call with Sam, Elliott, and myself (and a few others) and we agree to push this change out for now to save some of" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215263 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [20:13:26] !log import libvmod-querysort 0.4~deb13+wmf1 into trixie-wikimedia - T401832 [20:13:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:13:30] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [20:13:37] (03Merged) 10jenkins-bot: Use a separate right for Special:SuggestedInvestigations [extensions/CheckUser] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215258 (https://phabricator.wikimedia.org/T411557) (owner: 10Kosta Harlan) [20:13:56] !log kharlan@deploy2002 Started scap sync-world: Backport for [[gerrit:1215258|Use a separate right for Special:SuggestedInvestigations (T411557)]] [20:14:19] (03PS3) 10Superpes15: [ukwiki] 
Limit thanks for newbies to 3 per hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) [20:14:41] FIRING: [14x] ConfdResourceFailed: confd resource _srv_config-master_pybal_codfw_gerrit-ssh.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:15:03] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215263 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [20:16:14] (03Abandoned) 10Andriy.v: Limit thanks for new users at uk.wikipedia to 3 per hour [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214636 (owner: 10Andriy.v) [20:21:31] (03PS3) 10Andrea Denisse: Add Thiemo Kreuz to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) [20:24:48] (03CR) 10A smart kitten: [C:03+1] "Code LGTM :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215251 (https://phabricator.wikimedia.org/T411683) (owner: 10Superpes15) [20:27:57] (03CR) 10Dzahn: Add Thiemo Kreuz to analytics_privatedata_users (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:28:03] (03CR) 10A smart kitten: "question: You know more than me here so I'll defer to you, but has there been enough time for the community discussion to take place befor" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [20:28:06] (03CR) 10Dzahn: [C:03+1] "lgtm, one nitpick inline" [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:28:19] 
!log Delete libvmod-netmapper 1.10-1~deb13+wmf1, import libvmod-netmapper 1.10~deb13+wmf1 into trixie-wikimedia - T401832 [20:28:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:22] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [20:28:58] (03PS4) 10Andrea Denisse: Add Thiemo Kreuz to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) [20:29:29] (03CR) 10Dzahn: [C:03+1] Add Thiemo Kreuz to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:29:30] (03CR) 10Andrea Denisse: Add Thiemo Kreuz to analytics_privatedata_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:29:41] (03CR) 10CI reject: [V:04-1] Add Thiemo Kreuz to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:29:48] (03CR) 10Dzahn: [C:03+1] Add Thiemo Kreuz to analytics_privatedata_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:30:27] (03PS5) 10Andrea Denisse: Add Thiemo Kreuz to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) [20:32:53] (03CR) 10Superpes15: "Consensus seems clear and change shouldn't harm anyone, but since you've some concerns I'll wait and will schedule this during the next we" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [20:35:40] (03CR) 10Superpes15: "For total clarity, this concerns LTA activity, it should be a temporary patch, so I thought that wait was more harmful to the project than" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [20:38:57] (03CR) 10A smart kitten: "Thank you :) Yeah, I imagine it might not change anything; but at least then folks who don't check the wiki every day will get the chance " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [20:40:11] (03CR) 10A smart kitten: "(I replied before seeing your most recent message here. To be clear, if you think it's better to deploy it today then please feel free to " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [20:40:29] (03CR) 10Andrea Denisse: [C:03+2] Add Thiemo Kreuz to analytics_privatedata_users [puppet] - 10https://gerrit.wikimedia.org/r/1215264 (https://phabricator.wikimedia.org/T411612) (owner: 10Andrea Denisse) [20:47:29] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11434569 (10andrea.denisse) 05In progress→03Resolved [20:50:12] !log import libvmod-wmfuniq 0.2.0~deb13+wmf1 into trixie-wikimedia - T401832 [20:50:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:15] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [20:50:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11434576 (10andrea.denisse) [20:56:51] (03CR) 10Superpes15: "Naah, no rush, you're right and we should wait ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [20:57:23] !log kharlan@deploy2002 kharlan: Backport for [[gerrit:1215258|Use a separate right for Special:SuggestedInvestigations (T411557)]] synced to the 
testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:58:53] !log kharlan@deploy2002 kharlan: Continuing with sync [21:00:04] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: Time to snap out of that daydream and deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T2100). [21:00:05] maryum, James_F, AaronSchulz, Superpes, and ejegg: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] Heya. [21:00:28] I see kostajh is still deploying. [21:00:30] o/ [21:00:42] Yep [21:01:25] i'm here [21:01:48] !log taavi@deploy2002 mwscript-k8s job started: initEditCount --wiki=tokwiki [21:02:41] it's syncing out now [21:02:48] should be done in a few minutes [21:03:02] https://spiderpig.wikimedia.org/jobs/1043 [21:03:10] * James_F nods. [21:03:26] if anyone needs a deployer, happy to help -- otherwise self-deployers can self-queue/organize [21:03:31] I can deploy if no-one else volunteers. [21:03:34] !log import varnishkafka 1.2.0~deb13+wmf1 into trixie-wikimedia - T401832 [21:03:34] Hah, snap. [21:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:03:37] T401832: Upgrade Traffic hosts to trixie - https://phabricator.wikimedia.org/T401832 [21:04:37] Just FTR my patch is very simple, it can be merged together with any other patch, at your own discretion :) [21:04:49] Superpes: Yeah, I'll do yours first anyway. [21:04:56] (03PS1) 10Andrea Denisse: Add Caro Medelius to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1215267 (https://phabricator.wikimedia.org/T411543) [21:04:57] (03CR) 10Andrea Denisse: "Waiting for manager's explicit approval before merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/1215267 (https://phabricator.wikimedia.org/T411543) (owner: 10Andrea Denisse) [21:05:48] * cjming thanks James_F [21:05:52] * James_F drums fingers waiting for sync. [21:05:57] (03CR) 10Superpes15: [C:04-1] "Just waiting a few other days to achieve a clear consensus" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215262 (https://phabricator.wikimedia.org/T411588) (owner: 10Superpes15) [21:06:24] Thanks @James_F :) [21:09:02] 06SRE, 10SRE-Access-Requests: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404#11434612 (10Raine) I am leaving this open as a reminder to delete the old key, but I'm currently unable to do that (blocked by T411816). If it's in the way, feel free to close it. [21:09:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11434615 (10VPuffetMichel) @andrea.denisse I approve this access for Caro. Thank you! [21:11:29] Okie-dokie, let's do maryum, Superpes, ejegg, and AaronSchulz's patches together. 
[21:11:38] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11434622 (10andrea.denisse) [21:11:41] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215258|Use a separate right for Special:SuggestedInvestigations (T411557)]] (duration: 57m 45s) [21:11:55] 10ops-eqiad, 06DC-Ops: Inbound errors on interface ssw1-e1-eqiad:xe-0/0/32 (Transport: lvs1020:enp94s0f0np0 (Equinix, 21996479) {#21989994}) - https://phabricator.wikimedia.org/T411818 (10phaultfinder) 03NEW [21:11:56] (03CR) 10Andrea Denisse: [C:03+2] Add Caro Medelius to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1215267 (https://phabricator.wikimedia.org/T411543) (owner: 10Andrea Denisse) [21:12:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215251 (https://phabricator.wikimedia.org/T411683) (owner: 10Superpes15) [21:12:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [21:12:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [21:12:15] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215263 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [21:13:27] (03Merged) 10jenkins-bot: [tokwiki] Allow sysops to grant/remove confirmed status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215251 (https://phabricator.wikimedia.org/T411683) (owner: 10Superpes15) [21:13:35] (03Merged) 10jenkins-bot: OATHAuth: Remove 
wmgOATHAuthDisableRight [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214659 (https://phabricator.wikimedia.org/T399664) (owner: 10Mstyles) [21:13:38] (03Merged) 10jenkins-bot: Remove /data-parsoid/ endpoint from specs per T393557 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1214143 (https://phabricator.wikimedia.org/T411517) (owner: 10Aaron Schulz) [21:13:42] (03Merged) 10jenkins-bot: Shorten 'close' cookie wait period for enwiki banners [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215263 (https://phabricator.wikimedia.org/T411800) (owner: 10Ejegg) [21:13:44] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, December 04 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) (owner: 10Isabelle Hurbain-Palatin) [21:14:03] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1215251|[tokwiki] Allow sysops to grant/remove confirmed status (T411683)]], [[gerrit:1214659|OATHAuth: Remove wmgOATHAuthDisableRight (T399664)]], [[gerrit:1214143|Remove /data-parsoid/ endpoint from specs per T393557 (T411517)]], [[gerrit:1215263|Shorten 'close' cookie wait period for enwiki banners (T411800)]] [21:14:10] i'm going to slip into the tail end of this window if possible. 
[21:14:17] T411683: Allow tokwiki admins to grant and remove 'confirmed' - https://phabricator.wikimedia.org/T411683 [21:14:17] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [21:14:17] T393557: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557 [21:14:18] T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517 [21:14:18] T411800: CentralNotice code changes to show a banner to a reader with the 'waitdate: close' status - https://phabricator.wikimedia.org/T411800 [21:14:33] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for medelius - https://phabricator.wikimedia.org/T411543#11434640 (10andrea.denisse) 05In progress→03Resolved Closing as resolved, please let me know if there's anything else I can assist with. [21:14:34] cscott: First I've got mine and Moriel's backports, but sure. [21:14:44] (03PS1) 10Andrea Denisse: Add new SSH key for Zoe. [puppet] - 10https://gerrit.wikimedia.org/r/1215270 (https://phabricator.wikimedia.org/T411506) [21:15:05] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics_privatedata_users for Thiemo Kreuz (WMDE) - https://phabricator.wikimedia.org/T411612#11434671 (10andrea.denisse) Closing as resolved, please let me know if there's anything else I can assist with. [21:15:13] (03CR) 10Andrea Denisse: [C:03+2] "I confirmed with Zoe that this is her key." [puppet] - 10https://gerrit.wikimedia.org/r/1215270 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse) [21:17:52] thanks James_F, i'm seeing the new value at least on the test servers [21:18:01] Cool. [21:18:11] Or at least, most of the test servers. 
[21:18:17] !log jforrester@deploy2002 mstyles, aaron, superpes, jforrester, ejegg: Backport for [[gerrit:1215251|[tokwiki] Allow sysops to grant/remove confirmed status (T411683)]], [[gerrit:1214659|OATHAuth: Remove wmgOATHAuthDisableRight (T399664)]], [[gerrit:1214143|Remove /data-parsoid/ endpoint from specs per T393557 (T411517)]], [[gerrit:1215263|Shorten 'close' cookie wait period for enwiki banners (T411800)]] synced to the t [21:18:18] estservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:18:18] Two still haven't synced. [21:18:20] Aha. [21:18:31] @James_F Mine works fine too :) [21:18:36] Excellent. [21:19:32] !log jforrester@deploy2002 mstyles, aaron, superpes, jforrester, ejegg: Continuing with sync [21:20:12] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:20:57] (03PS1) 10Andrea Denisse: Remove unused SSH key for Zoe [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) [21:20:58] (03CR) 10Andrea Denisse: "To merge once she ensures access with her previous key." [puppet] - 10https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: 10Andrea Denisse) [21:23:31] James_F: thanks [21:23:50] Of course. Thanks for getting rid of old API endpoints. 
:-) [21:24:07] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215251|[tokwiki] Allow sysops to grant/remove confirmed status (T411683)]], [[gerrit:1214659|OATHAuth: Remove wmgOATHAuthDisableRight (T399664)]], [[gerrit:1214143|Remove /data-parsoid/ endpoint from specs per T393557 (T411517)]], [[gerrit:1215263|Shorten 'close' cookie wait period for enwiki banners (T411800)]] (duration: 10m 04s) [21:24:16] T411683: Allow tokwiki admins to grant and remove 'confirmed' - https://phabricator.wikimedia.org/T411683 [21:24:16] T399664: Expand 2FA Opt-In Privileges - https://phabricator.wikimedia.org/T399664 [21:24:17] T393557: Block external traffic to RESTBase /page/data-parsoid endpoint and investigate internal usage - https://phabricator.wikimedia.org/T393557 [21:24:17] T411517: Clean up Math API OpenAPI specs and remove data-parsoid route specs - https://phabricator.wikimedia.org/T411517 [21:24:17] T411800: CentralNotice code changes to show a banner to a reader with the 'waitdate: close' status - https://phabricator.wikimedia.org/T411800 [21:24:30] I'm here for my deploy! [21:24:32] is it too late? [21:24:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215259 (https://phabricator.wikimedia.org/T411793) (owner: 10Jforrester) [21:24:41] Many thanks for your assistance @James_F :3 [21:24:43] I can also deploy on my own with spiderpig [21:24:46] maryum: Already done. All good. [21:24:52] yay thank you so much! [21:24:58] Superpes: Happy to help. [21:25:04] maryum: Of course! Have a good Thursday. [21:25:16] James_F you too! 
[21:25:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624#11434692 (10andrea.denisse) 05Open→03In progress a:03andrea.denisse [21:25:28] (03CR) 10Jforrester: [C:03+2] CdxDialog: use-close-button prop needs to be set to true [extensions/WikiLambda] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215224 (https://phabricator.wikimedia.org/T411655) (owner: 10Jforrester) [21:26:20] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624#11434694 (10andrea.denisse) [21:30:20] * James_F drums fingers again. [21:37:28] (03Merged) 10jenkins-bot: Followup Ie40b9e59a4: Fortify unified metrics method [core] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215259 (https://phabricator.wikimedia.org/T411793) (owner: 10Jforrester) [21:37:31] (03Merged) 10jenkins-bot: CdxDialog: use-close-button prop needs to be set to true [extensions/WikiLambda] (wmf/1.46.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1215224 (https://phabricator.wikimedia.org/T411655) (owner: 10Jforrester) [21:37:48] !log jforrester@deploy2002 Started scap sync-world: Backport for [[gerrit:1215259|Followup Ie40b9e59a4: Fortify unified metrics method (T411793)]] [21:37:51] T411793: Fortify new API metrics method - https://phabricator.wikimedia.org/T411793 [21:37:53] Finally. [21:40:02] !log jforrester@deploy2002 jforrester: Backport for [[gerrit:1215259|Followup Ie40b9e59a4: Fortify unified metrics method (T411793)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:40:59] !log jforrester@deploy2002 jforrester: Continuing with sync [21:43:26] (03CR) 10Kamila Součková: [C:03+1] "yay, thanks!" 
[puppet] - https://gerrit.wikimedia.org/r/1214509 (owner: Majavah)
[21:45:04] !log jforrester@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215259|Followup Ie40b9e59a4: Fortify unified metrics method (T411793)]] (duration: 07m 16s)
[21:45:08] T411793: Fortify new API metrics method - https://phabricator.wikimedia.org/T411793
[21:45:09] cscott: Over to you.
[21:45:25] ok, fingers crossed on this one ;)
[21:46:10] cscott: The last possible deploy slot just before everyone travels and there's no train next week; what could possibly go wrong? ;-)
[21:46:18] exactly!
[21:47:53] i'm only touching officewiki and test/test2 wiki though, so i hope the damage i can do is limited
[21:47:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:49:22] (PS3) C.
Scott Ananian: Activate postprocessing cache on testwiki, test2wiki, officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:51:03] (CR) TrainBranchBot: [C:+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:52:01] (Merged) jenkins-bot: Activate postprocessing cache on testwiki, test2wiki, officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1215115 (https://phabricator.wikimedia.org/T348255) (owner: Isabelle Hurbain-Palatin)
[21:52:22] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1215115|Activate postprocessing cache on testwiki, test2wiki, officewiki (T348255)]]
[21:52:26] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[21:54:27] !log cscott@deploy2002 ihurbain, cscott: Backport for [[gerrit:1215115|Activate postprocessing cache on testwiki, test2wiki, officewiki (T348255)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[21:59:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251204T2200)
[22:00:56] finishing up, just testing still
[22:02:40] Hey all - would like to get a couple of security patches out if backports are wrapping up and the Web Team won’t be using their window.
[22:02:41] !log cscott@deploy2002 ihurbain, cscott: Continuing with sync
[22:04:20] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624#11434841 (andrea.denisse)
[22:04:25] yup, i'm just wrapping up (from backports), can't speak for web team.
[22:05:14] thanks James_F for always being lovely and a wonderful deployer :)
[22:05:34] greg-g: Always.
[22:05:50] greg-g: Thanks to FRT for being wonderful people doing great work.
[22:05:55] we try!
[22:06:45] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1215115|Activate postprocessing cache on testwiki, test2wiki, officewiki (T348255)]] (duration: 14m 23s)
[22:06:49] T348255: Parser cache infrastructure for OutputTransform - https://phabricator.wikimedia.org/T348255
[22:07:26] (PS1) Hubaishan: [config] arwiktionary: add 2 namespaces with talks [mediawiki-config] - https://gerrit.wikimedia.org/r/1215280 (https://phabricator.wikimedia.org/T411819)
[22:08:13] (CR) CI reject: [V:-1] [config] arwiktionary: add 2 namespaces with talks [mediawiki-config] - https://gerrit.wikimedia.org/r/1215280 (https://phabricator.wikimedia.org/T411819) (owner: Hubaishan)
[22:09:01] (PS1) Andrea Denisse: Add Riku Silvola to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1215281 (https://phabricator.wikimedia.org/T411624)
[22:11:08] !log ryankemper@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on stat[1008-1011].eqiad.wmnet with reason: T411568
[22:11:13] T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568
[22:11:15] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Riku Silvola - https://phabricator.wikimedia.org/T411624#11434872 (andrea.denisse) In progress→Resolved Closing as resolved, please let me know if there's anything else I can assist with.
[22:12:32] SRE, collaboration-services, Traffic, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11434875 (Dzahn) We have to switch these hosts from nftables back to ferm as firewall provider. Reason: liberica does not support nftables yet.
[22:14:05] i'm done, thanks James_F for letting me stretch the window. ;)
[22:14:43] i learned fun new things about FlaggedRevisions and what's a day like without some surprising new thing to learn?
[22:16:04] (PS1) Dzahn: tcpproxy: switch firewall provider from nftables to ferm [puppet] - https://gerrit.wikimedia.org/r/1215284 (https://phabricator.wikimedia.org/T408532)
[22:18:16] Ok, starting on the 2 sec deploys unless there are any objections...
[22:18:28] (PS2) Hubaishan: [config] arwiktionary: add 2 namespaces with talks [mediawiki-config] - https://gerrit.wikimedia.org/r/1215280 (https://phabricator.wikimedia.org/T411819)
[22:18:31] none from me
[22:19:49] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 08 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1215280 (https://phabricator.wikimedia.org/T411819) (owner: Hubaishan)
[22:20:44] !log T411568 Rebooting `stat*`
[22:20:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:48] T411568: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568
[22:22:16] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 14 hosts with reason: T408532
[22:22:20] T408532: Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532
[22:23:08] (CR) Dzahn: [C:+2] "puppet disabled on all - downtimed all - ..
switching a single one first" [puppet] - https://gerrit.wikimedia.org/r/1215284 (https://phabricator.wikimedia.org/T408532) (owner: Dzahn)
[22:28:25] !log Deployed security fix for T408135
[22:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:29:54] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:30:41] SRE, SRE-Access-Requests, Fundraising-Backlog: Requesting access to analytics-privatedata-users for astein - https://phabricator.wikimedia.org/T411679#11434945 (andrea.denisse) In progress→Resolved Closing as resolved, feel free to reopen if there's anything else I can assist with.
[22:31:26] SRE, SRE-Access-Requests: Add FIDO-backed SSH key for brennen - https://phabricator.wikimedia.org/T411730#11434954 (andrea.denisse) Hi folks, the patch for this task is merged. Can we close it as resolved?
[22:35:23] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster
[22:36:26] SRE, Data-Platform-SRE (2025.11.07 - 2025.11.28), Patch-For-Review: October 2025 Bullseye reboots: Data Platform Engineering-owned hosts - https://phabricator.wikimedia.org/T411568#11434969 (RKemper) Stat host reboots completed. Shifting gears to rebooting `an-test*`. Note there's still lots of `an-...
[22:37:04] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11434970 (andrea.denisse)
[22:37:10] !log Deployed security fix for T409226
[22:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:23] SRE, SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11434984 (andrea.denisse) In progress→Stalled >>! In T411436#11426873, @andrea.denisse wrote: >>>! In T411436#11425341, @SEgt-WMF wrote: >> In c...
[22:42:47] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-cluster
[22:42:47] !log dzahn@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-cluster (exit_code=99)
[22:44:52] Sec deploys done, thanks.
[22:46:14] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy1002.eqiad.wmnet
[22:47:13] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy2001.codfw.wmnet
[22:48:08] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy2002.codfw.wmnet
[22:48:20] SRE-Sprint-Week-Sustainability-March2023, Beta-Cluster-Infrastructure, DBA, MediaWiki-libs-Rdbms, Epic: Enable MariaDB/MySQL's Strict Mode - https://phabricator.wikimedia.org/T108255#11435001 (Reedy)
[22:49:00] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy3001.esams.wmnet
[22:50:05] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy1002.eqiad.wmnet
[22:50:24] SRE, SRE-Access-Requests: Add FIDO-backed SSH key for brennen - https://phabricator.wikimedia.org/T411730#11435007 (brennen) Open→Resolved a:brennen Confirmed new key is working against production machines. Thanks!
[22:50:42] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4001.ulsfo.wmnet
[22:51:02] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy2001.codfw.wmnet
[22:51:25] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4001.ulsfo.wmnet
[22:51:33] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy3002.esams.wmnet
[22:51:39] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy4002.ulsfo.wmnet
[22:51:49] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy2002.codfw.wmnet
[22:52:35] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy5001.eqsin.wmnet
[22:52:55] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy3001.esams.wmnet
[22:54:54] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST secrets) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[22:55:12] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:55:23] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy4002.ulsfo.wmnet
[22:55:28] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy3002.esams.wmnet
[22:55:32] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy5002.eqsin.wmnet
[22:55:56] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host
tcp-proxy6001.drmrs.wmnet
[22:56:14] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy6002.drmrs.wmnet
[22:56:37] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy5001.eqsin.wmnet
[22:58:30] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy7001.magru.wmnet
[22:59:34] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy5002.eqsin.wmnet
[22:59:52] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy6001.drmrs.wmnet
[23:00:11] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy6002.drmrs.wmnet
[23:00:22] !log dzahn@cumin2002 START - Cookbook sre.hosts.reboot-single for host tcp-proxy7002.magru.wmnet
[23:02:27] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy7001.magru.wmnet
[23:04:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host tcp-proxy7002.magru.wmnet
[23:05:59] SRE, collaboration-services, Traffic, Release-Engineering-Team (Radar): Deploy a TCP proxy across all DCs - https://phabricator.wikimedia.org/T408532#11435041 (Dzahn) downtimed, ran puppet, rebooted the 14 VMs and verified ferm service is running via cumin/cookbook. they are all on ferm now.
[23:07:38] (CR) Dzahn: "I have switched the tcp-proxy VMs to ferm now."
[puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:10:28] (PS4) Dzahn: tcpproxy: add lvs::realserver:* to puppet role (WIP) [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:11:04] (CR) Dzahn: tcpproxy: add lvs::realserver:* to puppet role (WIP) (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:12:47] (PS5) Dzahn: tcpproxy: add lvs::realserver:* to puppet role (WIP) [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:13:01] (CR) Dzahn: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:15:14] (PS3) Dzahn: miscweb: add wikipedia25.org to extra SANs [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592)
[23:16:40] !log removing 5 files for legal compliance
[23:16:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:16:45] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster
[23:17:56] (CR) Dzahn: "for context on the current status of that domain: https://phabricator.wikimedia.org/T408168" [deployment-charts] - https://gerrit.wikimedia.org/r/1215225 (https://phabricator.wikimedia.org/T408592) (owner: Dzahn)
[23:19:20] (CR) Dzahn: [V:+1] "compiler output shows noop on all the hosts in the commit message and on tcpproxy itself: https://puppet-compiler.wmflabs.org/output/12152" [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:20:11] FIRING: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[23:23:07] !log removing 3 files for legal compliance
[23:23:08] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:24:00] SRE, collaboration-services, Patch-For-Review: setup gerrit2003 with gerrit service (gerrit on bookworm) - https://phabricator.wikimedia.org/T372804#11435088 (Dzahn) In progress→Resolved I am happy to be convinced otherwise and if you want to reopen it that's not a big deal to me. But all...
[23:25:11] RESOLVED: Temperature: Temp issue on wdqs1023:9290 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook - https://grafana.wikimedia.org/d/ZA1I-IB4z/ipmi-sensor-state?orgId=1&viewPanel=92&var-server=wdqs1023 - https://alerts.wikimedia.org/?q=alertname%3DTemperature
[23:27:01] (CR) Dzahn: [C:+1] Remove unused SSH key for Zoe [puppet] - https://gerrit.wikimedia.org/r/1215273 (https://phabricator.wikimedia.org/T411506) (owner: Andrea Denisse)
[23:30:12] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:34:56] !log removing 2 files for legal compliance
[23:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:36:57] (PS1) Dzahn: trafficserver: add a map for gerrit as a backend [puppet] - https://gerrit.wikimedia.org/r/1215317 (https://phabricator.wikimedia.org/T365259)
[23:38:07] (CR) Dzahn: "[cumin2002:~] $ host gerrit.discovery.wmnet" [puppet] - https://gerrit.wikimedia.org/r/1215317 (https://phabricator.wikimedia.org/T365259) (owner: Dzahn)
[23:44:02] (CR) CDanis: [C:+1] "LGTM ship it !
puppet should configure the servers with extra loopback addresses with the v4 and v6 in each cluster" [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:45:18] (PS6) Dzahn: tcpproxy: add lvs::realserver:* to puppet role [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:47:05] FIRING: KubernetesCalicoDown: ml-serve1013.eqiad.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1013.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[23:47:50] !log removing 4 files for legal compliance
[23:47:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:48:18] (CR) Dzahn: [C:+2] "thanks! doing" [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)
[23:53:53] (CR) Dzahn: [C:+2] "tested on tcp-proxy1001 first - I got 4 new interfaces: tunl0@NONE, ipip0@NONE, ip6tnl0@NONE and ipip60@NONE. And I got the gerrit-lb IP o" [puppet] - https://gerrit.wikimedia.org/r/1215240 (owner: CDanis)