[00:01:10] (PS1) DDesouza: Deploy 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) [00:01:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:19:06] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: DDesouza) [00:21:58] (PS1) Andrew Bogott: Revert "codfw1dev: roll back horizon version to 2025-06-23-141023" [puppet] - https://gerrit.wikimedia.org/r/1213129 [00:30:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:39:52] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213130 [00:39:52] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213130 (owner: TrainBranchBot) [00:40:10] (CR) Andrew Bogott: [C:+2] Revert "codfw1dev: roll back horizon version to 2025-06-23-141023" [puppet] - https://gerrit.wikimedia.org/r/1213129 (owner: Andrew Bogott) [00:40:11] FIRING: [2x] ConfdResourceFailed: confd
resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:52:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213130 (owner: TrainBranchBot) [01:00:48] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:10:04] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213131 [01:10:04] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213131 (owner: TrainBranchBot) [01:13:23] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 34s) [01:24:49] (CR) CI reject: [V:-1] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213131 (owner: TrainBranchBot) [01:27:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1230.eqiad.wmnet with reason: Maintenance [01:27:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1230 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86230 and previous config saved to /var/cache/conftool/dbconfig/20251201-012716-marostegui.json [01:27:20] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [01:27:21] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:22:58] getting intermittent timeouts [02:26:32] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All -
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:38:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [02:40:00] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [02:53:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86231 and previous config saved to /var/cache/conftool/dbconfig/20251201-025347-marostegui.json [02:53:52] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [02:53:52] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [02:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:58:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:01:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [03:03:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:03:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [03:08:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P86232 and previous config saved to /var/cache/conftool/dbconfig/20251201-030855-marostegui.json [03:24:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230', diff saved to https://phabricator.wikimedia.org/P86233 and previous config saved to /var/cache/conftool/dbconfig/20251201-032402-marostegui.json [03:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:31:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [03:36:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2015:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [03:39:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1230 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86234 and previous config saved to /var/cache/conftool/dbconfig/20251201-033910-marostegui.json [03:39:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [03:39:15] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [03:39:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1245.eqiad.wmnet with reason: Maintenance [03:41:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:48:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:50:13] FIRING: [2x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2014:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [03:53:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:00:13] FIRING: [2x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2008:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:03:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:05:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:08:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:10:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:15:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:30:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:30:13] FIRING: [4x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:35:13] FIRING: [4x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:50:13] FIRING: [6x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [04:56:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [04:58:02] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:00:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:10:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:25:13] FIRING: [3x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2011:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:35:00] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:35:13] FIRING: [2x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:36:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on 
db1163.eqiad.wmnet with reason: Maintenance [05:40:13] FIRING: [2x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2012:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:40:18] (PS1) Marostegui: filtered_tables.txt: Remove rc_type [puppet] - https://gerrit.wikimedia.org/r/1213136 (https://phabricator.wikimedia.org/T410531) [05:42:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2212.codfw.wmnet with reason: Maintenance [05:44:27] (CR) Marostegui: [C:+2] filtered_tables.txt: Remove rc_type [puppet] - https://gerrit.wikimedia.org/r/1213136 (https://phabricator.wikimedia.org/T410531) (owner: Marostegui) [06:45:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [06:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [07:07:55] (PS1) KartikMistry: Update cxserver to 2025-11-28-062930-production [deployment-charts] - https://gerrit.wikimedia.org/r/1213141 [07:19:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr2-eqord:xe-0/1/4 (Peering: Equinix (Wikimedia-CH2-IX-01 Chicago, MAC Filter, ...
[07:19:51] SR17915277) {#11374}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqord:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:21:03] looking [07:23:21] here too [07:24:45] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [07:24:46] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [07:24:51] FIRING: [2x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: Lumen (442550281) {#3867}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:25:27] acked ^ [07:28:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 613291720 and 43 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:29:45] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [07:29:51] FIRING: [3x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/3/2 (Transit: Lumen (442550281) {#3867}) #page 
- https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [07:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:32:50] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 32216 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:34:45] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [07:34:51] FIRING: [6x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:35:01] (PS1) Brouberol: Rename testkitchen domains to test-kitchen [dns] - https://gerrit.wikimedia.org/r/1213271 (https://phabricator.wikimedia.org/T407805) [07:41:32] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) -
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:47:18] (PS3) Brouberol: test-kitchen: allow reaching out to the mpic app via test-kitchen.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1212418 (https://phabricator.wikimedia.org/T407805) [07:47:18] (PS3) Brouberol: test-kitchen: add the additional test-kitchen.w.o domain to the ingress gateway hosts [deployment-charts] - https://gerrit.wikimedia.org/r/1212419 (https://phabricator.wikimedia.org/T407805) [07:47:18] (PS3) Brouberol: test-kitchen-next: set the OIDC callback URL domain to test-kitchen-next.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1212420 (https://phabricator.wikimedia.org/T407805) [07:47:18] (PS3) Brouberol: test-kitchen: set the OIDC callback URL domain to test-kitchen.w.o [deployment-charts] - https://gerrit.wikimedia.org/r/1212421 (https://phabricator.wikimedia.org/T407805) [07:47:19] (PS3) Brouberol: Rename mpic-next service to test-kitchen-next [deployment-charts] - https://gerrit.wikimedia.org/r/1212422 (https://phabricator.wikimedia.org/T407805) [07:47:20] (PS3) Brouberol: Rename mpic service to test-kitchen [deployment-charts] - https://gerrit.wikimedia.org/r/1212423 (https://phabricator.wikimedia.org/T407805) [07:47:24] (PS3) Brouberol: test-kitchen: drop the mpic.w.o SANs from the certificate [deployment-charts] - https://gerrit.wikimedia.org/r/1212424 (https://phabricator.wikimedia.org/T407805) [07:47:28] (PS3) Brouberol: testkitchen: drop the mpic.w.o domains from the ingress gateways [deployment-charts] - https://gerrit.wikimedia.org/r/1212425 (https://phabricator.wikimedia.org/T407805) [07:47:32] (PS3) Brouberol: testkitchen: rename the OIDC services [deployment-charts] - https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) [07:48:57] (CR) Slyngshede: [C:+1]
"Patch looks good. I'll leave it to others to judge if it can be deleted." [puppet] - https://gerrit.wikimedia.org/r/1212529 (owner: Muehlenhoff) [07:49:17] (CR) Slyngshede: [C:+1] Replace Leo as group approver with Hugh [puppet] - https://gerrit.wikimedia.org/r/1212530 (owner: Muehlenhoff) [07:49:26] (CR) TrainBranchBot: [C:+2] "Approved by mszwarc@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1212584 (https://phabricator.wikimedia.org/T392775) (owner: Mszwarc) [07:53:50] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 602630616 and 44 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:54:01] (Merged) jenkins-bot: Fix mw-userlink class being added too broadly [core] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1212584 (https://phabricator.wikimedia.org/T392775) (owner: Mszwarc) [07:54:51] FIRING: [6x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [07:55:03] !log mszwarc@deploy2002 Started scap sync-world: Backport for [[gerrit:1212584|Fix mw-userlink class being added too broadly (T392775)]] [07:55:06] T392775: Add link color for temporary usernames in content and discussion pages - https://phabricator.wikimedia.org/T392775 [07:57:48] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 116128 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [07:58:14] (PS2) Brouberol: test-kitchen: reconfigure the OIDC service ids to support 2 domains [puppet] -
https://gerrit.wikimedia.org/r/1212437 (https://phabricator.wikimedia.org/T407805) [07:58:14] (PS3) Brouberol: test-kitchen: allow public access from the internet [puppet] - https://gerrit.wikimedia.org/r/1212430 (https://phabricator.wikimedia.org/T407805) [07:58:14] (PS2) Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [07:58:14] (PS3) Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [07:58:15] (PS3) Brouberol: test-kitchen: rename the OIDC services [puppet] - https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [07:58:17] (PS3) Brouberol: mpic: delete kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [07:58:21] (PS3) Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [07:58:25] (PS3) Brouberol: mpic: delete services from service list [puppet] - https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [07:58:29] (PS1) Brouberol: Define the test-kitchen kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1213274 (https://phabricator.wikimedia.org/T407805) [07:58:33] (PS1) Brouberol: Define the test-kitchen services [puppet] - https://gerrit.wikimedia.org/r/1213275 (https://phabricator.wikimedia.org/T407805) [07:59:45] FIRING: [4x] Primary outbound port utilisation over 80% #page: Alert for device cr1-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [07:59:51] FIRING: [6x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90%
capacity - cr1-codfw:xe-1/0/1:0 (Transit: Arelion (IC-308846) {#10905_12273-1}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [08:00:01] <_joe_> uhm [08:00:05] Amir1, Urbanecm, and awight: That opportune time for a UTC morning backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T0800). [08:00:05] Msz2001: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:08] <_joe_> didn't block the issue? [08:00:13] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2022:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:00:33] I'm here and deploying [08:01:09] go for it. 
I'm around if you need help [08:04:45] FIRING: [4x] Primary outbound port utilisation over 80% #page: Device cr1-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:09:51] FIRING: [5x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [08:14:33] (CR) Btullis: [C:+1] Rename testkitchen domains to test-kitchen [dns] - https://gerrit.wikimedia.org/r/1213271 (https://phabricator.wikimedia.org/T407805) (owner: Brouberol) [08:14:45] RESOLVED: [3x] Primary outbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [08:14:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [08:14:51] RESOLVED: [5x] TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr1-eqiad:xe-3/0/6 (Peering: Equinix (21958836-A 111916-DC6-IX-02, MAC filter) {#2009}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [08:15:10] (CR) Btullis: [C:+1] Define the test-kitchen kubeconfigs [puppet] - https://gerrit.wikimedia.org/r/1213274
(https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:16:03] (03CR) 10Btullis: [C:03+1] test-kitchen: allow reaching out to the mpic app via test-kitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212418 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:17:49] (03CR) 10Brouberol: [C:03+2] Rename testkitchen domains to test-kitchen [dns] - 10https://gerrit.wikimedia.org/r/1213271 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:18:05] !log brouberol@dns1004 START - running authdns-update [08:18:35] !log mszwarc@deploy2002 mszwarc: Backport for [[gerrit:1212584|Fix mw-userlink class being added too broadly (T392775)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [08:18:37] T392775: Add link color for temporary usernames in content and discussion pages - https://phabricator.wikimedia.org/T392775 [08:19:15] !log brouberol@dns1004 END - running authdns-update [08:19:48] (03CR) 10Brouberol: [C:03+2] Define the test-kitchen kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1213274 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:19:55] !log mszwarc@deploy2002 mszwarc: Continuing with sync [08:22:40] (03CR) 10Brouberol: [C:03+2] test-kitchen: allow reaching out to the mpic app via test-kitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212418 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:29:05] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [08:29:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[08:30:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:33:38] !log mszwarc@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212584|Fix mw-userlink class being added too broadly (T392775)]] (duration: 38m 35s) [08:33:41] T392775: Add link color for temporary usernames in content and discussion pages - https://phabricator.wikimedia.org/T392775 [08:35:39] I'm done with deploying [08:36:17] (03PS4) 10Brouberol: test-kitchen: add the additional test-kitchen.w.o domain to the ingress gateway hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212419 (https://phabricator.wikimedia.org/T407805) [08:36:17] (03PS4) 10Brouberol: test-kitchen-next: set the OIDC callback URL domain to test-kitchen-next.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212420 (https://phabricator.wikimedia.org/T407805) [08:36:17] (03PS4) 10Brouberol: test-kitchen: set the OIDC callback URL domain to test-kitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212421 (https://phabricator.wikimedia.org/T407805) [08:36:17] (03PS4) 10Brouberol: Rename mpic-next service to test-kitchen-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212422 (https://phabricator.wikimedia.org/T407805) [08:36:18] (03PS4) 10Brouberol: Rename mpic service to test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212423 (https://phabricator.wikimedia.org/T407805) [08:36:20] (03PS4) 10Brouberol: test-kitchen: drop the mpic.w.o SANs from the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212424 (https://phabricator.wikimedia.org/T407805) [08:36:24] (03PS4) 10Brouberol: testkitchen: drop the mpic.w.o domains from the ingress 
gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212425 (https://phabricator.wikimedia.org/T407805) [08:36:28] (03PS4) 10Brouberol: testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) [08:36:32] (03PS1) 10Brouberol: Restore the mpic.discovery.wmnet dnsName in the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213416 [08:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:40:31] (03CR) 10Muehlenhoff: [C:03+2] Mark Tyler as group approver for deployment-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1212057 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [08:41:31] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11418351 (10MoritzMuehlenhoff) [08:45:00] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:47:33] (03CR) 10Brouberol: [C:03+2] Restore the mpic.discovery.wmnet dnsName in the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213416 (owner: 10Brouberol) [08:50:33] !log upgrade Envoy on config-master* T405808 [08:50:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:36] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [08:52:18] (03PS5) 10Brouberol: test-kitchen: add the additional test-kitchen.w.o domain to the ingress gateway hosts 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1212419 (https://phabricator.wikimedia.org/T407805) [08:52:18] (03PS5) 10Brouberol: test-kitchen-next: set the OIDC callback URL domain to test-kitchen-next.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212420 (https://phabricator.wikimedia.org/T407805) [08:52:18] (03PS5) 10Brouberol: test-kitchen: set the OIDC callback URL domain to test-kitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212421 (https://phabricator.wikimedia.org/T407805) [08:52:19] (03PS5) 10Brouberol: Rename mpic-next service to test-kitchen-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212422 (https://phabricator.wikimedia.org/T407805) [08:52:20] (03PS5) 10Brouberol: Rename mpic service to test-kitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212423 (https://phabricator.wikimedia.org/T407805) [08:52:21] (03PS5) 10Brouberol: test-kitchen: drop the mpic.w.o SANs from the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212424 (https://phabricator.wikimedia.org/T407805) [08:52:26] (03PS5) 10Brouberol: testkitchen: drop the mpic.w.o domains from the ingress gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212425 (https://phabricator.wikimedia.org/T407805) [08:52:30] (03PS5) 10Brouberol: testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) [08:52:34] (03PS1) 10Brouberol: Restore the mpic.discovery.wmnet dnsName in the certificate (2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213418 [08:52:52] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [08:58:21] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2151.codfw.wmnet with reason: Maintenance [08:58:29] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T410589)', diff saved to https://phabricator.wikimedia.org/P86235 and previous config saved to /var/cache/conftool/dbconfig/20251201-085828-ladsgroup.json [08:58:31] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [09:11:27] (03CR) 10Brouberol: [C:03+2] Restore the mpic.discovery.wmnet dnsName in the certificate (2) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213418 (owner: 10Brouberol) [09:13:57] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:14:10] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:14:12] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled: thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [09:14:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[09:15:00] FIRING: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:15:10] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:15:10] looking, even though I'm sure godog will know more about that service :) [09:15:12] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [09:15:24] heheh [09:15:30] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [09:15:34] yes [09:15:48] !incidents [09:15:48] 7071 (UNACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [09:15:48] 7068 (RESOLVED) TransitPeeringTransportOutboundSaturation network sre (cr2-eqord:9804 Peering: Equinix (Wikimedia-CH2-IX-01 Chicago, MAC Filter, SR17915277) {#11374} xe-0/1/4 gnmi eqiad) [09:15:49] 7070 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [09:15:49] 7069 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (cr2-eqord.wikimedia.org) [09:15:52] !ack 7071 [09:15:53] 7071 (ACKED) ProbeDown sre (10.2.2.53 ip4 thanos-query:443 probes/service http_thanos-query_ip4 eqiad) [09:16:08] thanos got OOM-killed on titan1001 [09:16:13] the same happened on Friday as well, BTW [09:16:27] sigh, ok thank you moritzm ! [09:16:30] tappof: FYI ^ [09:16:52] is that service a single node? Should it page if only 1 process goes down ? 
[09:17:13] thanos is two hosts per site [09:17:34] normally both are pooled, not sure about the current situation [09:18:32] both are pooled, but with one node down, we're hitting the pybal threshold [09:18:57] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:19:16] good point [09:19:58] should I re-open https://phabricator.wikimedia.org/T356788 or file a new task? [09:20:00] RESOLVED: ProbeDown: Service thanos-query:443 has failed probes (http_thanos-query_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thanos-query:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:00] FIRING: [2x] HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:21:06] I'll file a new task, it is thanos-store eating lots of memory [09:22:33] there is a revert that tappof is planning to do today regarding thanos-store, it may be related :( [09:23:38] (03PS3) 10Majavah: P:cloudceph::osd: Convert drange to an array [puppet] - 10https://gerrit.wikimedia.org/r/1212138 [09:25:12] 06SRE, 10observability: thanos-store OOMing on titan eqiad - https://phabricator.wikimedia.org/T411343 (10fgiunchedi) 03NEW [09:26:12] (03CR) 10Majavah: [C:03+2] P:cloudceph::osd: Convert drange to an array [puppet] - 10https://gerrit.wikimedia.org/r/1212138 (owner: 10Majavah) [09:26:14] 06SRE, 10observability: thanos-store OOMing on titan eqiad - https://phabricator.wikimedia.org/T411343#11418539 (10fgiunchedi) I'm aware there is/was
work going on on thanos/titan in {T410152} and perhaps related [09:26:34] ^^ thank you, I'll take a look [09:27:11] (03CR) 10Volans: [C:03+2] wmcs infra-tracing: optimize Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [09:27:32] taavi: if you get my change too feel free to merge it [09:27:36] *got [09:28:01] volans: mine merged already [09:28:14] ack thx [09:28:17] (03PS2) 10Majavah: P:toolforge::prometheus: Collect metrics for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1212186 (https://phabricator.wikimedia.org/T399313) [09:30:18] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Collect metrics for infra-tracing-loki [puppet] - 10https://gerrit.wikimedia.org/r/1212186 (https://phabricator.wikimedia.org/T399313) (owner: 10Majavah) [09:34:06] (03PS4) 10Majavah: hieradata: cloudgw: Configure individual v6 networks [puppet] - 10https://gerrit.wikimedia.org/r/1211667 (https://phabricator.wikimedia.org/T411081) [09:36:52] (03CR) 10Majavah: [C:03+2] hieradata: cloudgw: Configure individual v6 networks [puppet] - 10https://gerrit.wikimedia.org/r/1211667 (https://phabricator.wikimedia.org/T411081) (owner: 10Majavah) [09:37:31] (03Abandoned) 10Majavah: cloudgw: eqiad: introduce openstack octavia support [puppet] - 10https://gerrit.wikimedia.org/r/1147794 (https://phabricator.wikimedia.org/T394099) (owner: 10Arturo Borrero Gonzalez) [09:39:57] !log installing expat security updates [09:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:15] (03CR) 10Majavah: [C:03+2] Add former Toki Pona language codes [dns] - 10https://gerrit.wikimedia.org/r/1212577 (https://phabricator.wikimedia.org/T404507) (owner: 10Majavah) [09:52:22] !log taavi@dns1004 START - running authdns-update [09:53:16] !log jmm@cumin2002 START - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors rolling restart_daemons on A:logstash-collector [09:53:33] !log 
taavi@dns1004 END - running authdns-update [09:55:24] PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100% [10:00:05] this is me --^ [10:00:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.o11y.roll-restart-reboot-logstash-collectors (exit_code=0) rolling restart_daemons on A:logstash-collector [10:04:37] !log ayounsi@cumin1003 START - Cookbook sre.dns.netbox [10:06:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#11418705 (10elukey) To keep archives happy - the ml-serve1012 and 1013 hosts have been removed from the analytics vlan. [10:10:21] ayounsi@cumin1003 netbox (PID 3985016) is awaiting input [10:11:33] !log ayounsi@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change ml-serve1013 vlan - ayounsi@cumin1003" [10:11:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Change ml-serve1013 vlan - ayounsi@cumin1003" [10:11:37] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:11:58] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [10:12:06] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [10:12:36] (03PS1) 10Majavah: hieradata: Update Toolforge web proxy address [puppet] - 10https://gerrit.wikimedia.org/r/1213425 [10:13:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [10:13:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/postgresql-airflow-main: apply [10:14:07] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie [10:15:16] !log a-pizzata@deploy2002 
Started deploy [analytics/refinery@fa63f82] (hadoop-test): Analytics train TEST [analytics/refinery@fa63f82e] [10:16:02] (03PS1) 10Filippo Giunchedi: pontoon: extend README with dummy network interfaces [puppet] - 10https://gerrit.wikimedia.org/r/1213426 [10:16:10] (03CR) 10Majavah: [C:03+2] hieradata: Update Toolforge web proxy address [puppet] - 10https://gerrit.wikimedia.org/r/1213425 (owner: 10Majavah) [10:16:24] !log a-pizzata@deploy2002 Finished deploy [analytics/refinery@fa63f82] (hadoop-test): Analytics train TEST [analytics/refinery@fa63f82e] (duration: 01m 08s) [10:17:29] !log a-pizzata@deploy2002 Started deploy [analytics/refinery@fa63f82]: Regular analytics train [analytics/refinery@fa63f82e] [10:20:23] !log a-pizzata@deploy2002 Finished deploy [analytics/refinery@fa63f82]: Regular analytics train [analytics/refinery@fa63f82e] (duration: 02m 54s) [10:22:43] (03CR) 10Brouberol: [C:03+2] test-kitchen: add the additional test-kitchen.w.o domain to the ingress gateway hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212419 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:22:43] (03CR) 10Brouberol: [C:03+2] Define the test-kitchen services [puppet] - 10https://gerrit.wikimedia.org/r/1213275 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:23:05] !log javiermonton@deploy2002 Started deploy [analytics/refinery@fa63f82]: Regular analytics train [analytics/refinery@fa63f82e] [10:23:33] !log javiermonton@deploy2002 Finished deploy [analytics/refinery@fa63f82]: Regular analytics train [analytics/refinery@fa63f82e] (duration: 00m 28s) [10:26:17] PROBLEM - Host cloudgw1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:27:45] RECOVERY - Host cloudgw1003 is UP: PING OK - Packet loss = 0%, RTA = 0.36 ms [10:28:43] !log brouberol@deploy2002 Started deploy [analytics/refinery@fa63f82]: Regular analytics train [analytics/refinery@fa63f82e] [10:29:52] !log brouberol@deploy2002 Finished deploy 
[analytics/refinery@fa63f82]: Regular analytics train [analytics/refinery@fa63f82e] (duration: 01m 09s) [10:32:15] PROBLEM - Host cloudgw1004 is DOWN: PING CRITICAL - Packet loss = 100% [10:32:29] !log joal@deploy2002 Started deploy [analytics/refinery@fa63f82] (thin): Regular analytics train THIN [analytics/refinery@fa63f82e] [10:32:53] RECOVERY - Host cloudgw1004 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [10:33:01] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [10:33:27] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [10:33:42] !log joal@deploy2002 Finished deploy [analytics/refinery@fa63f82] (thin): Regular analytics train THIN [analytics/refinery@fa63f82e] (duration: 01m 13s) [10:35:31] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [10:35:52] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [10:39:45] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1013.eqiad.wmnet with OS trixie [10:40:03] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie [10:42:48] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1013.eqiad.wmnet with OS trixie [10:43:06] (03PS4) 10Brouberol: test-kitchen-next: allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1212430 (https://phabricator.wikimedia.org/T407805) [10:43:06] (03PS3) 10Brouberol: test-kitchen-next: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212437 (https://phabricator.wikimedia.org/T407805) [10:43:06] (03PS3) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [10:43:07] 
(03PS4) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [10:43:08] (03PS4) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [10:43:10] (03PS4) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [10:43:14] (03PS4) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [10:43:18] (03PS4) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [10:43:22] (03PS1) 10Brouberol: test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 [10:43:26] (03PS1) 10Brouberol: test-kitchen: allow both mpic/test-kitchen domains in the OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1213428 [10:43:58] (03PS4) 10Brouberol: test-kitchen-next: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212437 (https://phabricator.wikimedia.org/T407805) [10:43:59] (03PS5) 10Brouberol: test-kitchen-next: allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1212430 (https://phabricator.wikimedia.org/T407805) [10:43:59] (03PS2) 10Brouberol: test-kitchen: allow both mpic/test-kitchen domains in the OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1213428 [10:43:59] (03PS2) 10Brouberol: test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 [10:44:00] (03PS4) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [10:44:01] (03PS5) 
10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [10:44:05] (03PS5) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [10:44:09] (03PS5) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [10:44:13] (03PS5) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [10:44:17] (03PS5) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [10:45:56] (03PS3) 10A smart kitten: enwikibooks: Limit FlaggedRevs to specific namespaces; disable FR stable-transclusion-checking [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) [10:46:11] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:46:24] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ml-serve1013.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART [10:47:24] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve1013 [10:47:48] !log upgrade Envoy on matomo1001 T405808 [10:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:50] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [10:49:45] (03CR) 10Santiago Faci: testkitchen: rename the OIDC services (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:51:00] (03CR) 10Brouberol: testkitchen: rename the OIDC services (031 comment) 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:51:32] !log Deployed refinery using scap, then deployed onto hdfs [10:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:28] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve1013 [10:53:11] (03PS6) 10Brouberol: testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) [10:54:49] (03CR) 10Brouberol: [C:03+2] test-kitchen-next: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212437 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:55:27] (03CR) 10Santiago Faci: testkitchen: reconfigure the OIDC service ids to support 2 domains (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1212429 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:56:46] (03Abandoned) 10Brouberol: testkitchen: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212429 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [10:58:57] (03CR) 10Brouberol: "`" [puppet] - 10https://gerrit.wikimedia.org/r/1212430 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1100) [11:00:09] (03CR) 10Brouberol: [C:03+2] test-kitchen-next: allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1212430 
(https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [11:02:04] (03CR) 10A smart kitten: "Resolved - No community objection to also preventing FlaggedRevs from checking whether any transcluded pages _have_ changes pending review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) (owner: 10A smart kitten) [11:02:31] (03CR) 10A smart kitten: "Hey! Given the... uhh, 'interesting', state of the FlaggedRevs codebase, I'd like to get a review/+1 from someone who knows more about the" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1201051 (https://phabricator.wikimedia.org/T408110) (owner: 10A smart kitten) [11:02:50] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie [11:03:03] !log btullis@cumin1003 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1010.eqiad.wmnet [11:09:26] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1010.eqiad.wmnet [11:25:53] (03CR) 10Muehlenhoff: "Looks good, two nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [11:27:22] (03PS14) 10Elukey: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 [11:28:50] !log elukey@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host ml-serve1013.eqiad.wmnet with OS trixie [11:29:04] !log restarting envoyproxy process on cephosd100[1-5] for T405808 [11:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:07] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [11:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:30:50] Lucas_WMDE: I'm patiently waiting for CI on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikimediaEvents/+/1213429 [11:32:11] (03CR) 10Matthias Mullie: [C:03+1] ReaderExperiments' StickyHeaders stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212134 (https://phabricator.wikimedia.org/T410533) (owner: 10Marco Fossati) [11:35:27] (03CR) 10Brouberol: [C:03+2] test-kitchen-next: set the OIDC callback URL domain to test-kitchen-next.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212420 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [11:36:05] !log elukey@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1013.eqiad.wmnet with OS trixie [11:39:19] (03PS1) 10Muehlenhoff: Extend Cumin alias for backups [puppet] - 10https://gerrit.wikimedia.org/r/1213432 [11:41:32] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:41:39] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11418917 (10MoritzMuehlenhoff) [11:47:41] !log elukey@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [11:48:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2157.codfw.wmnet with reason: Maintenance [11:49:03] !log marostegui@cumin1003 dbctl commit (dc=all): 
'Depooling db2157 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86236 and previous config saved to /var/cache/conftool/dbconfig/20251201-114902-marostegui.json [11:49:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:49:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:50:38] (03CR) 10Jcrespo: [C:03+1] "Indeed, I forgot to add that last role." [puppet] - 10https://gerrit.wikimedia.org/r/1213432 (owner: 10Muehlenhoff) [11:51:23] (03CR) 10Jcrespo: [C:03+1] "Let me know if you want me to deploy, not touching it unless you tell me to." [puppet] - 10https://gerrit.wikimedia.org/r/1213432 (owner: 10Muehlenhoff) [11:53:32] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1013.eqiad.wmnet with reason: host reimage [11:54:02] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [11:56:16] (03PS1) 10Majavah: P:wmcs::cloudgw: Implement new virt network config structure [puppet] - 10https://gerrit.wikimedia.org/r/1213436 (https://phabricator.wikimedia.org/T411081) [11:56:19] (03PS1) 10Bartosz Dziewoński: CentralAuthUser: Cache getLocalGroups() [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213437 (https://phabricator.wikimedia.org/T410878) [11:56:30] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213437 (https://phabricator.wikimedia.org/T410878) (owner: 10Bartosz Dziewoński) [12:06:15] (03CR) 10Muehlenhoff: "Sure, go ahead :-)" [puppet] - 10https://gerrit.wikimedia.org/r/1213432 (owner: 10Muehlenhoff) [12:06:42] (03CR) 10Jcrespo: [C:03+2] Extend Cumin alias 
for backups [puppet] - 10https://gerrit.wikimedia.org/r/1213432 (owner: 10Muehlenhoff) [12:14:32] (03PS1) 10Esanders: FlowMoveBoardsToSubpages: Add 'title' option for moving a specific board [extensions/Flow] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213442 (https://phabricator.wikimedia.org/T402552) [12:14:57] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [extensions/Flow] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213442 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [12:21:50] (03PS2) 10Majavah: P:wmcs::cloudgw: Implement new virt network config structure [puppet] - 10https://gerrit.wikimedia.org/r/1213436 (https://phabricator.wikimedia.org/T411081) [12:24:03] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7771/co" [puppet] - 10https://gerrit.wikimedia.org/r/1213436 (https://phabricator.wikimedia.org/T411081) (owner: 10Majavah) [12:36:37] (03PS1) 10Elukey: profile::amd_gpu: update firmware-amd-graphics target for trixie [puppet] - 10https://gerrit.wikimedia.org/r/1213450 [12:37:58] (03CR) 10Hnowlan: "Makes sense to me broadly. I suspect we can just remove this group though" [puppet] - 10https://gerrit.wikimedia.org/r/1212530 (owner: 10Muehlenhoff) [12:39:13] (03PS1) 10Ladsgroup: admin: Add my FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1213451 [12:39:24] (03CR) 10Elukey: [C:03+2] "Merging to unblock the ml-serve1013's reimage." 
[puppet] - 10https://gerrit.wikimedia.org/r/1213450 (owner: 10Elukey) [12:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:41:24] 06SRE, 06Infrastructure-Foundations, 10netops: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11419139 (10MoritzMuehlenhoff) Rancid is a bit of a maze of scripts calling each other, but I could eventually track it down to /usr/bin/control_rancid. In our case, the... [12:45:46] (03PS1) 10Slyngshede: Form labels: Fix for labels for Codex styled forms [software/bitu] - 10https://gerrit.wikimedia.org/r/1213452 (https://phabricator.wikimedia.org/T410492) [12:47:02] !log elukey@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1013.eqiad.wmnet with OS trixie [12:53:03] !log elukey@cumin2002 START - Cookbook sre.hosts.powercycle for host ml-serve1013 [12:55:28] PROBLEM - Host ml-serve1013 is DOWN: PING CRITICAL - Packet loss = 100% [12:56:32] ACKNOWLEDGEMENT - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2239) taken more than 3 days ago: Most recent backup 2025-11-27 04:56:09 Jcrespo rerun after package upgrade https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:56:32] ACKNOWLEDGEMENT - snapshot of s4 in codfw on backupmon1001 is CRITICAL: snapshot for s4 at codfw (db2239) taken more than 3 days ago: Most recent backup 2025-11-27 01:55:26 Jcrespo rerun after package upgrade https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [12:56:58] RECOVERY - Host ml-serve1013 is UP: PING OK - Packet loss = 0%, RTA = 0.39 ms [12:58:08] !log elukey@cumin2002 END (PASS) - Cookbook sre.hosts.powercycle (exit_code=0) for host ml-serve1013 [13:11:20] 
06SRE, 10SRE-Access-Requests: Yubikey-SSH-FIDO for Hugh Nowlan (hnowlan) - https://phabricator.wikimedia.org/T411365 (10hnowlan) 03NEW [13:12:40] (03PS1) 10Hnowlan: admin: add yubikey for hnowlan [puppet] - 10https://gerrit.wikimedia.org/r/1213456 (https://phabricator.wikimedia.org/T411365) [13:13:36] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1213451 (owner: 10Ladsgroup) [13:16:41] !log imported rancid 3.13-2+wmf12u1 for bookworm-wikimedia and 3.14-1+wmf13u1 for trixie-wikimedia T410606 [13:16:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:44] T410606: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606 [13:16:51] (03PS2) 10Ladsgroup: admin: Add my FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1213451 [13:16:55] (03CR) 10Ladsgroup: [V:03+2 C:03+2] admin: Add my FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1213451 (owner: 10Ladsgroup) [13:17:14] 06SRE, 06Infrastructure-Foundations, 10netops: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11419237 (10MoritzMuehlenhoff) Also reported to Debian as https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1121730 [13:20:00] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:21:30] (03PS1) 10Majavah: Add dumps-rsync [dns] - 10https://gerrit.wikimedia.org/r/1213461 (https://phabricator.wikimedia.org/T306550) [13:23:29] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] "The group is only applied on the graphite* hosts, which will be taken down in half a year, we can clean out the group with all the other d" [puppet] - 
10https://gerrit.wikimedia.org/r/1212530 (owner: 10Muehlenhoff) [13:25:31] 06SRE, 06Infrastructure-Foundations, 10netops: rancid: message has lines too long for transport - https://phabricator.wikimedia.org/T410606#11419246 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Updates have been rolled out and diffs are being sent again. [13:26:41] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1213456 (https://phabricator.wikimedia.org/T411365) (owner: 10Hnowlan) [13:28:54] (03CR) 10Lucas Werkmeister (WMDE): ReaderExperiments' StickyHeaders stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212134 (https://phabricator.wikimedia.org/T410533) (owner: 10Marco Fossati) [13:29:55] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Api: Initialise reference variable [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212611 (https://phabricator.wikimedia.org/T411075) (owner: 10Bartosz Dziewoński) [13:30:13] (03PS1) 10Majavah: P:dumps::distribution: Simplify how Rsync ACLs are generated [puppet] - 10https://gerrit.wikimedia.org/r/1213463 [13:32:02] (03CR) 10CI reject: [V:04-1] P:dumps::distribution: Simplify how Rsync ACLs are generated [puppet] - 10https://gerrit.wikimedia.org/r/1213463 (owner: 10Majavah) [13:32:18] (03CR) 10Kgraessle: Enable revertrisk filters in thwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [13:33:03] (03CR) 10Harroyo-wmf: [C:03+1] Set new $wgRateLimits config for edit attempt log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865) (owner: 10Samuel (WMF)) [13:33:19] (03PS2) 10Majavah: P:dumps::distribution: Simplify how Rsync ACLs are generated [puppet] - 10https://gerrit.wikimedia.org/r/1213463 [13:34:42] (03CR) 10Lucas Werkmeister (WMDE): [C:04-1] Deploy 2025 Global Readers Survey 
(non-enwiki) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [13:35:18] (03PS3) 10Majavah: P:dumps::distribution: Simplify how Rsync ACLs are generated [puppet] - 10https://gerrit.wikimedia.org/r/1213463 [13:36:10] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/7774/co" [puppet] - 10https://gerrit.wikimedia.org/r/1213463 (owner: 10Majavah) [13:36:22] (03CR) 10Lucas Werkmeister (WMDE): "Looks safe to backport to me, given it’s a maintenance script (and one that’s not run on any kind of timer AFAICT), but it would be nice t" [extensions/Flow] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213442 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [13:42:16] !log upgrade Envoy on deployment servers T405808 [13:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:19] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [13:43:05] !log T408431: reindexing all wikis in codfw [13:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:07] T408431: Reindex all wikis - https://phabricator.wikimedia.org/T408431 [13:46:33] “Deploy various plugins to fix various things” so specific :D [13:46:56] 10SRE-tools, 06Data-Persistence, 06Infrastructure-Foundations, 06serviceops-radar: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152#11419328 (10ayounsi) 05Open→03Resolved a:03ayounsi I think this is all done, we have cookbooks in place. [13:55:40] (03CR) 10Muehlenhoff: [C:03+1] "Looks good, the srange() function of firewall::service resolves host names on the puppet server side and handles 4/6 correctly." 
[puppet] - 10https://gerrit.wikimedia.org/r/1213463 (owner: 10Majavah) [13:56:11] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution: Simplify how Rsync ACLs are generated [puppet] - 10https://gerrit.wikimedia.org/r/1213463 (owner: 10Majavah) [13:56:20] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11419367 (10JKelsoteel-WMF) Just an update - Noah (the requester) needs the email address by December 3. [13:56:27] (03PS1) 10Kosta Harlan: EventLogging: Register mediawiki.hcaptcha.edit stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213468 (https://phabricator.wikimedia.org/T406865) [13:59:07] (03CR) 10Majavah: [C:03+2] mediawiki: Add redirects for old Toki Pona aliases [puppet] - 10https://gerrit.wikimedia.org/r/1212578 (https://phabricator.wikimedia.org/T404507) (owner: 10Majavah) [14:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: May I have your attention please! UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1400) [14:00:05] mfossati, MatmaRex, danisztls, and edsanders: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:15] o/ [14:00:21] o/ [14:00:30] o/ [14:00:42] I could deploy :) [14:01:35] sounds good [14:01:36] let’s start with mfossati [14:01:48] do you want to self-service? [14:02:09] 10ops-magru, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: MAGRU power maint - CHG0262056 - October 29-30, 2025 - https://phabricator.wikimedia.org/T408589#11419402 (10ayounsi) 05Open→03Resolved a:03ayounsi I guess we're good here. [14:02:11] also fine, I'll go for it then [14:02:11] (had to look up if you have deployment access first ^^) [14:02:14] ok! 
[14:02:27] starting [14:02:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by mfossati@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212134 (https://phabricator.wikimedia.org/T410533) (owner: 10Marco Fossati) [14:03:11] (03PS1) 10Ladsgroup: admin: Remove my non-FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1213472 [14:03:14] MatmaRex: should your backports be deployed together or separately? [14:03:56] (03PS2) 10Kosta Harlan: EventLogging: Register mediawiki.hcaptcha.edit stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213468 (https://phabricator.wikimedia.org/T406865) [14:04:33] (03CR) 10Ladsgroup: [C:03+2] admin: Remove my non-FIDO key [puppet] - 10https://gerrit.wikimedia.org/r/1213472 (owner: 10Ladsgroup) [14:04:55] hi [14:05:00] Lucas_WMDE: either is fine [14:05:22] ok [14:05:32] (03Merged) 10jenkins-bot: ReaderExperiments' StickyHeaders stream configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212134 (https://phabricator.wikimedia.org/T410533) (owner: 10Marco Fossati) [14:05:38] the CentralAuthUser change looks a bit scary to me so I think I’d go for separately [14:05:44] and hope we have enough time [14:05:55] !log mfossati@deploy2002 Started scap sync-world: Backport for [[gerrit:1212134|ReaderExperiments' StickyHeaders stream configuration (T410533)]] [14:05:58] T410533: Extend SessionLengthInstrumentMixin to support xLab Experiments - https://phabricator.wikimedia.org/T410533 [14:06:08] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212611 (https://phabricator.wikimedia.org/T411075) (owner: 10Bartosz Dziewoński) [14:09:02] !log mfossati@deploy2002 mfossati: Backport for [[gerrit:1212134|ReaderExperiments' StickyHeaders stream configuration (T410533)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:09:27] checking [14:11:25] !log mfossati@deploy2002 mfossati: Continuing with sync [14:13:36] (03CR) 10DDesouza: [C:04-1] Deploy 2025 Global Readers Survey (non-enwiki) (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [14:17:46] !log mfossati@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212134|ReaderExperiments' StickyHeaders stream configuration (T410533)]] (duration: 11m 51s) [14:17:49] T410533: Extend SessionLengthInstrumentMixin to support xLab Experiments - https://phabricator.wikimedia.org/T410533 [14:18:09] Lucas_WMDE: done here! [14:18:15] ok, thanks! [14:18:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212611 (https://phabricator.wikimedia.org/T411075) (owner: 10Bartosz Dziewoński) [14:18:43] let’s fix the top entry of logspam-watch at the moment ^^ [14:19:55] (03CR) 10Ssingh: [C:03+1] "Let's try it." [cookbooks] - 10https://gerrit.wikimedia.org/r/1212596 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [14:22:25] Lucas_WMDE: the fix for T411075 is tricky to verify without cluttering production wikis, so i suggest we just deploy it and watch the logs afterwards. 
i don't have anything to test on mwdebug [14:22:27] T411075: TypeError: Unsupported operand types: array + null - https://phabricator.wikimedia.org/T411075 [14:22:57] (03Merged) 10jenkins-bot: Api: Initialise reference variable [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212611 (https://phabricator.wikimedia.org/T411075) (owner: 10Bartosz Dziewoński) [14:23:16] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1212611|Api: Initialise reference variable (T411075)]] [14:23:39] i can probably test the CentralAuth fix by logging in twice and seeing if the second time is faster (with profiling) [14:25:06] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1212611|Api: Initialise reference variable (T411075)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:26:18] sorry, got sidetracked by T392023 for a sec [14:26:19] T392023: RuntimeException: At least one of user ID, actor ID or user name must be given - https://phabricator.wikimedia.org/T392023 [14:26:24] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Continuing with sync [14:26:29] going ahead with the first fix, thanks [14:26:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213437 (https://phabricator.wikimedia.org/T410878) (owner: 10Bartosz Dziewoński) [14:26:57] (03CR) 10Marco Fossati: ReaderExperiments' StickyHeaders stream configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212134 (https://phabricator.wikimedia.org/T410533) (owner: 10Marco Fossati) [14:28:04] !log sukhe@cumin1003 START - Cookbook sre.hosts.reimage for host cp2043.codfw.wmnet with OS trixie [14:28:20] 10ops-codfw, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - 
https://phabricator.wikimedia.org/T392851#11419516 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin1003 for host cp2043.codfw.wmnet with OS trixie [14:29:22] so far logspam-watch shows if anything a slight upwards trend in the array + null warning lol [14:29:26] * Lucas_WMDE waits patiently [14:30:19] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212611|Api: Initialise reference variable (T411075)]] (duration: 07m 04s) [14:30:22] T411075: TypeError: Unsupported operand types: array + null - https://phabricator.wikimedia.org/T411075 [14:30:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213437 (https://phabricator.wikimedia.org/T410878) (owner: 10Bartosz Dziewoński) [14:31:05] (03CR) 10Gkyziridis: [C:03+1] "LGTM!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207932 (https://phabricator.wikimedia.org/T409438) (owner: 10Kgraessle) [14:31:12] ok, mediawiki-errors in logstash is looking okay so far [14:31:21] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213468 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [14:31:22] (03Merged) 10jenkins-bot: CentralAuthUser: Cache getLocalGroups() [extensions/CentralAuth] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213437 (https://phabricator.wikimedia.org/T410878) (owner: 10Bartosz Dziewoński) [14:31:40] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1213437|CentralAuthUser: Cache getLocalGroups() (T410878)]] [14:31:43] T410878: wmfGetPrivilegedGroups is slow - https://phabricator.wikimedia.org/T410878 [14:31:53] the array + null error wasn't that common, about one 
per minute [14:32:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865) (owner: 10Samuel (WMF)) [14:32:27] yeah [14:32:43] the two patches I added to the window are both no-ops [14:33:03] 06SRE, 10DNS, 06serviceops, 06Traffic, 07Language codes: Redirect legacy language codes for Toki Pona to tok.wikipedia.org - https://phabricator.wikimedia.org/T404507#11419541 (10taavi) 05Open→03Resolved a:03taavi [14:33:32] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Backport for [[gerrit:1213437|CentralAuthUser: Cache getLocalGroups() (T410878)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [14:34:06] MatmaRex: please test :) [14:34:12] * Lucas_WMDE looks at kostajh’s patches [14:34:22] oh right, edsanders would be up first though [14:34:55] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "master change was merged, so let’s start gate-and-submit before deployment here :)" [extensions/Flow] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213442 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [14:35:29] (and skipping Daniel’s change per the latest comments on it, just for the record) [14:35:31] I can deploy [14:35:50] sounds good to me (once the current one is done) [14:36:06] (03CR) 10Slyngshede: [C:03+2] Update Meta geo-mapping [dns] - 10https://gerrit.wikimedia.org/r/1206185 (https://phabricator.wikimedia.org/T409735) (owner: 10Slyngshede) [14:36:19] !log slyngshede@dns1004 START - running authdns-update [14:36:48] ugh, logging in makes so many HTTP requests that the profiles for the first ones (the important ones) are no longer available in the WikimediaDebug extension's popup window [14:36:59] it apparently has a limit of 100 [14:37:30] 
!log slyngshede@dns1004 END - running authdns-update [14:37:40] :/ [14:38:07] 06SRE, 06Traffic, 13Patch-For-Review: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11419563 (10SLyngshede-WMF) 05Open→03Resolved a:03SLyngshede-WMF [14:38:11] can you find them in xhgui? e.g. https://performance.wikimedia.org/xhgui/?url=index.php [14:40:57] actually, nevermind, if the latest result of that search is 26 Nov then it’s definitely not you [14:41:11] i think i got what i needed now [14:41:22] the second login attempt does a bit less extra stuff [14:41:52] so i have this profile: https://performance.wikimedia.org/excimer/profile/298342f6fd3fd712 which doesn't have the cached function in it [14:41:56] i think things look good [14:42:18] * Lucas_WMDE should learn about excimer at some point [14:42:19] ok to deploy? [14:42:23] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374 (10Jclark-ctr) 03NEW [14:42:29] yep [14:42:32] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, matmarex: Continuing with sync [14:42:33] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411375 (10Jclark-ctr) 03NEW [14:42:35] ok, thanks! 
[14:43:39] (03PS1) 10Herron: Revert "thanos-store: set cutoff days to 1" [puppet] - 10https://gerrit.wikimedia.org/r/1213487 [14:43:48] (03Merged) 10jenkins-bot: FlowMoveBoardsToSubpages: Add 'title' option for moving a specific board [extensions/Flow] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213442 (https://phabricator.wikimedia.org/T402552) (owner: 10Esanders) [14:46:32] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213437|CentralAuthUser: Cache getLocalGroups() (T410878)]] (duration: 14m 51s) [14:46:36] T410878: wmfGetPrivilegedGroups is slow - https://phabricator.wikimedia.org/T410878 [14:46:41] 06SRE, 10Cassandra, 06Data-Persistence: Discovery of Cassandra cluster nodes - https://phabricator.wikimedia.org/T410075#11419626 (10Eevans) >>! In T410075#11412416, @elukey wrote: >>>! In T410075#11409819, @Eevans wrote: >> > [ ... ] > The only thing that we can explore at this point is a custom external s... [14:46:46] edsanders: over to you [14:46:49] ok [14:47:00] (03PS5) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [14:47:00] (03PS6) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [14:47:00] (03PS6) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [14:47:01] (03PS6) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [14:47:02] (03PS6) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [14:47:03] (03PS6) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 
(https://phabricator.wikimedia.org/T407805) [14:47:07] (03PS1) 10Brouberol: Redirect mpic-next.w.o to test-kitchen-next.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1213489 (https://phabricator.wikimedia.org/T407805) [14:47:13] MatmaRex: the array + null errors seem to have stopped completely \o/ [14:47:39] !log esanders@deploy2002 Started scap sync-world: Backport for [[gerrit:1213442|FlowMoveBoardsToSubpages: Add 'title' option for moving a specific board (T402552)]] [14:47:41] T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552 [14:47:48] thanks for deploying [14:48:23] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11419635 (10Jclark-ctr) @MoritzMuehlenhoff not working for me ` jclark@ganeti1039:~$ sudo smartctl -H /dev/sda [sudo] password for jclark: ` [14:48:49] (03CR) 10Elukey: [C:03+1] Revert "thanos-store: set cutoff days to 1" [puppet] - 10https://gerrit.wikimedia.org/r/1213487 (owner: 10Herron) [14:49:39] !log esanders@deploy2002 esanders: Backport for [[gerrit:1213442|FlowMoveBoardsToSubpages: Add 'title' option for moving a specific board (T402552)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[14:50:03] !log esanders@deploy2002 esanders: Continuing with sync [14:50:16] (03CR) 10Eevans: [C:03+1] Remove unused cassandra-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1212140 (owner: 10Muehlenhoff) [14:51:29] (03PS2) 10Brouberol: Redirect mpic-next.w.o to test-kitchen-next.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1213489 (https://phabricator.wikimedia.org/T407805) [14:51:29] (03PS3) 10Brouberol: test-kitchen: allow both mpic/test-kitchen domains in the OIDC config [puppet] - 10https://gerrit.wikimedia.org/r/1213428 [14:51:29] (03PS3) 10Brouberol: test-kitchen: Allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1213427 [14:51:30] (03PS6) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [14:51:31] (03PS7) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [14:51:32] (03PS7) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [14:51:36] (03PS7) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [14:51:40] (03PS7) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [14:51:44] (03PS7) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [14:52:03] kostajh: want to self-service your changes afterwards? 
[14:53:09] (03PS1) 10Andrew Bogott: cloudweb100[34]: prepare for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1213491 (https://phabricator.wikimedia.org/T409579) [14:53:52] 06SRE, 06Infrastructure-Foundations: Request additional access for Dcops group - https://phabricator.wikimedia.org/T395939#11419648 (10elukey) @MoritzMuehlenhoff what we may need to do is to move all disk/partition/raid/etc.. commands from `datacenter-ops` to `ops-limited`, what do you think? [14:54:10] !log esanders@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213442|FlowMoveBoardsToSubpages: Add 'title' option for moving a specific board (T402552)]] (duration: 06m 31s) [14:54:13] T402552: ptwikibooks: Migrate Flow boards to archival subpages - https://phabricator.wikimedia.org/T402552 [14:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:26] !log andrew@cumin2002 START - Cookbook sre.hosts.provision for host cloudweb1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [14:56:31] Lucas_WMDE: yes I could run my changes after [14:56:34] (03CR) 10Alexandros Kosiaris: [C:03+1] Redirect mpic-next.w.o to test-kitchen-next.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1213489 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:56:40] it’s free now, go ahead [14:56:42] jouncebot: next [14:56:42] In 0 hour(s) and 33 minute(s): xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1530) [14:56:53] (03CR) 10Brouberol: [C:03+2] Redirect mpic-next.w.o to test-kitchen-next.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1213489 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [14:56:54] Lucas_WMDE: thanks [14:58:09] (03CR) 10TrainBranchBot: 
[C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213468 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [14:58:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865) (owner: 10Samuel (WMF)) [14:59:39] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS bookworm [14:59:46] PROBLEM - Host cloudweb1004 is DOWN: PING CRITICAL - Packet loss = 100% [14:59:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11419668 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS bookworm [15:00:00] (03Merged) 10jenkins-bot: EventLogging: Register mediawiki.hcaptcha.edit stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213468 (https://phabricator.wikimedia.org/T406865) (owner: 10Kosta Harlan) [15:00:38] (03PS11) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [15:00:57] (03Merged) 10jenkins-bot: Set new $wgRateLimits config for edit attempt log [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211295 (https://phabricator.wikimedia.org/T406865) (owner: 10Samuel (WMF)) [15:01:14] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411375#11419684 (10SLyngshede-WMF) →14Duplicate dup:03T411374 [15:01:15] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11419686 (10SLyngshede-WMF) [15:01:17] !log kharlan@deploy2002 Started scap sync-world: Backport for 
[[gerrit:1213468|EventLogging: Register mediawiki.hcaptcha.edit stream (T406865)]], [[gerrit:1211295|Set new $wgRateLimits config for edit attempt log (T406865)]] [15:01:19] T406865: hCaptcha: Implement mechanism to log about-to-be-published content when challenge is presented - https://phabricator.wikimedia.org/T406865 [15:03:00] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudweb1004.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [15:03:18] !log kharlan@deploy2002 kharlan, sguebo: Backport for [[gerrit:1213468|EventLogging: Register mediawiki.hcaptcha.edit stream (T406865)]], [[gerrit:1211295|Set new $wgRateLimits config for edit attempt log (T406865)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [15:06:58] (03PS1) 10Gergő Tisza: Convert README to Markdown [puppet] - 10https://gerrit.wikimedia.org/r/1213496 [15:07:10] !log kharlan@deploy2002 kharlan, sguebo: Continuing with sync [15:07:27] (03PS1) 10Gergő Tisza: Change the README to Markdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 [15:07:27] (03PS1) 10Gergő Tisza: noc: Point links in /conf to Gitiles rather than Differential [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213498 [15:10:00] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:10:12] (03CR) 10Jforrester: [C:03+1] Change the README to Markdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 (owner: 10Gergő Tisza) [15:10:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86237 and previous config saved to 
/var/cache/conftool/dbconfig/20251201-151019-marostegui.json [15:10:20] (03CR) 10Jforrester: [C:03+1] noc: Point links in /conf to Gitiles rather than Differential [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213498 (owner: 10Gergő Tisza) [15:10:25] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:10:25] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:12:20] !log kharlan@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213468|EventLogging: Register mediawiki.hcaptcha.edit stream (T406865)]], [[gerrit:1211295|Set new $wgRateLimits config for edit attempt log (T406865)]] (duration: 11m 03s) [15:12:23] T406865: hCaptcha: Implement mechanism to log about-to-be-published content when challenge is presented - https://phabricator.wikimedia.org/T406865 [15:13:28] (03CR) 10Tiziano Fogli: [C:03+2] Revert "thanos-store: set cutoff days to 1" [puppet] - 10https://gerrit.wikimedia.org/r/1213487 (owner: 10Herron) [15:15:21] !log UTC afternoon backport+config window done [15:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:15] (03PS1) 10Esanders: Set Flow to read-only on ptwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213501 (https://phabricator.wikimedia.org/T402552) [15:16:42] (03PS1) 10Herron: wip [puppet] - 10https://gerrit.wikimedia.org/r/1213502 [15:18:56] sukhe@cumin1003 reimage (PID 4050898) is awaiting input [15:19:20] (03CR) 10Hnowlan: [C:03+2] admin: add yubikey for hnowlan [puppet] - 10https://gerrit.wikimedia.org/r/1213456 (https://phabricator.wikimedia.org/T411365) (owner: 10Hnowlan) [15:19:21] !log sukhe@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp2043.codfw.wmnet with OS trixie [15:19:30] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [15:19:34] 10ops-codfw, 06SRE, 
06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#11419755 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin1003 for host cp2043.codfw.wmnet with OS trixie executed with err... [15:20:00] FIRING: [3x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:22:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [15:23:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [15:24:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1032.eqiad.wmnet with reason: host reimage [15:25:00] FIRING: [4x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:25:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P86238 and previous config saved to /var/cache/conftool/dbconfig/20251201-152527-marostegui.json [15:26:36] (03PS1) 10Daniel Kinzler: rest gateway: do not rate limit internal traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213504 (https://phabricator.wikimedia.org/T410143) [15:28:23] (03CR) 10Tjones: "Looks good! I'm looking forward to it being out there!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212150 (https://phabricator.wikimedia.org/T408737) (owner: 10DCausse) [15:29:16] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11419795 (10jhathaway) p:05Triage→03Medium [15:30:00] (03PS7) 10Brouberol: test-kitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [15:30:00] (03PS8) 10Brouberol: test-kitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [15:30:02] (03PS8) 10Brouberol: test-kitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [15:30:04] Deploy window xLab Experiment Deployment Window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1530) [15:30:06] (03PS8) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [15:30:10] (03PS8) 10Brouberol: Move mpic service mesh entry to test-kitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [15:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:30:14] (03PS8) 10Brouberol: mpic: delete services from service list [puppet] - 
10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [15:30:18] (03PS1) 10Brouberol: Redirect mpic.w.o to test-kitchen.w.o [puppet] - 10https://gerrit.wikimedia.org/r/1213505 (https://phabricator.wikimedia.org/T407805) [15:30:23] 06SRE, 06Infrastructure-Foundations: Improve "reuse" feature for standard partman recipes - https://phabricator.wikimedia.org/T410601#11419804 (10MoritzMuehlenhoff) p:05Triage→03Low [15:31:19] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 06Traffic: Reboot cookbook workflow leaves Puppet disabled - https://phabricator.wikimedia.org/T410944#11419817 (10elukey) p:05Triage→03Medium [15:32:57] 06SRE, 06Infrastructure-Foundations: wmf-auto-restart: Add a filter list - https://phabricator.wikimedia.org/T411032#11419832 (10MoritzMuehlenhoff) p:05Triage→03Low [15:34:41] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1187 gradually with 4 steps - After schema change [15:35:00] RESOLVED: [4x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:35:07] 06SRE, 06Infrastructure-Foundations: Broadcom Nic not supporting uefi with older firmware - https://phabricator.wikimedia.org/T411374#11419840 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:35:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [15:37:33] (03CR) 10Krinkle: [C:03+1] Convert README to Markdown [puppet] - 10https://gerrit.wikimedia.org/r/1213496 (owner: 10Gergő Tisza) [15:37:48] (03CR) 10Krinkle: [C:03+1] Change the README to Markdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 (owner: 10Gergő Tisza) 
[15:38:27] 06SRE, 06Traffic: Meta query about why we map 31.13.103.0/24 to US - https://phabricator.wikimedia.org/T409735#11419845 (10SLyngshede-WMF) @cmooney can you let the people from Meta know that this should be fixed now? [15:39:03] (03PS1) 10Majavah: P:grafana: Default to UTC timezone [puppet] - 10https://gerrit.wikimedia.org/r/1213506 (https://phabricator.wikimedia.org/T411274) [15:40:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:40:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157', diff saved to https://phabricator.wikimedia.org/P86240 and previous config saved to /var/cache/conftool/dbconfig/20251201-154035-marostegui.json [15:41:32] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:43:38] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T410589)', diff saved to https://phabricator.wikimedia.org/P86241 and previous config saved to /var/cache/conftool/dbconfig/20251201-154337-ladsgroup.json [15:43:40] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [15:43:55] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1032.eqiad.wmnet with OS bookworm [15:44:07] (03CR) 10Krinkle: [C:03+1] P:grafana: Default to UTC timezone [puppet] - 10https://gerrit.wikimedia.org/r/1213506 (https://phabricator.wikimedia.org/T411274) (owner: 10Majavah) [15:44:09] 10ops-eqiad, 06SRE, 
06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11419870 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host wdqs1032.eqiad.wmnet with OS bookworm completed: - wdqs1... [15:45:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:47:20] (03PS1) 10BPirkle: REST: enable the site.v1 module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213515 (https://phabricator.wikimedia.org/T409516) [15:50:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:50:26] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS trixie [15:50:55] !log bking@wmf3062 restart wdqs codfw for high lag https://docs.google.com/spreadsheets/d/1UaabYlqj37EEaLAkrRArn4yNuNviGObgsGTfquIIHAQ/edit?gid=0#gid=0 [15:50:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:52:04] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:54:01] (03CR) 10Krinkle: [C:03+1] noc: Point links in /conf to Gitiles rather than Differential [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213498 (owner: 10Gergő Tisza) [15:54:48] (03CR) 10Filippo Giunchedi: "LGTM, adding o11y folks JFYI" [puppet] - 10https://gerrit.wikimedia.org/r/1213506 (https://phabricator.wikimedia.org/T411274) (owner: 10Majavah) [15:55:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [15:55:09] (03PS2) 10Muehlenhoff: Remove unused cassandra-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1212140 [15:55:31] (03CR) 10Andrew Bogott: [C:03+2] cloudweb100[34]: prepare for UEFI [puppet] - 10https://gerrit.wikimedia.org/r/1213491 (https://phabricator.wikimedia.org/T409579) (owner: 10Andrew Bogott) [15:55:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2157 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86243 and previous config saved to /var/cache/conftool/dbconfig/20251201-155542-marostegui.json [15:55:47] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:55:48] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:55:54] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudweb1004.wikimedia.org with OS trixie [15:55:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2171.codfw.wmnet with reason: Maintenance [15:56:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2171 
(T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86244 and previous config saved to /var/cache/conftool/dbconfig/20251201-155606-marostegui.json [15:56:07] !log "thanos-store: set cutoff days to 1" reverted on titan1001 (1/4) [15:56:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:17] RESOLVED: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:56:39] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS trixie [15:56:40] !log "thanos-store: set cutoff days to 1" reverted on titan1001 (1/4) T410152 [15:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:43] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [15:56:53] (03CR) 10Filippo Giunchedi: [C:03+1] Add dumps-rsync [dns] - 10https://gerrit.wikimedia.org/r/1213461 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [15:58:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86245 and previous config saved to /var/cache/conftool/dbconfig/20251201-155844-ladsgroup.json [15:59:05] (03CR) 10CDanis: [C:03+2] stat hosts: zram: use up to 50% of RAM [puppet] - 10https://gerrit.wikimedia.org/r/1211744 (https://phabricator.wikimedia.org/T376813) (owner: 10CDanis) [15:59:05] (03CR) 10Majavah: [C:03+2] Add dumps-rsync [dns] - 10https://gerrit.wikimedia.org/r/1213461 (https://phabricator.wikimedia.org/T306550) (owner: 10Majavah) [15:59:17] !log taavi@dns1004 START - running authdns-update [16:00:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:00:10] !log taavi@dns1004 END - running authdns-update [16:01:17] FIRING: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:01:31] (03CR) 10Muehlenhoff: [C:03+2] Remove unused cassandra-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1212140 (owner: 10Muehlenhoff) [16:02:06] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:02:06] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:05:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:05:20] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1213452 (https://phabricator.wikimedia.org/T410492) (owner: 10Slyngshede) [16:06:17] FIRING: [3x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:03] (03PS1) 10CDanis: zramswap: notify service on config change [puppet] - 
10https://gerrit.wikimedia.org/r/1213521 (https://phabricator.wikimedia.org/T376813) [16:08:44] 06SRE, 10conftool, 06Data-Persistence, 06Infrastructure-Foundations: Integrate dbctl IP changes as part of VLAN changes. - https://phabricator.wikimedia.org/T360029#11419969 (10Marostegui) I've been talking to @ayounsi about this ticket, who could help us doing the automation/integration of dbctl and this... [16:09:56] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:10:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:11:17] FIRING: [4x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:56] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.079 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:11:56] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9309 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:12:03] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11419980 (10herron) [16:12:38] 10SRE-SLO: Sloth: adapt default month view to quarter view (pilot) - https://phabricator.wikimedia.org/T409312#11419997 (10herron) [16:13:08] 10SRE-SLO: Sloth: adapt default month view to quarter 
view (pilot) - https://phabricator.wikimedia.org/T409312#11419998 (10herron) 05Open→03Resolved a:03herron Agreed, looks good! [16:13:53] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86247 and previous config saved to /var/cache/conftool/dbconfig/20251201-161352-ladsgroup.json [16:15:02] RESOLVED: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:15:32] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11420003 (10herron) [16:16:17] FIRING: [5x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:16:20] (03CR) 10Tjones: [C:03+1] cirrus: enable georgian transliteration second try profile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212150 (https://phabricator.wikimedia.org/T408737) (owner: 10DCausse) [16:16:34] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959#11420006 (10Jhancock.wm) @MatthewVernon drive has arrived. please let me know if it's okay to replace the drive at this time. 
[16:19:19] (03CR) 10CDobbins: [C:03+2] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1212596 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [16:20:00] (03CR) 10Btullis: [C:03+1] zramswap: notify service on config change [puppet] - 10https://gerrit.wikimedia.org/r/1213521 (https://phabricator.wikimedia.org/T376813) (owner: 10CDanis) [16:20:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:20:07] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1187 gradually with 4 steps - After schema change [16:21:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:22:22] (03CR) 10Effie Mouzeli: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [16:24:50] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11420022 (10jhathaway) @JKelsoteel-WMF the addresses `no-reply` or `noreply` are used to indicate that the sender does not expect... 
[16:28:35] !log "thanos-store: set cutoff days to 1" reverted on titan1002 (2/4) T410152 [16:28:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:38] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [16:29:00] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T410589)', diff saved to https://phabricator.wikimedia.org/P86249 and previous config saved to /var/cache/conftool/dbconfig/20251201-162900-ladsgroup.json [16:29:03] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [16:29:16] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance [16:29:24] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T410589)', diff saved to https://phabricator.wikimedia.org/P86250 and previous config saved to /var/cache/conftool/dbconfig/20251201-162923-ladsgroup.json [16:30:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:30:05] jan_drewniak: Time to snap out of that daydream and deploy Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1630). 
[16:30:10] (03CR) 10Andrew Bogott: [C:03+2] puppetserver: Generalize git-rebase fix to work for labs/private [puppet] - 10https://gerrit.wikimedia.org/r/1198373 (https://phabricator.wikimedia.org/T397877) (owner: 10Krinkle) [16:30:49] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudweb1004.wikimedia.org with OS trixie [16:31:06] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS trixie [16:31:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:31:32] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:31:49] !log depool ms-fe2014 for disk swap T410959 [16:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:52] T410959: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959 [16:32:37] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959#11420087 (10MatthewVernon) @Jhancock.wm please go ahead - server is depooled. 
[16:32:51] (03Merged) 10jenkins-bot: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1212596 (https://phabricator.wikimedia.org/T395240) (owner: 10CDobbins) [16:33:12] (03PS1) 10Muehlenhoff: Deprecate restbase-roots/restbase-admins [puppet] - 10https://gerrit.wikimedia.org/r/1213528 (https://phabricator.wikimedia.org/T276465) [16:35:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:35:31] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11420100 (10MoritzMuehlenhoff) [16:36:17] RESOLVED: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:37:23] 06SRE, 10Infrastructure Security, 06Infrastructure-Foundations, 13Patch-For-Review: puppet admin module: Assign approvers to unix groups - https://phabricator.wikimedia.org/T276465#11420101 (10MoritzMuehlenhoff) [16:38:04] (03PS2) 10Andrew Bogott: openstack: Remove OATHAuth 2FA (wmtotp) support [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) (owner: 10Majavah) [16:38:44] (03PS2) 10DDesouza: Deploy 2025 Global Readers Survey (non-enwiki) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) [16:39:30] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on ms-fe2014 - https://phabricator.wikimedia.org/T410959#11420132 
(10Jhancock.wm) @MatthewVernon drive has been replaced. [16:39:57] (03PS3) 10Andrew Bogott: openstack: Remove OATHAuth 2FA (wmtotp) support [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) (owner: 10Majavah) [16:40:02] RESOLVED: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:40:59] (03CR) 10DDesouza: Deploy 2025 Global Readers Survey (non-enwiki) (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213123 (https://phabricator.wikimedia.org/T410918) (owner: 10DDesouza) [16:41:47] (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) (owner: 10Majavah) [16:43:50] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [16:47:13] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [16:48:14] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [16:48:41] RECOVERY - snapshot of s3 in codfw on backupmon1001 is OK: Last snapshot for s3 at codfw 
(localhost) taken on 2025-12-01 12:52:00 (1159 GiB, -0.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [16:49:11] (03PS1) 10Aqu: Allow access to urldownloader to airflow-main/workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213529 (https://phabricator.wikimedia.org/T410285) [16:49:35] (03CR) 10Michael Große: [C:03+1] "From a purely config point-of-view, this is fine. Does it still need a product/design go-ahead?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) (owner: 10Urbanecm) [16:49:38] (03CR) 10Andrew Bogott: [C:03+2] openstack: Remove OATHAuth 2FA (wmtotp) support [puppet] - 10https://gerrit.wikimedia.org/r/1082239 (https://phabricator.wikimedia.org/T359590) (owner: 10Majavah) [16:50:28] jouncebot: nowandnext [16:50:28] For the next 0 hour(s) and 9 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1630) [16:50:28] In 1 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1800) [16:50:28] In 1 hour(s) and 9 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1800) [16:52:09] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:52:27] !log hnowlan@deploy2002 Started deploy [restbase/deploy@19cb647]: Add new wikis to restbase T408352 T408344 [16:52:32] T408352: Add pcmwikiquote to RESTBase - https://phabricator.wikimedia.org/T408352 [16:52:32] T408344: Add minwikisource to RESTBase - https://phabricator.wikimedia.org/T408344 [16:55:02] (03PS2) 10Aqu: Allow access to urldownloader to airflow-main/workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213529 (https://phabricator.wikimedia.org/T410285) [16:55:27] !log 
cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003*} and A:liberica [16:57:08] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11420216 (10Jhancock.wm) @MoritzMuehlenhoff the replacement has arrived. can you confirm that it's safe to replace the drive at this time. Also can you help confirm that it's the second drive that needs to be repl... [16:57:46] (03CR) 10Eevans: [C:03+1] Deprecate restbase-roots/restbase-admins [puppet] - 10https://gerrit.wikimedia.org/r/1213528 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [16:58:19] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) rebooting P{lvs6003*} and A:liberica [16:58:55] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [16:59:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1157 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86251 and previous config saved to /var/cache/conftool/dbconfig/20251201-165902-marostegui.json [16:59:07] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:59:07] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:00:11] (03CR) 10Btullis: [C:03+1] Allow access to urldownloader to airflow-main/workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213529 (https://phabricator.wikimedia.org/T410285) (owner: 10Aqu) [17:02:05] PROBLEM - mailman archives on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:02:05] PROBLEM - mailman list info on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:02:45] RECOVERY - snapshot of s4 in codfw on backupmon1001 is OK: Last snapshot for
s4 at codfw (db2239) taken on 2025-12-01 14:37:02 (1966 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [17:05:03] (03PS1) 10Andrew Bogott: keystone.conf: remove one last ref to wmtotp [puppet] - 10https://gerrit.wikimedia.org/r/1213534 (https://phabricator.wikimedia.org/T359590) [17:06:07] (03CR) 10Andrew Bogott: [C:03+2] keystone.conf: remove one last ref to wmtotp [puppet] - 10https://gerrit.wikimedia.org/r/1213534 (https://phabricator.wikimedia.org/T359590) (owner: 10Andrew Bogott) [17:06:14] 06SRE, 10SRE-tools, 06Infrastructure-Foundations, 07IPv6: Some Foundation clusters do not appear to support IPv6 - https://phabricator.wikimedia.org/T271136#11420281 (10Dzahn) epic !:) [17:07:05] (03PS2) 10JHathaway: iPXE MBR support [software/spicerack] - 10https://gerrit.wikimedia.org/r/1211268 (https://phabricator.wikimedia.org/T409286) [17:08:08] (03CR) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [17:08:44] !log hnowlan@deploy2002 Finished deploy [restbase/deploy@19cb647]: Add new wikis to restbase T408352 T408344 (duration: 16m 16s) [17:08:48] T408352: Add pcmwikiquote to RESTBase - https://phabricator.wikimedia.org/T408352 [17:08:48] T408344: Add minwikisource to RESTBase - https://phabricator.wikimedia.org/T408344 [17:09:03] RECOVERY - mailman list info ssl expiry on lists1004 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 03 Feb 2026 07:30:03 PM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:09:35] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? 
- https://phabricator.wikimedia.org/T411027#11420300 (10JKelsoteel-WMF) Hi @jhathaway, Noah's team is sending out a high-visibility/high-priority email to donors, and for th... [17:11:55] RECOVERY - mailman archives on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 55267 bytes in 0.074 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:11:55] RECOVERY - mailman list info on lists1004 is OK: HTTP OK: HTTP/1.1 200 OK - 9309 bytes in 0.183 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:12:57] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11420313 (10MoritzMuehlenhoff) @Jhancock.wm The broken disk is /dev/sda which per lshw has the serial 22353BB15C0C, does that help? I suppose these disks are hot-swappable? Then you can replace it anytime, I'... [17:17:16] !log "thanos-store: set cutoff days to 1" reverted on titan2002 (3/4) T410152 [17:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:19] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [17:19:45] (03CR) 10Brouberol: [C:03+1] Allow access to urldownloader to airflow-main/workers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213529 (https://phabricator.wikimedia.org/T410285) (owner: 10Aqu) [17:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:21:51] (03CR) 10CDanis: [C:03+2] zramswap: notify service on config change [puppet] - 10https://gerrit.wikimedia.org/r/1213521 (https://phabricator.wikimedia.org/T376813) (owner: 10CDanis) [17:24:03] (03PS1) 10CDanis: Revert "zramswap: notify service on config
change" [puppet] - 10https://gerrit.wikimedia.org/r/1213535 [17:24:13] (03CR) 10CDanis: [V:03+2 C:03+2] Revert "zramswap: notify service on config change" [puppet] - 10https://gerrit.wikimedia.org/r/1213535 (owner: 10CDanis) [17:24:29] 10ops-codfw, 06SRE, 06DC-Ops: Degraded RAID on bast2003 - https://phabricator.wikimedia.org/T410195#11420424 (10Jhancock.wm) @MoritzMuehlenhoff thanks for the help and correction. 22353BB15C0C has been replaced. They are hot-swappable. It's a personal preference that I check so I don't inadvertently muck so... [17:29:46] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11420456 (10JKelsoteel-WMF) @jhathaway I did create a group called "noreply@wikimedia.org" to see if messages to that address wer... [17:30:45] jouncebot: nowandnext [17:30:45] No deployments scheduled for the next 0 hour(s) and 29 minute(s) [17:30:46] In 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1800) [17:30:46] In 0 hour(s) and 29 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1800) [17:31:32] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1004.wikimedia.org with OS trixie [17:31:38] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bd808@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208478 (https://phabricator.wikimedia.org/T410702) (owner: 10BryanDavis) [17:32:26] (03Merged) 10jenkins-bot: labswiki: Enable sitenotice on mobile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208478 (https://phabricator.wikimedia.org/T410702) (owner: 10BryanDavis) [17:32:37] !log andrew@cumin2002 START - Cookbook sre.hosts.provision for host cloudweb1003.mgmt.eqiad.wmnet with chassis set policy
GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [17:32:47] !log bd808@deploy2002 Started scap sync-world: Backport for [[gerrit:1208478|labswiki: Enable sitenotice on mobile (T410702)]] [17:32:49] T410702: Enable sitenotice on mobile for Wikitech - https://phabricator.wikimedia.org/T410702 [17:34:34] (03PS1) 10Dduvall: buildkitd: Bump buildkit image to wmf-v0.26.2 [puppet] - 10https://gerrit.wikimedia.org/r/1213538 (https://phabricator.wikimedia.org/T410049) [17:34:37] (03CR) 10Urbanecm: "Plus QA, yes. I'm waiting on that." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1211810 (https://phabricator.wikimedia.org/T410970) (owner: 10Urbanecm) [17:34:41] !log bd808@deploy2002 bd808: Backport for [[gerrit:1208478|labswiki: Enable sitenotice on mobile (T410702)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [17:35:35] !log bd808@deploy2002 bd808: Continuing with sync [17:35:55] (03CR) 10Dduvall: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1213538 (https://phabricator.wikimedia.org/T410049) (owner: 10Dduvall) [17:35:59] PROBLEM - Host cloudweb1003 is DOWN: PING CRITICAL - Packet loss = 100% [17:36:13] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:36:15] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:39:16] !log "thanos-store: set cutoff days to 1" reverted on titan2001 (4/4) T410152 [17:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:19] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [17:39:36] !log bd808@deploy2002 Finished scap sync-world: 
Backport for [[gerrit:1208478|labswiki: Enable sitenotice on mobile (T410702)]] (duration: 06m 49s) [17:39:39] T410702: Enable sitenotice on mobile for Wikitech - https://phabricator.wikimedia.org/T410702 [17:39:43] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudweb1003.mgmt.eqiad.wmnet with chassis set policy GRACEFUL_RESTART and with Dell SCP reboot policy GRACEFUL [17:41:14] (03CR) 10Ahmon Dancy: buildkitd: Bump buildkit image to wmf-v0.26.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213538 (https://phabricator.wikimedia.org/T410049) (owner: 10Dduvall) [17:43:30] !log taavi@cumin1003 conftool action : set/pooled=inactive; selector: cluster=cloudweb,name=cloudweb1003.wikimedia.org [17:45:36] !log taavi@cumin1003 conftool action : set/pooled=no; selector: cluster=cloudweb,name=cloudweb1003.wikimedia.org [17:45:59] (03CR) 10Dduvall: buildkitd: Bump buildkit image to wmf-v0.26.2 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1213538 (https://phabricator.wikimedia.org/T410049) (owner: 10Dduvall) [17:47:39] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:48:21] ^ cloudweb maintenance related - afaict [17:48:55] PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:49:07] see discussion in -cloud [17:51:05] 06SRE, 06Infrastructure-Foundations, 10Mail: Emails to Google group no-reply@wikimedia.org are not being delivered - SMTP server issue? - https://phabricator.wikimedia.org/T411027#11420579 (10jhathaway) >>! In T411027#11420456, @JKelsoteel-WMF wrote: > @jhathaway I did create a group called "noreply@wikimedi... 
[17:52:22] (03CR) 10AOkoth: [C:03+2] admin: remove old key for aokoth [puppet] - 10https://gerrit.wikimedia.org/r/1211727 (owner: 10AOkoth) [17:52:24] (03PS1) 10Andrew Bogott: cloudweb: point striker at mcrouter, port 11213 [puppet] - 10https://gerrit.wikimedia.org/r/1213539 [17:54:19] (03CR) 10Majavah: [C:03+1] cloudweb: point striker at mcrouter, port 11213 [puppet] - 10https://gerrit.wikimedia.org/r/1213539 (owner: 10Andrew Bogott) [17:54:31] (03CR) 10Andrew Bogott: [C:03+2] cloudweb: point striker at mcrouter, port 11213 [puppet] - 10https://gerrit.wikimedia.org/r/1213539 (owner: 10Andrew Bogott) [17:56:45] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . [17:59:57] !log taavi@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1800) [18:00:05] ryankemper: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikidata Query Service weekly deploy deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T1800). 
[18:00:27] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:00:35] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:00:55] !log taavi@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad [18:01:16] !log taavi@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad [18:01:47] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [18:01:53] RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [18:02:13] !log taavi@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad [18:03:16] !log fceratto@deploy2002 helmfile [aux-k8s-eqiad] 'sync' command on namespace 'zarcillo' for release 'main' . 
[18:03:19] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:04:37] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [18:04:59] (03CR) 10Dzahn: [C:03+2] releases::mediawiki: change the time when jenkins is restarted [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [18:05:20] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudweb1003.wikimedia.org with OS trixie [18:09:47] 10SRE-Access-Requests: Update SSH key for kamila - https://phabricator.wikimedia.org/T411404 (10Raine) 03NEW [18:10:31] (03PS1) 10Kamila Součková: admin: update ssh key for kamila [puppet] - 10https://gerrit.wikimedia.org/r/1213540 (https://phabricator.wikimedia.org/T411404) [18:10:46] (03CR) 10Ahmon Dancy: [C:03+1] buildkitd: Bump buildkit image to wmf-v0.26.2 [puppet] - 10https://gerrit.wikimedia.org/r/1213538 (https://phabricator.wikimedia.org/T410049) (owner: 10Dduvall) [18:18:51] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [18:20:05] (03CR) 10Subramanya Sastry: [C:03+1] Deploy Parsoid Read Views to 19 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212670 (https://phabricator.wikimedia.org/T411283) (owner: 10Arlolra) [18:24:16] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [18:30:36] (03CR) 10Dzahn: [C:03+2] buildkitd: Bump buildkit image to wmf-v0.26.2 [puppet] - 10https://gerrit.wikimedia.org/r/1213538 (https://phabricator.wikimedia.org/T410049) (owner: 10Dduvall) [18:36:25] FIRING: 
SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:39:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212670 (https://phabricator.wikimedia.org/T411283) (owner: 10Arlolra) [18:41:39] (03CR) 10Ssingh: [V:03+2 C:03+2] Convert README to Markdown [puppet] - 10https://gerrit.wikimedia.org/r/1213496 (owner: 10Gergő Tisza) [18:44:06] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 (owner: 10Gergő Tisza) [18:44:39] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1003.wikimedia.org with OS trixie [18:45:40] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213498 (owner: 10Gergő Tisza) [18:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:55:49] (03Abandoned) 10Kamila Součková: service::catalog: update hcaptcha-proxy entry [puppet] - 10https://gerrit.wikimedia.org/r/1212179 (https://phabricator.wikimedia.org/T411097) (owner: 10Kamila Součková) [18:56:47] (03PS1) 10BryanDavis: striker: Bump container version [puppet] - 10https://gerrit.wikimedia.org/r/1213546 
(https://phabricator.wikimedia.org/T319500) [19:00:56] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003*} and A:liberica [19:03:41] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) rebooting P{lvs6003*} and A:liberica [19:04:29] (03PS2) 10Michael Große: Growth: Enable Revise Tone feature on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) [19:11:43] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003*} and A:liberica [19:14:28] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) rebooting P{lvs6003*} and A:liberica [19:15:57] (03CR) 10Michael Große: "With the adjusted description of T409606, I think this can actually move forward." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1208357 (https://phabricator.wikimedia.org/T409606) (owner: 10Michael Große) [19:18:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86252 and previous config saved to /var/cache/conftool/dbconfig/20251201-191812-marostegui.json [19:18:18] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:18:18] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:19:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:19:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:22:58] (03PS1) 10CDobbins: . [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [19:24:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:24:31] (03PS2) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [19:25:46] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003*} and A:liberica [19:26:52] (03PS1) 10Andrew Bogott: cloudweb: remove port from health check [puppet] - 10https://gerrit.wikimedia.org/r/1213550 (https://phabricator.wikimedia.org/T376277) [19:28:30] !log cdobbins@cumin2002 END (FAIL) - Cookbook sre.loadbalancer.admin (exit_code=1) rebooting P{lvs6003*} and A:liberica [19:28:53] (03CR) 10Andrew Bogott: [C:03+2] cloudweb: remove port from health check [puppet] - 10https://gerrit.wikimedia.org/r/1213550 (https://phabricator.wikimedia.org/T376277) (owner: 10Andrew Bogott) [19:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - 
https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:31:06] (03CR) 10CI reject: [V:04-1] sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 (owner: 10CDobbins) [19:33:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P86253 and previous config saved to /var/cache/conftool/dbconfig/20251201-193320-marostegui.json [19:33:25] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:35:57] (03PS3) 10JHathaway: ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) [19:37:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:39:35] (03CR) 10JHathaway: ipxe MBR support (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [19:39:38] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [19:41:58] (03CR) 10CI reject: [V:04-1] ipxe MBR support [cookbooks] - 10https://gerrit.wikimedia.org/r/1211269 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [19:42:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - 
https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:42:24] PROBLEM - PyBal IPVS diff check on lvs1019 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [19:42:25] (03CR) 10JHathaway: "@ltoscano@wikimedia.org I think this portion is ready to merge, but please let me know if you spot any errors. Thanks!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1211268 (https://phabricator.wikimedia.org/T409286) (owner: 10JHathaway) [19:43:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:45:50] (03PS1) 10Andrew Bogott: Revert "cloudweb: remove port from health check" [puppet] - 10https://gerrit.wikimedia.org/r/1213552 [19:46:16] (03PS3) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [19:46:22] (03CR) 10Andrew Bogott: [C:03+2] Revert "cloudweb: remove port from health check" [puppet] - 10https://gerrit.wikimedia.org/r/1213552 (owner: 10Andrew Bogott) [19:46:36] !log cdobbins@cumin2002 START - Cookbook sre.loadbalancer.admin rebooting P{lvs6003*} and A:liberica [19:48:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P86254 and previous config saved to /var/cache/conftool/dbconfig/20251201-194828-marostegui.json [19:49:27] !log cdobbins@cumin2002 END (PASS) - Cookbook sre.loadbalancer.admin (exit_code=0) rebooting P{lvs6003*} and 
A:liberica [19:50:38] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - labweb-ssl_7443: Servers cloudweb1004.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [19:53:29] (03PS4) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1213549 [19:53:56] (03PS1) 10Andrew Bogott: cloudweb: remove port from health check [puppet] - 10https://gerrit.wikimedia.org/r/1213556 (https://phabricator.wikimedia.org/T376277) [19:54:41] (03PS1) 10Urbanecm: Introduce HTML confirmation email [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213557 (https://phabricator.wikimedia.org/T396155) [19:55:41] (03PS1) 10Urbanecm: ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true [extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) [19:55:49] (03CR) 10Urbanecm: [C:03+2] Introduce HTML confirmation email [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213557 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [19:55:52] (03CR) 10Urbanecm: [C:03+2] ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true [extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [19:55:55] jouncebot: nowandnext [19:55:56] No deployments scheduled for the next 1 hour(s) and 4 minute(s) [19:55:56] In 1 hour(s) and 4 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T2100) [19:56:04] let's hope an hour is enough... 
[19:58:13] (03CR) 10Aaron Schulz: [C:03+1] REST: enable the site.v1 module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213515 (https://phabricator.wikimedia.org/T409516) (owner: 10BPirkle) [19:58:43] (03CR) 10Majavah: [C:03+1] cloudweb: remove port from health check [puppet] - 10https://gerrit.wikimedia.org/r/1213556 (https://phabricator.wikimedia.org/T376277) (owner: 10Andrew Bogott) [20:00:42] (03CR) 10Majavah: [C:03+2] cloudweb: remove port from health check [puppet] - 10https://gerrit.wikimedia.org/r/1213556 (https://phabricator.wikimedia.org/T376277) (owner: 10Andrew Bogott) [20:01:05] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213515 (https://phabricator.wikimedia.org/T409516) (owner: 10BPirkle) [20:02:20] !log updating envoyproxy from 1.29.x to 1.32.x on phabricator prod host [20:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86255 and previous config saved to /var/cache/conftool/dbconfig/20251201-200335-marostegui.json [20:03:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:03:42] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:03:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2178.codfw.wmnet with reason: Maintenance [20:04:00] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2178 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86256 and previous config saved to /var/cache/conftool/dbconfig/20251201-200359-marostegui.json [20:04:48] !log taavi@cumin1003 
START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad [20:06:50] 10ops-codfw, 06SRE, 06DC-Ops, 10fundraising-tech-ops: Q2:install (1) SSD each into franio200[1-3] - https://phabricator.wikimedia.org/T405982#11421101 (10Jhancock.wm) @Jgreen hey lost track of this task. I'm usually on site from 9am to 1pm local time (central) on almost every work day. is there a time th... [20:07:16] (03CR) 10CI reject: [V:04-1] ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true [extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [20:07:21] ... [20:07:23] come on [20:07:56] (03PS2) 10Urbanecm: ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true [extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) [20:08:01] !log upgrading envoyproxy on contint1002; phab1004; T405808 [20:08:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:08:04] T405808: Upgrade Envoy to v1.32.12 - https://phabricator.wikimedia.org/T405808 [20:08:08] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:08:08] (03CR) 10Urbanecm: "..."
[extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [20:08:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213557 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [20:08:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [20:08:38] !log taavi@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad [20:08:41] (03PS1) 10Ebernhardson: cirrus: Apply increased near match weight on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213559 (https://phabricator.wikimedia.org/T408154) [20:09:33] !log taavi@cumin1003 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-low-traffic-eqiad [20:09:58] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213559 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [20:10:04] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [20:10:10] RECOVERY - PyBal IPVS diff check on lvs1019 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [20:10:30] !log taavi@cumin1003 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-low-traffic-eqiad [20:10:58] (03Merged) 10jenkins-bot: Introduce HTML confirmation email [core] (wmf/1.46.0-wmf.4) - 
10https://gerrit.wikimedia.org/r/1213557 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [20:11:06] here we go [20:13:54] !log jhathaway@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on sretest2001.codfw.wmnet with reason: T383173 [20:13:57] T383173: Supermicro: UEFI HTTP boot request hangs on cold boot - https://phabricator.wikimedia.org/T383173 [20:15:58] (03PS1) 10Catrope: Make sure WebAuthnKey::$supportsPasswordless is always initialized [extensions/WebAuthn] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213562 (https://phabricator.wikimedia.org/T411368) [20:16:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [extensions/WebAuthn] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213562 (https://phabricator.wikimedia.org/T411368) (owner: 10Catrope) [20:20:33] (03Merged) 10jenkins-bot: ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true [extensions/GrowthExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213558 (https://phabricator.wikimedia.org/T396155) (owner: 10Urbanecm) [20:20:39] finally [20:20:55] !log urbanecm@deploy2002 Started scap sync-world: Backport for [[gerrit:1213557|Introduce HTML confirmation email (T396155)]], [[gerrit:1213558|ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true (T396155)]] [20:20:55] (03PS1) 10Dzahn: releases: fix names of parameters for auto_restart minute/hour [puppet] - 10https://gerrit.wikimedia.org/r/1213564 (https://phabricator.wikimedia.org/T410729) [20:20:57] T396155: Improve verification email - https://phabricator.wikimedia.org/T396155 [20:21:10] (03PS2) 10Dzahn: releases: fix names of parameters for auto_restart minute/hour [puppet] - 10https://gerrit.wikimedia.org/r/1213564 (https://phabricator.wikimedia.org/T410729) [20:21:14] (03CR) 10CI reject: [V:04-1] releases: fix 
names of parameters for auto_restart minute/hour [puppet] - 10https://gerrit.wikimedia.org/r/1213564 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [20:21:25] (03CR) 10Dzahn: [C:03+2] releases: fix names of parameters for auto_restart minute/hour [puppet] - 10https://gerrit.wikimedia.org/r/1213564 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [20:22:26] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1213564" [puppet] - 10https://gerrit.wikimedia.org/r/1208406 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [20:26:46] !log jhathaway@cumin2002 START - Cookbook sre.hosts.provision for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:37:05] !log jhathaway@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host sretest2001.mgmt.codfw.wmnet with chassis set policy GRACEFUL_RESTART [20:37:19] 10SRE-SLO: Sloth: adapt default month view to quarter view (pilot) - https://phabricator.wikimedia.org/T409312#11421221 (10herron) Made a couple more adjustments to the dashboard to clean up the rolling window portion ` * Updated fiscal year start month to July * Rolling window: * Update panel options to... [20:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:40:45] i might be a little late for the backport window, but my deploy should be fast. 
[20:43:10] (03PS1) 10Dzahn: releases: time parameters for jenkins restart need to be strings [puppet] - 10https://gerrit.wikimedia.org/r/1213566 (https://phabricator.wikimedia.org/T410729) [20:44:24] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:1213557|Introduce HTML confirmation email (T396155)]], [[gerrit:1213558|ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true (T396155)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [20:44:27] T396155: Improve verification email - https://phabricator.wikimedia.org/T396155 [20:44:42] !log urbanecm@deploy2002 urbanecm: Continuing with sync [20:45:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:46:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:46:31] (03CR) 10Dzahn: [C:03+2] releases: time parameters for jenkins restart need to be strings [puppet] - 10https://gerrit.wikimedia.org/r/1213566 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [20:46:36] (03PS2) 10Dzahn: releases: time parameters for jenkins restart need to be strings [puppet] - 10https://gerrit.wikimedia.org/r/1213566 (https://phabricator.wikimedia.org/T410729) [20:50:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high 
(mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:51:19] !log prometheus100[78] grow /dev/vg0/prometheus-k8s-dse filesystems [20:51:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [20:54:10] (03CR) 10Dzahn: [C:03+2] releases: time parameters for jenkins restart need to be strings [puppet] - 10https://gerrit.wikimedia.org/r/1213566 (https://phabricator.wikimedia.org/T410729) (owner: 10Dzahn) [20:57:04] !log urbanecm@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213557|Introduce HTML confirmation email (T396155)]], [[gerrit:1213558|ConfirmEmailHooks: Do not run when UserEmailConfirmationUseHTML is true (T396155)]] (duration: 36m 09s) [20:57:07] T396155: Improve verification email - https://phabricator.wikimedia.org/T396155 [21:00:05] RoanKattouw, Urbanecm, TheresNoTime, kindrobot, and cjming: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T2100). [21:00:05] cscott, tgr, bpirkle, ebernhardson, and RoanKattouw: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:11] \o [21:00:18] \o [21:01:12] 10SRE-SLO: Evaluate Sloth as a possible replacement for Pyrra - https://phabricator.wikimedia.org/T404171#11421276 (10herron) [21:03:17] o/ [21:06:00] cscott: around? [21:06:04] yep [21:06:10] sorry i'm a bit late [21:06:23] no worries, are you self-deploying? [21:07:09] sure, i can do that. does anyone want to deploy their config change at the same time? [21:07:25] turning on parsoid read views is low risk these days [21:07:43] you can deploy mine, it's a noop [21:07:50] Mine is low risk as well [21:08:13] i can do all four? 1212670, 1213497, 1213498, 1213515 [21:08:17] (03PS1) 10Bvibber: StickyHeaders: fix Minerva list styling for "peeking" bullet points [extensions/ReaderExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213570 (https://phabricator.wikimedia.org/T409325) [21:08:37] mine is also low risk, it's already deployed at 50%, this just switches to 100% [21:08:50] let's do them all it will be a party [21:08:53] sure [21:10:23] warning for change 1213498 about depends-on [21:10:45] the dependency is merged [21:11:19] but I guess scap doesn't know that puppet uses the production branch, not master? 
[21:11:20] i've also got a small css change in ReaderExperiments: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/ReaderExperiments/+/1213570 [21:11:26] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212670 (https://phabricator.wikimedia.org/T411283) (owner: 10Arlolra) [21:11:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 (owner: 10Gergő Tisza) [21:11:27] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213498 (owner: 10Gergő Tisza) [21:11:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213515 (https://phabricator.wikimedia.org/T409516) (owner: 10BPirkle) [21:11:28] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cscott@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213559 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [21:11:29] if it doesn't make it in this hour i'll schedule it tomorrow :) [21:11:30] it's a very soft dependency in any case [21:11:43] bvibber: not a config change, alas, or i would throw it in. [21:11:49] no worries! 
[21:11:53] but we're doing all the rest at once so there should be plenty of time for yours [21:12:00] sweet [21:12:22] RoanKattouw: has a non-config backport too [21:12:26] (03Merged) 10jenkins-bot: Deploy Parsoid Read Views to 19 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212670 (https://phabricator.wikimedia.org/T411283) (owner: 10Arlolra) [21:12:29] (03Merged) 10jenkins-bot: Change the README to Markdown [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 (owner: 10Gergő Tisza) [21:12:31] (03Merged) 10jenkins-bot: noc: Point links in /conf to Gitiles rather than Differential [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213498 (owner: 10Gergő Tisza) [21:12:40] (03Merged) 10jenkins-bot: REST: enable the site.v1 module [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213515 (https://phabricator.wikimedia.org/T409516) (owner: 10BPirkle) [21:12:42] (03Merged) 10jenkins-bot: cirrus: Apply increased near match weight on commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213559 (https://phabricator.wikimedia.org/T408154) (owner: 10Ebernhardson) [21:13:00] !log cscott@deploy2002 Started scap sync-world: Backport for [[gerrit:1212670|Deploy Parsoid Read Views to 19 wikis (T411283)]], [[gerrit:1213497|Change the README to Markdown]], [[gerrit:1213498|noc: Point links in /conf to Gitiles rather than Differential]], [[gerrit:1213515|REST: enable the site.v1 module (T409516)]], [[gerrit:1213559|cirrus: Apply increased near match weight on commonswiki (T408154)]] [21:13:09] T411283: Parsoid Read Views to deploy ~2025-12-01 - https://phabricator.wikimedia.org/T411283 [21:13:09] T409516: Create Sitemap API Module - https://phabricator.wikimedia.org/T409516 [21:13:09] T408154: AB Test doubling near match field weights on commonswiki - https://phabricator.wikimedia.org/T408154 [21:14:03] (03CR) 10Brouberol: [C:03+2] Allow access to urldownloader to airflow-main/workers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1213529 (https://phabricator.wikimedia.org/T410285) (owner: 10Aqu) [21:14:42] (03PS1) 10Aleksandar Mastilovic: Add GRANT MODIFYs to aqsloader for two new pageviews tables [puppet] - 10https://gerrit.wikimedia.org/r/1213571 (https://phabricator.wikimedia.org/T410962) [21:15:20] (03CR) 10Aleksandar Mastilovic: "I think we're missing these GRANT MODIFYs" [puppet] - 10https://gerrit.wikimedia.org/r/1213571 (https://phabricator.wikimedia.org/T410962) (owner: 10Aleksandar Mastilovic) [21:16:12] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [21:16:29] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [21:16:57] !log cscott@deploy2002 cscott, ebernhardson, tgr, arlolra, bpirkle: Backport for [[gerrit:1212670|Deploy Parsoid Read Views to 19 wikis (T411283)]], [[gerrit:1213497|Change the README to Markdown]], [[gerrit:1213498|noc: Point links in /conf to Gitiles rather than Differential]], [[gerrit:1213515|REST: enable the site.v1 module (T409516)]], [[gerrit:1213559|cirrus: Apply increased near match weight on commonswiki (T408154 [21:16:57] )]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [21:17:32] ok it's on the test servers, test & let me know if ok to proceed, [21:17:50] mine is good [21:17:55] mine looks good [21:19:11] tgr is there anything to test? 
[21:20:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:20:53] tgr's is a no-op, i'm continuing the scap [21:21:02] !log cscott@deploy2002 cscott, ebernhardson, tgr, arlolra, bpirkle: Continuing with sync [21:21:18] cscott: no need, thanks [21:25:09] !log cscott@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212670|Deploy Parsoid Read Views to 19 wikis (T411283)]], [[gerrit:1213497|Change the README to Markdown]], [[gerrit:1213498|noc: Point links in /conf to Gitiles rather than Differential]], [[gerrit:1213515|REST: enable the site.v1 module (T409516)]], [[gerrit:1213559|cirrus: Apply increased near match weight on commonswiki (T408154)]] (duration: 12m [21:25:10] 09s) [21:25:15] T411283: Parsoid Read Views to deploy ~2025-12-01 - https://phabricator.wikimedia.org/T411283 [21:25:15] T409516: Create Sitemap API Module - https://phabricator.wikimedia.org/T409516 [21:25:16] T408154: AB Test doubling near match field weights on commonswiki - https://phabricator.wikimedia.org/T408154 [21:25:45] (03CR) 10Eric Gardner: [C:03+1] StickyHeaders: fix Minerva list styling for "peeking" bullet points [extensions/ReaderExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213570 (https://phabricator.wikimedia.org/T409325) (owner: 10Bvibber) [21:26:31] ok, done. On to RoanKattouw or bvibber, whoever wants the torch next [21:26:35] Thank you for deploying! [21:26:40] woot! 
[21:26:44] i defer to RoanKattouw if present [21:26:59] I don't think I've seen RoanKattouw on-line yet, so if you're ready bvibber i'd say go ahead [21:27:04] then i'll go ahead with my own :D [21:27:54] (03CR) 10TrainBranchBot: [C:03+2] "Approved by bvibber@deploy2002 using scap backport" [extensions/ReaderExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213570 (https://phabricator.wikimedia.org/T409325) (owner: 10Bvibber) [21:28:34] (03CR) 10Gergő Tisza: "Apparently Gitiles doesn't understand the Markdown syntax for definition lists:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213497 (owner: 10Gergő Tisza) [21:28:52] (03Merged) 10jenkins-bot: StickyHeaders: fix Minerva list styling for "peeking" bullet points [extensions/ReaderExperiments] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1213570 (https://phabricator.wikimedia.org/T409325) (owner: 10Bvibber) [21:29:13] !log bvibber@deploy2002 Started scap sync-world: Backport for [[gerrit:1213570|StickyHeaders: fix Minerva list styling for "peeking" bullet points (T409325)]] [21:29:16] T409325: StickyHeaders: Bug Bash IV: Revenge of the Son of Bug Bash (UX/UI) - https://phabricator.wikimedia.org/T409325 [21:29:26] !log dzahn@cumin2002 START - Cookbook sre.ganeti.makevm for new host planet1004.eqiad.wmnet [21:29:28] !log dzahn@cumin2002 START - Cookbook sre.dns.netbox [21:31:02] !log bvibber@deploy2002 bvibber: Backport for [[gerrit:1213570|StickyHeaders: fix Minerva list styling for "peeking" bullet points (T409325)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[21:31:47] (03PS1) 10Dzahn: site: add planet hosts with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1213575 [21:32:08] (03CR) 10Dzahn: [C:03+2] site: add planet hosts with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1213575 (owner: 10Dzahn) [21:32:19] !log bvibber@deploy2002 bvibber: Continuing with sync [21:32:21] bvibber: you're winning the prize for most exciting phab task name [21:33:57] hehe [21:34:01] we have fun at readers growth [21:35:09] dzahn@cumin2002 makevm (PID 781740) is awaiting input [21:36:21] !log bvibber@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213570|StickyHeaders: fix Minerva list styling for "peeking" bullet points (T409325)]] (duration: 07m 08s) [21:36:24] T409325: StickyHeaders: Bug Bash IV: Revenge of the Son of Bug Bash (UX/UI) - https://phabricator.wikimedia.org/T409325 [21:36:40] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet1004.eqiad.wmnet - dzahn@cumin2002" [21:36:45] !log dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM planet1004.eqiad.wmnet - dzahn@cumin2002" [21:36:46] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:36:46] !log dzahn@cumin2002 START - Cookbook sre.dns.wipe-cache planet1004.eqiad.wmnet on all recursors [21:36:49] !log dzahn@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) planet1004.eqiad.wmnet on all recursors [21:37:01] (03PS2) 10Dzahn: Revert "site: move zuul2002 to insetup role temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/1208443 [21:37:14] ok all done with mine [21:37:20] !log dzahn@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet1004.eqiad.wmnet - dzahn@cumin2002" [21:37:25] !log 
dzahn@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM planet1004.eqiad.wmnet - dzahn@cumin2002" [21:38:46] and confirmed in safari that it's working <3 [21:40:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86258 and previous config saved to /var/cache/conftool/dbconfig/20251201-214021-marostegui.json [21:40:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:40:26] dzahn@cumin2002 makevm (PID 781740) is awaiting input [21:40:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:41:36] RoanKattouw if you're around the torch is passed to you [21:41:58] (03CR) 10Dzahn: [C:03+2] Revert "site: move zuul2002 to insetup role temporarily" [puppet] - 10https://gerrit.wikimedia.org/r/1208443 (owner: 10Dzahn) [21:42:31] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host planet1004.eqiad.wmnet with OS trixie [21:42:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86259 and previous config saved to /var/cache/conftool/dbconfig/20251201-214247-marostegui.json [21:52:16] (03PS15) 10JHathaway: Add the sre.hosts.powercycle cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [21:52:46] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on planet1004.eqiad.wmnet with reason: host reimage [21:54:40] (03CR) 10JHathaway: Add the sre.hosts.powercycle cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1198928 (owner: 10Elukey) [21:55:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to 
https://phabricator.wikimedia.org/P86260 and previous config saved to /var/cache/conftool/dbconfig/20251201-215529-marostegui.json [21:57:44] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on planet1004.eqiad.wmnet with reason: host reimage [21:57:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86261 and previous config saved to /var/cache/conftool/dbconfig/20251201-215754-marostegui.json [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor My software never has bugs. It just develops random features. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251201T2200). [22:00:30] PROBLEM - Thanos swift https on thanos-fe1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [22:01:08] PROBLEM - Thanos swift https on thanos-fe1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Thanos [22:02:58] RECOVERY - Thanos swift https on thanos-fe1007 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/Thanos [22:03:20] RECOVERY - Thanos swift https on thanos-fe1004 is OK: HTTP OK: HTTP/1.1 200 OK - 279 bytes in 0.081 second response time https://wikitech.wikimedia.org/wiki/Thanos [22:08:06] (03PS1) 10Bking: opensearch on k8s: add DC-specific records for opensearch-ipoid [dns] - 10https://gerrit.wikimedia.org/r/1213580 (https://phabricator.wikimedia.org/T410956) [22:10:37] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P86262 and previous config saved to /var/cache/conftool/dbconfig/20251201-221036-marostegui.json [22:11:14] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host planet1004.eqiad.wmnet with OS trixie 
[22:11:14] !log dzahn@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host planet1004.eqiad.wmnet [22:13:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P86263 and previous config saved to /var/cache/conftool/dbconfig/20251201-221302-marostegui.json [22:14:40] (03PS8) 10JHathaway: UEFI: dup partition on MD RAID boxes [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) [22:17:47] 06SRE, 06collaboration-services, 05PES1.3.3 WP25 Easter Eggs: Request: Wikipedia 25 microsite hosting - https://phabricator.wikimedia.org/T408592#11421593 (10Dzahn) [22:18:22] (03CR) 10JHathaway: UEFI: dup partition on MD RAID boxes (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1205197 (https://phabricator.wikimedia.org/T376949) (owner: 10JHathaway) [22:20:47] !log dzahn@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:15:00 on zuul2002.codfw.wmnet with reason: reboot [22:25:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86264 and previous config saved to /var/cache/conftool/dbconfig/20251201-222544-marostegui.json [22:25:49] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:25:49] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:26:00] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance [22:26:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1166 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86265 and previous config saved to /var/cache/conftool/dbconfig/20251201-222607-marostegui.json [22:26:41] preparing to start the security deploy [22:28:10] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86266 and previous config saved to /var/cache/conftool/dbconfig/20251201-222810-marostegui.json [22:28:26] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2201.codfw.wmnet with reason: Maintenance [22:30:08] Irssi 1.4.5 (20231003) - http://www.irssi.org [22:30:16] oops. sorry. [22:33:24] about to run scap [22:33:29] (03PS1) 10Dzahn: zuul: revert zuul2002 to use ferm with docker [puppet] - 10https://gerrit.wikimedia.org/r/1213582 (https://phabricator.wikimedia.org/T410756) [22:33:49] (03CR) 10Dzahn: [C:03+2] zuul: revert zuul2002 to use ferm with docker [puppet] - 10https://gerrit.wikimedia.org/r/1213582 (https://phabricator.wikimedia.org/T410756) (owner: 10Dzahn) [22:36:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:40:46] !log mstyles Deployed security patch for T411144 [22:41:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:41:26] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:43:41] (03PS1) 10Mstyles: OATHAuth: Expand 2FA to all users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1213585 [22:46:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:46:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [22:46:42] (03PS1) 10Bking: opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) [22:48:20] (03PS2) 10Bking: opensearch-ipoid-test: Add environment-specific values files for TLS/ingress [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213586 (https://phabricator.wikimedia.org/T410956) [22:50:17] cscott: Sorry for being late, I'll deploy my patch after maryum is done [22:50:47] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/opensearch-ipoid-test: apply [22:54:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[22:54:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[22:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:59:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[22:59:26] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh
[23:00:39] (PS1) Dzahn: admin/releases: deprecate shell user group releasers-mwcli [puppet] - https://gerrit.wikimedia.org/r/1213587
[23:01:41] (PS1) Jasmine: admin: Add jasmine FIDO ssh key [puppet] - https://gerrit.wikimedia.org/r/1213588
[23:01:49] (PS2) Dzahn: admin/releases: deprecate shell user group releasers-mwcli [puppet] - https://gerrit.wikimedia.org/r/1213587
[23:04:31] (CR) Dzahn: "This would be fine as long as there are no more direct, manual, uploads to releases.wikimedia.org in the future." [puppet] - https://gerrit.wikimedia.org/r/1213587 (owner: Dzahn)
[23:09:05] (PS1) Cwhite: logstash: drop netdev logspam [puppet] - https://gerrit.wikimedia.org/r/1213589 (https://phabricator.wikimedia.org/T390215)
[23:09:18] (PS1) JHathaway: firewall: remove includes [puppet] - https://gerrit.wikimedia.org/r/1213590
[23:09:50] (PS2) JHathaway: firewall: remove includes [puppet] - https://gerrit.wikimedia.org/r/1213590 (https://phabricator.wikimedia.org/T411089)
[23:11:06] (CR) JHathaway: firewall: Use virtual resources to fix ordering issues (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1211651 (https://phabricator.wikimedia.org/T411089) (owner: Majavah)
[23:11:35] (CR) Cwhite: [C:+2] logstash: drop netdev logspam [puppet] - https://gerrit.wikimedia.org/r/1213589 (https://phabricator.wikimedia.org/T390215) (owner: Cwhite)
[23:19:50] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T410589)', diff saved to https://phabricator.wikimedia.org/P86267 and previous config saved to /var/cache/conftool/dbconfig/20251201-231949-ladsgroup.json
[23:19:52] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589
[23:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[23:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[23:33:40] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:34:53] (CR) TrainBranchBot: [C:+2] "Approved by catrope@deploy2002 using scap backport" [extensions/WebAuthn] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1213562 (https://phabricator.wikimedia.org/T411368) (owner: Catrope)
[23:34:57] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86268 and previous config saved to /var/cache/conftool/dbconfig/20251201-233456-ladsgroup.json
[23:35:17] (PS2) Jasmine: admin: Add jasmine FIDO ssh key [puppet] - https://gerrit.wikimedia.org/r/1213588
[23:37:42] (Merged) jenkins-bot: Make sure WebAuthnKey::$supportsPasswordless is always initialized [extensions/WebAuthn] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1213562 (https://phabricator.wikimedia.org/T411368) (owner: Catrope)
[23:38:00] !log catrope@deploy2002 Started scap sync-world: Backport for [[gerrit:1213562|Make sure WebAuthnKey::$supportsPasswordless is always initialized (T411368)]]
[23:38:03] T411368: Error: Typed property MediaWiki\Extension\WebAuthn\Key\WebAuthnKey::$supportsPasswordless must not be accessed before initialization - https://phabricator.wikimedia.org/T411368
[23:39:54] !log catrope@deploy2002 catrope: Backport for [[gerrit:1213562|Make sure WebAuthnKey::$supportsPasswordless is always initialized (T411368)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[23:41:35] !log catrope@deploy2002 catrope: Continuing with sync
[23:43:12] (CR) Brennen Bearnes: "Seems like a good idea." [puppet] - https://gerrit.wikimedia.org/r/1213587 (owner: Dzahn)
[23:44:04] SRE-Access-Requests, LDAP-Access-Requests: Grant Access to analytics-privatedata-users for Silvia G - https://phabricator.wikimedia.org/T411436#11421993 (Novem_Linguae)
[23:45:36] !log catrope@deploy2002 Finished scap sync-world: Backport for [[gerrit:1213562|Make sure WebAuthnKey::$supportsPasswordless is always initialized (T411368)]] (duration: 07m 36s)
[23:45:39] T411368: Error: Typed property MediaWiki\Extension\WebAuthn\Key\WebAuthnKey::$supportsPasswordless must not be accessed before initialization - https://phabricator.wikimedia.org/T411368
[23:49:13] (CR) Cwhite: [C:+1] "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/1212529 (owner: Muehlenhoff)
[23:50:05] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86269 and previous config saved to /var/cache/conftool/dbconfig/20251201-235004-ladsgroup.json
[23:51:12] PROBLEM - mailman list info ssl expiry on lists1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring