[00:09:10] (PS1) E75ti: jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 [00:10:44] (CR) CI reject: [V:-1] jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 (owner: E75ti) [00:14:28] (PS2) E75ti: jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 [00:15:59] (CR) CI reject: [V:-1] jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 (owner: E75ti) [00:22:43] (PS3) E75ti: jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 [00:24:12] (CR) CI reject: [V:-1] jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 (owner: E75ti) [00:30:10] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:39:47] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213074 [00:39:47] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213074 (owner: TrainBranchBot) [00:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:41:39] (CR) E75ti: [C:-1] "Sorry about that, my bad. 
Schemas need to be fully upgraded. This is wrong. Abandoning until sometime in the future." [homer/public] - https://gerrit.wikimedia.org/r/1213072 (owner: E75ti) [00:41:47] (Abandoned) E75ti: jsonschema: FormatChecker.cls_checks is depreceated [homer/public] - https://gerrit.wikimedia.org/r/1213072 (owner: E75ti) [00:51:47] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1213074 (owner: TrainBranchBot) [00:56:34] (PS1) E75ti: Homer: add WIP parallelization [software/homer] - https://gerrit.wikimedia.org/r/1213075 [01:00:53] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:09:48] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213076 [01:09:48] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213076 (owner: TrainBranchBot) [01:13:16] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 12m 22s) [01:33:34] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1213076 (owner: TrainBranchBot) [02:02:46] PROBLEM - snapshot of s4 in codfw on backupmon1001 is CRITICAL: snapshot for s4 at codfw (db2239) taken more than 3 days ago: Most recent backup 2025-11-27 01:55:26 https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [02:55:10] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [03:02:28] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T410589)', diff saved to https://phabricator.wikimedia.org/P86151 and previous config saved to 
/var/cache/conftool/dbconfig/20251130-030228-ladsgroup.json [03:02:35] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [03:15:10] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [03:17:37] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P86152 and previous config saved to /var/cache/conftool/dbconfig/20251130-031735-ladsgroup.json [03:22:17] FIRING: [14x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:27:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:29:59] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [03:30:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:32:45] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259', diff saved to https://phabricator.wikimedia.org/P86153 and previous config saved to /var/cache/conftool/dbconfig/20251130-033244-ladsgroup.json [03:39:02] FIRING: 
RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2021:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:44:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:47:52] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1259 (T410589)', diff saved to https://phabricator.wikimedia.org/P86154 and previous config saved to /var/cache/conftool/dbconfig/20251130-034752-ladsgroup.json [03:47:58] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [03:48:08] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [03:49:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [03:59:02] FIRING: [5x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:30:10] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:34:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:54:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:57:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:04:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:09:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:09:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:12:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:14:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:14:59] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:18:42] PROBLEM - snapshot of s3 in codfw on backupmon1001 is CRITICAL: snapshot for s3 at codfw (db2239) taken more than 3 days ago: Most recent backup 2025-11-27 04:56:09 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [05:22:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:24:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:29:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:34:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:37:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:39:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:42:17] FIRING: [18x] 
ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:44:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:47:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:50:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:51:32] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2008 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:52:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:52:26] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2008 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:54:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - 
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:57:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:58:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:59:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:02:17] FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:03:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:04:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:06:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:07:17] FIRING: [13x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:09:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:11:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:12:17] FIRING: [13x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:17:17] FIRING: [15x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:22:17] FIRING: [14x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) 
- https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:24:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:27:17] FIRING: [9x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:29:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2010:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:32:17] FIRING: [9x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:37:17] FIRING: [9x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:42:17] FIRING: [6x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:47:17] RESOLVED: [5x] ProbeDown: Service wdqs2011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:48:17] FIRING: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:52:32] RESOLVED: [2x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:53:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2141.codfw.wmnet with reason: Maintenance [06:55:10] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:55:19] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1159.eqiad.wmnet with reason: Maintenance [06:55:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86155 and previous config saved to /var/cache/conftool/dbconfig/20251130-065526-marostegui.json [06:55:33] T411163: Drop ar_sha1 from archive 
table in wmf production - https://phabricator.wikimedia.org/T411163 [06:55:33] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [06:57:58] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2145.codfw.wmnet with reason: Maintenance [06:58:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2145 (T410531)', diff saved to https://phabricator.wikimedia.org/P86156 and previous config saved to /var/cache/conftool/dbconfig/20251130-065805-marostegui.json [06:58:12] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:04:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T410531)', diff saved to https://phabricator.wikimedia.org/P86157 and previous config saved to /var/cache/conftool/dbconfig/20251130-070430-marostegui.json [07:04:36] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:15:10] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [07:19:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P86158 and previous config saved to /var/cache/conftool/dbconfig/20251130-071938-marostegui.json [07:30:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:34:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to 
https://phabricator.wikimedia.org/P86159 and previous config saved to /var/cache/conftool/dbconfig/20251130-073445-marostegui.json [07:49:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T410531)', diff saved to https://phabricator.wikimedia.org/P86160 and previous config saved to /var/cache/conftool/dbconfig/20251130-074953-marostegui.json [07:49:59] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:50:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2146.codfw.wmnet with reason: Maintenance [07:50:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2146 (T410531)', diff saved to https://phabricator.wikimedia.org/P86161 and previous config saved to /var/cache/conftool/dbconfig/20251130-075017-marostegui.json [07:56:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T410531)', diff saved to https://phabricator.wikimedia.org/P86162 and previous config saved to /var/cache/conftool/dbconfig/20251130-075642-marostegui.json [07:56:48] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251130T0800) [08:11:51] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P86163 and previous config saved to /var/cache/conftool/dbconfig/20251130-081150-marostegui.json [08:26:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146', diff saved to https://phabricator.wikimedia.org/P86164 and previous config saved to /var/cache/conftool/dbconfig/20251130-082657-marostegui.json [08:30:10] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:42:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2146 (T410531)', diff saved to https://phabricator.wikimedia.org/P86165 and previous config saved to /var/cache/conftool/dbconfig/20251130-084205-marostegui.json [08:42:12] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:42:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2153.codfw.wmnet with reason: Maintenance [08:42:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2153 (T410531)', diff saved to https://phabricator.wikimedia.org/P86166 and previous config saved to /var/cache/conftool/dbconfig/20251130-084229-marostegui.json [08:48:52] 
!log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T410531)', diff saved to https://phabricator.wikimedia.org/P86167 and previous config saved to /var/cache/conftool/dbconfig/20251130-084851-marostegui.json [08:48:57] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:03:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P86168 and previous config saved to /var/cache/conftool/dbconfig/20251130-090358-marostegui.json [09:19:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153', diff saved to https://phabricator.wikimedia.org/P86169 and previous config saved to /var/cache/conftool/dbconfig/20251130-091906-marostegui.json [09:34:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2153 (T410531)', diff saved to https://phabricator.wikimedia.org/P86170 and previous config saved to /var/cache/conftool/dbconfig/20251130-093414-marostegui.json [09:34:21] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:34:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2170.codfw.wmnet with reason: Maintenance [09:34:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2170 (T410531)', diff saved to https://phabricator.wikimedia.org/P86171 and previous config saved to /var/cache/conftool/dbconfig/20251130-093438-marostegui.json [09:40:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T410531)', diff saved to https://phabricator.wikimedia.org/P86172 and previous config saved to /var/cache/conftool/dbconfig/20251130-094058-marostegui.json [09:41:04] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:41:29] (PS2) Daniel Kinzler: 
rest-gateway: add prefix to all user IDs [deployment-charts] - https://gerrit.wikimedia.org/r/1212239 [09:53:30] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:53:52] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [09:54:10] FIRING: [2x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqdfw:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [09:55:39] FIRING: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [09:56:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P86173 and previous config saved to /var/cache/conftool/dbconfig/20251130-095605-marostegui.json [09:59:10] FIRING: [3x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:00:39] RESOLVED: CoreBGPDown: Core BGP session down between cr2-magru and cr2-eqdfw (2a02:ec80:700:fe0b::1) - group Confed_codfw - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - 
https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=magru&var-device=cr2-magru:9804&var-bgp_group=Confed_codfw&var-bgp_neighbor=cr2-eqdfw - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [10:02:30] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:02:52] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:04:10] RESOLVED: [4x] BFDdown: BFD session down between cr2-eqdfw and 195.200.68.153 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [10:07:52] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:08:52] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:11:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170', diff saved to https://phabricator.wikimedia.org/P86174 and previous config saved to /var/cache/conftool/dbconfig/20251130-101113-marostegui.json [10:12:30] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:14:54] PROBLEM - OSPF status on cr2-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:16:52] RECOVERY - OSPF status on cr2-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:17:30] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [10:21:33] !log marostegui@cumin1003 
dbctl commit (dc=all): 'Repooling after maintenance db1159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86175 and previous config saved to /var/cache/conftool/dbconfig/20251130-102132-marostegui.json [10:21:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [10:21:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [10:26:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2170 (T410531)', diff saved to https://phabricator.wikimedia.org/P86176 and previous config saved to /var/cache/conftool/dbconfig/20251130-102620-marostegui.json [10:26:27] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:26:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2173.codfw.wmnet with reason: Maintenance [10:26:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2173 (T410531)', diff saved to https://phabricator.wikimedia.org/P86177 and previous config saved to /var/cache/conftool/dbconfig/20251130-102644-marostegui.json [10:33:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T410531)', diff saved to https://phabricator.wikimedia.org/P86178 and previous config saved to /var/cache/conftool/dbconfig/20251130-103311-marostegui.json [10:33:17] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:36:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P86179 and previous config saved to /var/cache/conftool/dbconfig/20251130-103640-marostegui.json [10:48:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P86180 and previous config saved to 
/var/cache/conftool/dbconfig/20251130-104818-marostegui.json [10:51:48] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159', diff saved to https://phabricator.wikimedia.org/P86181 and previous config saved to /var/cache/conftool/dbconfig/20251130-105147-marostegui.json [10:55:10] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:03:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173', diff saved to https://phabricator.wikimedia.org/P86182 and previous config saved to /var/cache/conftool/dbconfig/20251130-110326-marostegui.json [11:06:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1159 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86183 and previous config saved to /var/cache/conftool/dbconfig/20251130-110655-marostegui.json [11:07:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [11:07:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [11:07:12] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1161.eqiad.wmnet with reason: Maintenance [11:07:31] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1016,1020].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:07:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1161 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86184 and previous config saved to /var/cache/conftool/dbconfig/20251130-110739-marostegui.json [11:15:10] FIRING: [6x] CalicoHighMemoryUsage: Calico container 
calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:18:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2173 (T410531)', diff saved to https://phabricator.wikimedia.org/P86185 and previous config saved to /var/cache/conftool/dbconfig/20251130-111833-marostegui.json [11:18:40] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:18:50] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2174.codfw.wmnet with reason: Maintenance [11:18:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2174 (T410531)', diff saved to https://phabricator.wikimedia.org/P86186 and previous config saved to /var/cache/conftool/dbconfig/20251130-111857-marostegui.json [11:19:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:25:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T410531)', diff saved to https://phabricator.wikimedia.org/P86187 and previous config saved to /var/cache/conftool/dbconfig/20251130-112523-marostegui.json [11:25:30] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:29:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [11:30:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is 
about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:40:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P86188 and previous config saved to /var/cache/conftool/dbconfig/20251130-114031-marostegui.json [11:55:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174', diff saved to https://phabricator.wikimedia.org/P86189 and previous config saved to /var/cache/conftool/dbconfig/20251130-115539-marostegui.json [12:10:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2174 (T410531)', diff saved to https://phabricator.wikimedia.org/P86190 and previous config saved to /var/cache/conftool/dbconfig/20251130-121046-marostegui.json [12:10:53] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:11:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2176.codfw.wmnet with reason: Maintenance [12:11:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2176 (T410531)', diff saved to https://phabricator.wikimedia.org/P86191 and previous config saved to /var/cache/conftool/dbconfig/20251130-121110-marostegui.json [12:17:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T410531)', diff saved to https://phabricator.wikimedia.org/P86192 and previous config saved to /var/cache/conftool/dbconfig/20251130-121734-marostegui.json [12:17:41] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:30:10] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:32:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P86193 and previous config saved to /var/cache/conftool/dbconfig/20251130-123242-marostegui.json [12:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:47:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P86194 and previous config saved to /var/cache/conftool/dbconfig/20251130-124750-marostegui.json [13:02:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T410531)', diff saved to https://phabricator.wikimedia.org/P86195 and previous config saved to /var/cache/conftool/dbconfig/20251130-130257-marostegui.json [13:03:04] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:03:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2188.codfw.wmnet with reason: Maintenance [13:03:22] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2188 (T410531)', diff saved to https://phabricator.wikimedia.org/P86196 and previous config saved to /var/cache/conftool/dbconfig/20251130-130321-marostegui.json [13:09:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T410531)', diff saved to https://phabricator.wikimedia.org/P86197 and previous config saved to /var/cache/conftool/dbconfig/20251130-130913-marostegui.json [13:09:19] T410531: Drop rc_type from 
recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:24:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P86198 and previous config saved to /var/cache/conftool/dbconfig/20251130-132420-marostegui.json [13:39:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P86199 and previous config saved to /var/cache/conftool/dbconfig/20251130-133928-marostegui.json [13:54:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T410531)', diff saved to https://phabricator.wikimedia.org/P86200 and previous config saved to /var/cache/conftool/dbconfig/20251130-135435-marostegui.json [13:54:42] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:54:52] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2202.codfw.wmnet with reason: Maintenance [13:58:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2203.codfw.wmnet with reason: Maintenance [13:59:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2203 (T410531)', diff saved to https://phabricator.wikimedia.org/P86201 and previous config saved to /var/cache/conftool/dbconfig/20251130-135906-marostegui.json [14:04:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T410531)', diff saved to https://phabricator.wikimedia.org/P86202 and previous config saved to /var/cache/conftool/dbconfig/20251130-140458-marostegui.json [14:05:05] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:09:02] (03PS1) 10Daniel Kinzler: rest-gateway: extract Lua code for testability [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213092 [14:20:06] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P86203 and previous config saved to /var/cache/conftool/dbconfig/20251130-142006-marostegui.json [14:32:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86204 and previous config saved to /var/cache/conftool/dbconfig/20251130-143255-marostegui.json [14:33:03] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:33:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:35:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P86205 and previous config saved to /var/cache/conftool/dbconfig/20251130-143513-marostegui.json [14:48:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P86206 and previous config saved to /var/cache/conftool/dbconfig/20251130-144802-marostegui.json [14:50:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T410531)', diff saved to https://phabricator.wikimedia.org/P86207 and previous config saved to /var/cache/conftool/dbconfig/20251130-145020-marostegui.json [14:50:26] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:50:37] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2216.codfw.wmnet with reason: Maintenance [14:50:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2216 (T410531)', diff saved to https://phabricator.wikimedia.org/P86208 and previous config saved to /var/cache/conftool/dbconfig/20251130-145043-marostegui.json [14:55:11] FIRING: CertAlmostExpired: 
Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:56:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T410531)', diff saved to https://phabricator.wikimedia.org/P86209 and previous config saved to /var/cache/conftool/dbconfig/20251130-145634-marostegui.json [14:56:40] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [15:03:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P86210 and previous config saved to /var/cache/conftool/dbconfig/20251130-150310-marostegui.json [15:09:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P86211 and previous config saved to /var/cache/conftool/dbconfig/20251130-151141-marostegui.json [15:18:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86212 and previous config saved to /var/cache/conftool/dbconfig/20251130-151817-marostegui.json [15:18:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [15:18:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [15:18:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1185.eqiad.wmnet with reason: 
Maintenance [15:18:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1185 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86213 and previous config saved to /var/cache/conftool/dbconfig/20251130-151841-marostegui.json [15:26:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216', diff saved to https://phabricator.wikimedia.org/P86214 and previous config saved to /var/cache/conftool/dbconfig/20251130-152649-marostegui.json [15:29:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [15:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:34:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2216 (T410531)', diff saved to https://phabricator.wikimedia.org/P86215 and previous config saved to /var/cache/conftool/dbconfig/20251130-154157-marostegui.json [15:42:04] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [16:30:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - 
https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:38:10] FIRING: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:40:11] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:43:10] RESOLVED: BFDdown: BFD session down between cr2-esams and fe80::ee38:7300:17e8:9c56 - https://wikitech.wikimedia.org/wiki/Network_monitoring#BFD_status - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-esams:9804 - https://alerts.wikimedia.org/?q=alertname%3DBFDdown [16:59:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86216 and previous config saved to /var/cache/conftool/dbconfig/20251130-165910-marostegui.json [16:59:19] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [16:59:20] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:14:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to https://phabricator.wikimedia.org/P86217 and previous config saved to /var/cache/conftool/dbconfig/20251130-171418-marostegui.json [17:29:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185', diff saved to 
https://phabricator.wikimedia.org/P86218 and previous config saved to /var/cache/conftool/dbconfig/20251130-172925-marostegui.json [17:44:33] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1185 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86219 and previous config saved to /var/cache/conftool/dbconfig/20251130-174433-marostegui.json [17:44:40] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [17:44:41] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [17:44:49] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1200.eqiad.wmnet with reason: Maintenance [17:44:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1200 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86220 and previous config saved to /var/cache/conftool/dbconfig/20251130-174456-marostegui.json [18:13:13] (PS18) Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT [deployment-charts] - https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: Pmiazga) [18:35:44] (CR) Daniel Kinzler: [C:-1] rest-gateway: extract Lua code for testability (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1213092 (owner: Daniel Kinzler) [18:40:50] (CR) Daniel Kinzler: [C:-1] rest-gateway: extract Lua code for testability (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1213092 (owner: Daniel Kinzler) [18:55:10] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:16:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles 
latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:17:20] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:21:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:22:20] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:24:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86221 and previous config saved to /var/cache/conftool/dbconfig/20251130-192424-marostegui.json [19:24:32] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:24:33] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:30:10] FIRING: 
[4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:30:11] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [19:37:19] (03PS10) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [19:39:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P86222 and previous config saved to /var/cache/conftool/dbconfig/20251130-193931-marostegui.json [19:54:40] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200', diff saved to https://phabricator.wikimedia.org/P86223 and previous config saved to /var/cache/conftool/dbconfig/20251130-195439-marostegui.json [20:09:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1200 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86224 and previous config saved to /var/cache/conftool/dbconfig/20251130-200947-marostegui.json [20:09:54] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:09:55] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:10:03] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1207.eqiad.wmnet with reason: Maintenance [20:10:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1207 (T411163 T411164)', diff saved to 
https://phabricator.wikimedia.org/P86225 and previous config saved to /var/cache/conftool/dbconfig/20251130-201010-marostegui.json [20:30:11] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [21:36:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86226 and previous config saved to /var/cache/conftool/dbconfig/20251130-213634-marostegui.json [21:36:42] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:36:43] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:51:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P86227 and previous config saved to /var/cache/conftool/dbconfig/20251130-215142-marostegui.json [22:06:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P86228 and previous config saved to /var/cache/conftool/dbconfig/20251130-220650-marostegui.json [22:09:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:11:17] FIRING: [4x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:13:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:16:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:18:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:19:09] (03CR) 10Thcipriani: [C:03+1] Mark Tyler as group approver for deployment-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1212057 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [22:19:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:21:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes 
(http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86229 and previous config saved to /var/cache/conftool/dbconfig/20251130-222157-marostegui.json [22:22:04] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:22:05] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:22:14] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1216.eqiad.wmnet with reason: Maintenance [22:23:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:24:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:24:59] (03PS1) 10Cory Massaro: wikifunctions: Downgrade evaluators from 2025-11-17-175029 to 2025-11-14-022545. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1213111 [22:25:14] (03CR) 10Cory Massaro: [C:03+2] wikifunctions: Downgrade evaluators from 2025-11-17-175029 to 2025-11-14-022545. 
[deployment-charts] - https://gerrit.wikimedia.org/r/1213111 (owner: Cory Massaro) [22:26:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:27:15] (Merged) jenkins-bot: wikifunctions: Downgrade evaluators from 2025-11-17-175029 to 2025-11-14-022545. [deployment-charts] - https://gerrit.wikimedia.org/r/1213111 (owner: Cory Massaro) [22:28:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1011:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:29:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:34:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [22:43:19] !log apine@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [22:43:54] !log apine@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [22:44:44] !log apine@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [22:45:00] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [22:45:34] !log 
apine@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [22:45:47] !log apine@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [22:46:21] ops-eqiad, SRE, DC-Ops: Q2:rack/setup/install Toolforge - https://phabricator.wikimedia.org/T410403#11418021 (Andrew) a:Andrew→None [22:46:33] !log apine@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [22:51:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [22:53:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [22:55:11] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [22:56:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [23:08:02] FIRING: [10x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming 
Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:09:47] (PS2) Andrew Bogott: P:cloudceph::osd: Convert drange to an array [puppet] - https://gerrit.wikimedia.org/r/1212138 (owner: Majavah) [23:09:49] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1212138 (owner: Majavah) [23:14:57] (CR) Andrew Bogott: [C:+1] P:cloudceph::osd: Convert drange to an array [puppet] - https://gerrit.wikimedia.org/r/1212138 (owner: Majavah) [23:21:16] RECOVERY - snapshot of s2 in eqiad on backupmon1001 is OK: Last snapshot for s2 at eqiad (db1225) taken on 2025-11-30 22:20:07 (878 GiB, +1.0 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [23:30:10] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [23:30:11] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [23:56:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh