[00:00:55] (CR) DDesouza: Deploy experiment for 2025 Global Readers Survey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1210729 (https://phabricator.wikimedia.org/T410696) (owner: DDesouza) [00:01:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P86007 and previous config saved to /var/cache/conftool/dbconfig/20251128-000151-marostegui.json [00:02:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:06:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252', diff saved to https://phabricator.wikimedia.org/P86008 and previous config saved to /var/cache/conftool/dbconfig/20251128-000604-marostegui.json [00:07:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:09:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:09:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is 
too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:14:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:14:26] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:16:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247', diff saved to https://phabricator.wikimedia.org/P86009 and previous config saved to /var/cache/conftool/dbconfig/20251128-001658-marostegui.json [00:21:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1252 (T410531)', diff saved to https://phabricator.wikimedia.org/P86010 and previous config saved to /var/cache/conftool/dbconfig/20251128-002111-marostegui.json [00:21:17] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [00:21:27] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1260.eqiad.wmnet with reason: Maintenance [00:21:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1260 (T410531)', diff saved to https://phabricator.wikimedia.org/P86011 and 
previous config saved to /var/cache/conftool/dbconfig/20251128-002134-marostegui.json [00:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:30:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:30:31] FIRING: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:32:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2247 (T410531)', diff saved to https://phabricator.wikimedia.org/P86012 and previous config saved to /var/cache/conftool/dbconfig/20251128-003206-marostegui.json [00:32:13] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [00:32:24] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2248.codfw.wmnet with reason: Maintenance [00:32:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2248 (T410531)', diff saved to https://phabricator.wikimedia.org/P86013 and previous config saved to /var/cache/conftool/dbconfig/20251128-003231-marostegui.json 
[00:35:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [00:35:31] RESOLVED: CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [00:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [00:40:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T410531)', diff saved to https://phabricator.wikimedia.org/P86014 and previous config saved to /var/cache/conftool/dbconfig/20251128-004046-marostegui.json [00:40:52] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [00:41:39] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1212287 [00:41:39] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1212287 (owner: TrainBranchBot) [00:49:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T410531)', diff saved to https://phabricator.wikimedia.org/P86015 and previous config saved to 
/var/cache/conftool/dbconfig/20251128-004901-marostegui.json [00:49:07] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [00:55:05] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1212287 (owner: TrainBranchBot) [00:55:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260', diff saved to https://phabricator.wikimedia.org/P86016 and previous config saved to /var/cache/conftool/dbconfig/20251128-005553-marostegui.json [01:00:50] !log mwpresync@deploy2002 Started scap build-images: Publishing wmf/next image [01:04:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P86017 and previous config saved to /var/cache/conftool/dbconfig/20251128-010408-marostegui.json [01:04:09] (PS1) Tim Starling: Fix accidentally removed stylesheet [extensions/TemplateSandbox] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1212292 (https://phabricator.wikimedia.org/T279736) [01:06:08] (CR) TrainBranchBot: [C:+2] "Approved by tstarling@deploy2002 using scap backport" [extensions/TemplateSandbox] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1212292 (https://phabricator.wikimedia.org/T279736) (owner: Tim Starling) [01:08:43] (Merged) jenkins-bot: Fix accidentally removed stylesheet [extensions/TemplateSandbox] (wmf/1.46.0-wmf.4) - https://gerrit.wikimedia.org/r/1212292 (https://phabricator.wikimedia.org/T279736) (owner: Tim Starling) [01:10:36] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1212297 [01:10:36] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1212297 (owner: TrainBranchBot) [01:11:02] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance 
db1260', diff saved to https://phabricator.wikimedia.org/P86018 and previous config saved to /var/cache/conftool/dbconfig/20251128-011101-marostegui.json [01:14:05] !log mwpresync@deploy2002 Finished scap build-images: Publishing wmf/next image (duration: 13m 15s) [01:14:19] !log tstarling@deploy2002 Started scap sync-world: Backport for [[gerrit:1212292|Fix accidentally removed stylesheet (T279736)]] [01:14:25] T279736: "Preview page with this template" should only accept/suggest pages that transclude the template - https://phabricator.wikimedia.org/T279736 [01:16:22] !log tstarling@deploy2002 tstarling: Backport for [[gerrit:1212292|Fix accidentally removed stylesheet (T279736)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. [01:18:41] !log tstarling@deploy2002 tstarling: Continuing with sync [01:19:17] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248', diff saved to https://phabricator.wikimedia.org/P86019 and previous config saved to /var/cache/conftool/dbconfig/20251128-011916-marostegui.json [01:23:53] !log tstarling@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212292|Fix accidentally removed stylesheet (T279736)]] (duration: 09m 33s) [01:23:59] T279736: "Preview page with this template" should only accept/suggest pages that transclude the template - https://phabricator.wikimedia.org/T279736 [01:26:09] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1260 (T410531)', diff saved to https://phabricator.wikimedia.org/P86020 and previous config saved to /var/cache/conftool/dbconfig/20251128-012608-marostegui.json [01:26:15] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [01:26:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1261.eqiad.wmnet with reason: Maintenance [01:26:33] !log marostegui@cumin1003 dbctl 
commit (dc=all): 'Depooling db1261 (T410531)', diff saved to https://phabricator.wikimedia.org/P86021 and previous config saved to /var/cache/conftool/dbconfig/20251128-012633-marostegui.json [01:33:56] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1212297 (owner: TrainBranchBot) [01:34:24] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2248 (T410531)', diff saved to https://phabricator.wikimedia.org/P86022 and previous config saved to /var/cache/conftool/dbconfig/20251128-013423-marostegui.json [01:34:30] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [01:42:15] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T410531)', diff saved to https://phabricator.wikimedia.org/P86023 and previous config saved to /var/cache/conftool/dbconfig/20251128-014214-marostegui.json [01:42:21] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [01:57:23] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P86024 and previous config saved to /var/cache/conftool/dbconfig/20251128-015722-marostegui.json [02:12:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261', diff saved to https://phabricator.wikimedia.org/P86025 and previous config saved to /var/cache/conftool/dbconfig/20251128-021229-marostegui.json [02:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [02:27:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1261 (T410531)', diff saved to 
https://phabricator.wikimedia.org/P86026 and previous config saved to /var/cache/conftool/dbconfig/20251128-022737-marostegui.json [02:27:43] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [02:27:53] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1262.eqiad.wmnet with reason: Maintenance [02:28:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1262 (T410531)', diff saved to https://phabricator.wikimedia.org/P86027 and previous config saved to /var/cache/conftool/dbconfig/20251128-022801-marostegui.json [02:44:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T410531)', diff saved to https://phabricator.wikimedia.org/P86028 and previous config saved to /var/cache/conftool/dbconfig/20251128-024403-marostegui.json [02:44:10] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [02:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [02:59:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff saved to https://phabricator.wikimedia.org/P86029 and previous config saved to /var/cache/conftool/dbconfig/20251128-025911-marostegui.json [03:08:57] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - kubemaster_6443: Servers wikikube-ctrl1004.eqiad.wmnet, wikikube-ctrl1002.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [03:09:57] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [03:14:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262', diff 
saved to https://phabricator.wikimedia.org/P86030 and previous config saved to /var/cache/conftool/dbconfig/20251128-031418-marostegui.json [03:29:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1262 (T410531)', diff saved to https://phabricator.wikimedia.org/P86031 and previous config saved to /var/cache/conftool/dbconfig/20251128-032926-marostegui.json [03:29:32] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [03:29:37] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [03:29:42] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1263.eqiad.wmnet with reason: Maintenance [03:29:50] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1263 (T410531)', diff saved to https://phabricator.wikimedia.org/P86032 and previous config saved to /var/cache/conftool/dbconfig/20251128-032949-marostegui.json [03:44:57] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T410531)', diff saved to https://phabricator.wikimedia.org/P86033 and previous config saved to /var/cache/conftool/dbconfig/20251128-034457-marostegui.json [03:45:03] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [03:58:17] FIRING: [7x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:00:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to https://phabricator.wikimedia.org/P86034 and previous 
config saved to /var/cache/conftool/dbconfig/20251128-040004-marostegui.json [04:03:17] FIRING: [24x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:17] FIRING: [28x] ProbeDown: Service wdqs1011:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:08:59] FIRING: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [04:10:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:13:17] FIRING: [25x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:15:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:15:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263', diff saved to 
https://phabricator.wikimedia.org/P86035 and previous config saved to /var/cache/conftool/dbconfig/20251128-041511-marostegui.json [04:22:47] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2151.codfw.wmnet with reason: Maintenance [04:22:55] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2151 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86036 and previous config saved to /var/cache/conftool/dbconfig/20251128-042254-marostegui.json [04:23:02] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [04:23:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [04:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [04:28:36] (PS1) Marostegui: clouddb102[23]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1212343 (https://phabricator.wikimedia.org/T409557) [04:29:11] (CR) Marostegui: [C:+2] clouddb102[23]: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/1212343 (https://phabricator.wikimedia.org/T409557) (owner: Marostegui) [04:30:02] FIRING: [4x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:30:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1263 (T410531)', diff saved to 
https://phabricator.wikimedia.org/P86037 and previous config saved to /var/cache/conftool/dbconfig/20251128-043018-marostegui.json [04:30:25] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [04:30:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [04:39:37] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [04:40:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [04:47:48] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1156.eqiad.wmnet with reason: Maintenance [04:47:56] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:49:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1222.eqiad.wmnet with reason: Maintenance [04:54:30] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2148.codfw.wmnet with reason: Maintenance [04:54:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2148 (T410531)', diff saved to https://phabricator.wikimedia.org/P86038 and previous config saved to /var/cache/conftool/dbconfig/20251128-045437-marostegui.json [04:54:45] 
T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [05:01:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T410531)', diff saved to https://phabricator.wikimedia.org/P86039 and previous config saved to /var/cache/conftool/dbconfig/20251128-050106-marostegui.json [05:01:12] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [05:05:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:10:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:16:14] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86040 and previous config saved to /var/cache/conftool/dbconfig/20251128-051613-marostegui.json [05:18:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:23:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:25:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:28:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:30:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:31:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P86041 and previous config saved to /var/cache/conftool/dbconfig/20251128-053121-marostegui.json [05:33:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:33:59] RESOLVED: SLOMetricAbsent: wdqs-main-update-lag codfw - https://slo.wikimedia.org/?search=wdqs-main-update-lag - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent [05:35:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:36:29] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1014,1018].eqiad.wmnet,db[1155-1156].eqiad.wmnet with reason: Schema change [05:38:17] FIRING: [18x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:38:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:40:02] FIRING: [9x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:41:49] (CR) Marostegui: "I am not familiar with MW coding enough to +1, but I can say that there are no hosts assigned to any groups since Wed" [mediawiki-config] - https://gerrit.wikimedia.org/r/1211664 (https://phabricator.wikimedia.org/T411088) (owner: Ladsgroup) [05:45:02] FIRING: [8x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:46:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T410531)', diff saved to https://phabricator.wikimedia.org/P86042 and previous config saved to /var/cache/conftool/dbconfig/20251128-054628-marostegui.json [05:46:34] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [05:46:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2175.codfw.wmnet with reason: Maintenance [05:46:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2175 (T410531)', diff saved to https://phabricator.wikimedia.org/P86043 and previous config saved to /var/cache/conftool/dbconfig/20251128-054641-marostegui.json [05:48:17] FIRING: [12x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:50:02] FIRING: [7x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:53:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T410531)', diff saved to https://phabricator.wikimedia.org/P86044 and previous config saved to /var/cache/conftool/dbconfig/20251128-055303-marostegui.json [05:53:09] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [05:53:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:55:02] FIRING: [6x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [05:55:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [05:57:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2011:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [05:59:05] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db2175 gradually with 4 steps - After schema change [06:02:43] RESOLVED: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2011:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [06:03:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [06:05:02] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [06:06:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2204.codfw.wmnet with reason: Maintenance [06:11:05] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1160.eqiad.wmnet with reason: Maintenance [06:24:37] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [06:27:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11414812 (10Jclark-ctr) a:05bking→03Jclark-ctr [06:28:32] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1184.eqiad.wmnet with reason: Maintenance [06:29:26] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db2175 gradually with 4 steps - After schema change [06:33:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1186.eqiad.wmnet with reason: Maintenance [06:33:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1186 (T410531)', diff saved to https://phabricator.wikimedia.org/P86048 and previous config saved to /var/cache/conftool/dbconfig/20251128-063341-marostegui.json [06:33:56] T410531: Drop rc_type from recentchanges in wmf production - 
https://phabricator.wikimedia.org/T410531 [06:40:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T410531)', diff saved to https://phabricator.wikimedia.org/P86049 and previous config saved to /var/cache/conftool/dbconfig/20251128-064028-marostegui.json [06:40:34] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [06:40:46] !log marostegui@cumin1003 START - Cookbook sre.mysql.pool db1184 gradually with 4 steps - After testing [06:50:48] (03CR) 10Muehlenhoff: "Can be abandoned, access was alreay granted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/1199484" [puppet] - 10https://gerrit.wikimedia.org/r/1196894 (https://phabricator.wikimedia.org/T406590) (owner: 10Ladsgroup) [06:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [06:55:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P86051 and previous config saved to /var/cache/conftool/dbconfig/20251128-065536-marostegui.json [06:57:59] !log arnaudb@cumin1003 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab [07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251128T0700) [07:10:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P86053 and previous config saved to /var/cache/conftool/dbconfig/20251128-071043-marostegui.json [07:25:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T410531)', diff saved to https://phabricator.wikimedia.org/P86055 and previous config saved to 
/var/cache/conftool/dbconfig/20251128-072551-marostegui.json [07:25:58] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:26:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1195.eqiad.wmnet with reason: Maintenance [07:26:13] !log marostegui@cumin1003 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1184 gradually with 4 steps - After testing [07:29:37] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [07:31:40] (03CR) 10Slyngshede: [C:03+1] Mark Tyler as group approver for deployment-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/1212057 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [07:32:03] (03CR) 10Slyngshede: [C:03+1] Add Guillaume as appprover for analytics-search-admins [puppet] - 10https://gerrit.wikimedia.org/r/1212061 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [07:32:58] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T410531)', diff saved to https://phabricator.wikimedia.org/P86057 and previous config saved to /var/cache/conftool/dbconfig/20251128-073257-marostegui.json [07:33:03] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [07:33:05] (03CR) 10Slyngshede: [C:03+1] Remove unused cassandra-test-roots group [puppet] - 10https://gerrit.wikimedia.org/r/1212140 (owner: 10Muehlenhoff) [07:33:37] (03CR) 10Slyngshede: [C:03+1] Add Guillaume as approver for two more analytics groups [puppet] - 10https://gerrit.wikimedia.org/r/1212168 (https://phabricator.wikimedia.org/T276465) (owner: 10Muehlenhoff) [07:35:39] (03PS1) 10Brouberol: Define the testkicthen discovery and public recoords 
[dns] - 10https://gerrit.wikimedia.org/r/1212400 (https://phabricator.wikimedia.org/T407805) [07:36:24] (03CR) 10Pmiazga: "LGTM, this is really interesting. I'll add to my todo to check this code better once I'm back. Nice job" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212107 (owner: 10Daniel Kinzler) [07:48:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P86058 and previous config saved to /var/cache/conftool/dbconfig/20251128-074804-marostegui.json [07:54:40] arnaudb@cumin1003 arnaudb: The backup on gitlab1004 is complete, ready to proceed with upgrade. [07:55:51] gitlab will be unavailable for a few minutes, an upgrade is in progress [07:56:56] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11414906 (10MoritzMuehlenhoff) [08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251128T0800) [08:00:51] (03PS12) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) [08:01:10] PROBLEM - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is CRITICAL: connect to address gitlab.wikimedia.org and port 443: Connection refused https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [08:02:10] RECOVERY - Gitlab HTTPS healthcheck on gitlab.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 118785 bytes in 0.709 second response time https://wikitech.wikimedia.org/wiki/GitLab%23Monitoring [08:03:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195', diff saved to https://phabricator.wikimedia.org/P86059 and previous config saved to /var/cache/conftool/dbconfig/20251128-080312-marostegui.json [08:03:20] (03PS1) 10Brouberol: testkicthen: allow reaching out to the mpic app via testkitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212418 (https://phabricator.wikimedia.org/T407805) [08:03:22] (03PS1) 10Brouberol: testkitchen: add the additional testkitchen.w.o domain to the ingress gateway hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212419 (https://phabricator.wikimedia.org/T407805) [08:03:23] (03PS1) 10Brouberol: Define the testkitchen kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212427 (https://phabricator.wikimedia.org/T407805) [08:03:24] (03PS1) 10Brouberol: testkitchen-next: set the OIDC callback URL doimain to testkitchen-next.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212420 (https://phabricator.wikimedia.org/T407805) [08:03:24] (03PS1) 10Brouberol: Define the testkitchen services [puppet] - 10https://gerrit.wikimedia.org/r/1212428 (https://phabricator.wikimedia.org/T407805) [08:03:26] (03PS1) 10Brouberol: 
testkitchen: set the OIDC callback URL domain to testkitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212421 (https://phabricator.wikimedia.org/T407805) [08:03:27] (03PS1) 10Brouberol: testkitchen: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212429 (https://phabricator.wikimedia.org/T407805) [08:03:32] (03PS1) 10Brouberol: testkitchen: allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1212430 (https://phabricator.wikimedia.org/T407805) [08:03:36] (03PS1) 10Brouberol: Rename mpic-next service to testkitchen-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212422 (https://phabricator.wikimedia.org/T407805) [08:03:40] (03PS1) 10Brouberol: testkitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212431 (https://phabricator.wikimedia.org/T407805) [08:03:44] (03PS1) 10Brouberol: testkitchen: drop mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [08:03:48] (03PS1) 10Brouberol: Rename mpic service to testkitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212423 (https://phabricator.wikimedia.org/T407805) [08:03:52] (03PS1) 10Brouberol: testkitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [08:03:56] (03PS1) 10Brouberol: testkitchen: drop the mpic.w.o SANs from the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212424 (https://phabricator.wikimedia.org/T407805) [08:04:00] (03PS1) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [08:04:04] (03PS1) 10Brouberol: testkitchen: drop the mpic.w.o domains from the ingress gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212425 (https://phabricator.wikimedia.org/T407805) [08:04:08] (03PS1) 
10Brouberol: Move mpic service mesh entry to testkitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [08:04:12] (03PS1) 10Brouberol: testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) [08:04:16] (03PS1) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [08:05:18] !log arnaudb@cumin1003 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: Upgrade gitlab [08:05:25] (03CR) 10CI reject: [V:04-1] testkitchen: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212429 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:05:25] FIRING: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:57] (03CR) 10CI reject: [V:04-1] testkitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212431 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:08:57] !log installing Linux 6.1.158 kernel on Bookworm hosts [08:08:57] (03PS2) 10Brouberol: Define the testkitchen kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212427 (https://phabricator.wikimedia.org/T407805) [08:08:57] (03PS2) 10Brouberol: Define the testkitchen services [puppet] - 10https://gerrit.wikimedia.org/r/1212428 (https://phabricator.wikimedia.org/T407805) [08:08:57] (03PS2) 10Brouberol: testkitchen: allow public access from the internet [puppet] - 10https://gerrit.wikimedia.org/r/1212430 (https://phabricator.wikimedia.org/T407805) [08:08:57] (03PS2) 10Brouberol: testkitchen: drop 
mpic.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212432 (https://phabricator.wikimedia.org/T407805) [08:08:58] (03PS2) 10Brouberol: testkitchen: rename the OIDC services [puppet] - 10https://gerrit.wikimedia.org/r/1212433 (https://phabricator.wikimedia.org/T407805) [08:09:00] (03PS2) 10Brouberol: mpic: delete kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212434 (https://phabricator.wikimedia.org/T407805) [08:09:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:04] (03PS2) 10Brouberol: Move mpic service mesh entry to testkitchen [puppet] - 10https://gerrit.wikimedia.org/r/1212435 (https://phabricator.wikimedia.org/T407805) [08:09:08] (03PS2) 10Brouberol: mpic: delete services from service list [puppet] - 10https://gerrit.wikimedia.org/r/1212436 (https://phabricator.wikimedia.org/T407805) [08:09:12] (03PS1) 10Brouberol: testkitchen: reconfigure the OIDC service ids to support 2 domains [puppet] - 10https://gerrit.wikimedia.org/r/1212437 (https://phabricator.wikimedia.org/T407805) [08:09:16] (03PS1) 10Brouberol: testkitchen-next: drop mpic-next.w.o from OIDC service [puppet] - 10https://gerrit.wikimedia.org/r/1212438 (https://phabricator.wikimedia.org/T407805) [08:09:20] (03CR) 10CI reject: [V:04-1] testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:10:25] RESOLVED: SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:18:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1195 (T410531)', diff saved to https://phabricator.wikimedia.org/P86060 and previous config saved to /var/cache/conftool/dbconfig/20251128-081820-marostegui.json 
[08:18:26] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:18:36] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1196.eqiad.wmnet with reason: Maintenance [08:18:45] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1013,1017].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [08:18:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1196 (T410531)', diff saved to https://phabricator.wikimedia.org/P86061 and previous config saved to /var/cache/conftool/dbconfig/20251128-081852-marostegui.json [08:25:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T410531)', diff saved to https://phabricator.wikimedia.org/P86062 and previous config saved to /var/cache/conftool/dbconfig/20251128-082529-marostegui.json [08:25:35] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [08:27:02] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:37:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86063 and previous config saved to /var/cache/conftool/dbconfig/20251128-083755-marostegui.json [08:38:03] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [08:38:03] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [08:39:37] FIRING: [2x] ConfdResourceFailed: confd resource 
_srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [08:40:38] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P86064 and previous config saved to /var/cache/conftool/dbconfig/20251128-084037-marostegui.json [08:41:42] (03PS2) 10Brouberol: Define the testkitchen discovery and public recoords [dns] - 10https://gerrit.wikimedia.org/r/1212400 (https://phabricator.wikimedia.org/T407805) [08:41:50] (03PS3) 10Brouberol: Define the testkitchen discovery and public recoords [dns] - 10https://gerrit.wikimedia.org/r/1212400 (https://phabricator.wikimedia.org/T407805) [08:42:45] (03PS3) 10Bartosz Wójtowicz: ml-services: Separate eqiad and codfw deployments for Revise Tone. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) [08:49:03] (03CR) 10Santiago Faci: [C:03+1] Define the testkitchen discovery and public recoords [dns] - 10https://gerrit.wikimedia.org/r/1212400 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:49:13] (03CR) 10Brouberol: [C:03+2] Define the testkitchen discovery and public recoords [dns] - 10https://gerrit.wikimedia.org/r/1212400 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:49:26] !log brouberol@dns1004 START - running authdns-update [08:50:38] !log brouberol@dns1004 END - running authdns-update [08:51:52] (03CR) 10Brouberol: [C:03+2] Define the testkitchen kubeconfigs [puppet] - 10https://gerrit.wikimedia.org/r/1212427 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:51:55] (03CR) 10Brouberol: [C:03+2] Define the testkitchen services [puppet] - 10https://gerrit.wikimedia.org/r/1212428 (https://phabricator.wikimedia.org/T407805) (owner: 10Brouberol) [08:53:03] !log 
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86065 and previous config saved to /var/cache/conftool/dbconfig/20251128-085303-marostegui.json [08:55:45] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P86066 and previous config saved to /var/cache/conftool/dbconfig/20251128-085544-marostegui.json [08:59:14] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: Upgrade and reboot [08:59:16] (03PS2) 10Brouberol: testkicthen: allow reaching out to the mpic app via testkitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212418 (https://phabricator.wikimedia.org/T407805) [08:59:17] (03PS2) 10Brouberol: testkitchen: add the additional testkitchen.w.o domain to the ingress gateway hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212419 (https://phabricator.wikimedia.org/T407805) [08:59:19] (03PS2) 10Brouberol: testkitchen-next: set the OIDC callback URL doimain to testkitchen-next.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212420 (https://phabricator.wikimedia.org/T407805) [08:59:21] (03PS2) 10Brouberol: testkitchen: set the OIDC callback URL domain to testkitchen.w.o [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212421 (https://phabricator.wikimedia.org/T407805) [08:59:23] (03PS2) 10Brouberol: Rename mpic-next service to testkitchen-next [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212422 (https://phabricator.wikimedia.org/T407805) [08:59:25] (03PS2) 10Brouberol: Rename mpic service to testkitchen [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212423 (https://phabricator.wikimedia.org/T407805) [08:59:27] (03PS2) 10Brouberol: testkitchen: drop the mpic.w.o SANs from the certificate [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212424 
(https://phabricator.wikimedia.org/T407805) [08:59:30] (03PS2) 10Brouberol: testkitchen: drop the mpic.w.o domains from the ingress gateways [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212425 (https://phabricator.wikimedia.org/T407805) [08:59:34] (03PS2) 10Brouberol: testkitchen: rename the OIDC services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212426 (https://phabricator.wikimedia.org/T407805) [08:59:38] (03CR) 10Bartosz Wójtowicz: ml-services: Separate eqiad and codfw deployments for Revise Tone. (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:08:11] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P86067 and previous config saved to /var/cache/conftool/dbconfig/20251128-090810-marostegui.json [09:08:17] FIRING: [10x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:10:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T410531)', diff saved to https://phabricator.wikimedia.org/P86068 and previous config saved to /var/cache/conftool/dbconfig/20251128-091052-marostegui.json [09:10:58] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:11:09] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1206.eqiad.wmnet with reason: Maintenance [09:11:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1206 (T410531)', diff saved to https://phabricator.wikimedia.org/P86069 and previous config saved to /var/cache/conftool/dbconfig/20251128-091116-marostegui.json [09:15:43] FIRING: 
ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:17:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T410531)', diff saved to https://phabricator.wikimedia.org/P86070 and previous config saved to /var/cache/conftool/dbconfig/20251128-091712-marostegui.json [09:17:18] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [09:18:17] FIRING: [15x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:18:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [09:21:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:23:17] FIRING: [20x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:23:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T411163 T411164)', diff saved to 
https://phabricator.wikimedia.org/P86071 and previous config saved to /var/cache/conftool/dbconfig/20251128-092318-marostegui.json [09:23:25] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [09:23:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [09:23:34] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2158.codfw.wmnet with reason: Maintenance [09:23:42] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2158 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86072 and previous config saved to /var/cache/conftool/dbconfig/20251128-092341-marostegui.json [09:26:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:26:47] (03CR) 10Klausman: [C:03+1] Remove GPU settings from ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1211682 (https://phabricator.wikimedia.org/T411082) (owner: 10Elukey) [09:26:50] (03CR) 10Klausman: [C:03+2] Remove GPU settings from ml-serve1001 [puppet] - 10https://gerrit.wikimedia.org/r/1211682 (https://phabricator.wikimedia.org/T411082) (owner: 10Elukey) [09:27:10] !log klausman@cumin1003 START - Cookbook sre.hosts.reimage for host ml-serve1001.eqiad.wmnet with OS trixie [09:27:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team, 13Patch-For-Review: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11415058 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by klausman@cumin1003 for host ml-serve1001.eqiad.wmnet with OS trixie [09:28:17] FIRING: [19x] ProbeDown: Service 
wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:28:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [09:30:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:32:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P86073 and previous config saved to /var/cache/conftool/dbconfig/20251128-093219-marostegui.json [09:33:17] FIRING: [17x] ProbeDown: Service wdqs2007:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:33:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:35:50] (03CR) 10Klausman: [C:03+1] ml-services: Separate eqiad and codfw deployments for Revise Tone. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [09:35:58] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:36:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:38:17] FIRING: [15x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:39:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:43:17] FIRING: [10x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:44:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:44:47] !log klausman@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on 
ml-serve1001.eqiad.wmnet with reason: host reimage [09:44:59] (03PS1) 10Muehlenhoff: Add reuse Partman config for EFIfied DBs [puppet] - 10https://gerrit.wikimedia.org/r/1212528 (https://phabricator.wikimedia.org/T410400) [09:46:02] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2012:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:47:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P86074 and previous config saved to /var/cache/conftool/dbconfig/20251128-094727-marostegui.json [09:48:07] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve1001.eqiad.wmnet with reason: host reimage [09:54:22] (03PS1) 10Muehlenhoff: Remove udp2log-users [puppet] - 10https://gerrit.wikimedia.org/r/1212529 [09:55:53] (03PS1) 10Muehlenhoff: Replace Leo as group approver with Hugh [puppet] - 10https://gerrit.wikimedia.org/r/1212530 [09:58:48] (03CR) 10Bartosz Wójtowicz: [C:03+2] ml-services: Separate eqiad and codfw deployments for Revise Tone. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:00:36] (03Merged) 10jenkins-bot: ml-services: Separate eqiad and codfw deployments for Revise Tone. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211640 (https://phabricator.wikimedia.org/T408538) (owner: 10Bartosz Wójtowicz) [10:02:19] !log bwojtowicz@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . 
[10:02:35] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T410531)', diff saved to https://phabricator.wikimedia.org/P86075 and previous config saved to /var/cache/conftool/dbconfig/20251128-100234-marostegui.json [10:02:41] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:02:51] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1218.eqiad.wmnet with reason: Maintenance [10:02:59] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1218 (T410531)', diff saved to https://phabricator.wikimedia.org/P86076 and previous config saved to /var/cache/conftool/dbconfig/20251128-100258-marostegui.json [10:04:50] !log klausman@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve1001.eqiad.wmnet with OS trixie [10:04:58] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11415142 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by klausman@cumin1003 for host ml-serve1001.eqiad.wmnet with OS trixie completed: - ml-serve1001... [10:05:56] !log bwojtowicz@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revise-tone-task-generator' for release 'main' . [10:06:07] (03PS2) 10Muehlenhoff: Add reuse Partman config for EFIfied DBs [puppet] - 10https://gerrit.wikimedia.org/r/1212528 (https://phabricator.wikimedia.org/T410400) [10:07:04] (03PS2) 10Ayounsi: re-add pc* to the clusters with no AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1212110 (https://phabricator.wikimedia.org/T253173) [10:08:31] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Remove old GPUs from ml-serve1001 - https://phabricator.wikimedia.org/T411082#11415149 (10klausman) 05Open→03Resolved Machine has been reimaged and is back in the cluster, closing. 
[10:09:06] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T410531)', diff saved to https://phabricator.wikimedia.org/P86077 and previous config saved to /var/cache/conftool/dbconfig/20251128-100905-marostegui.json [10:09:11] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:17:14] (03CR) 10Ayounsi: [C:03+2] re-add pc* to the clusters with no AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1212110 (https://phabricator.wikimedia.org/T253173) (owner: 10Ayounsi) [10:17:59] (03CR) 10Ayounsi: [C:03+2] "Merging that change for PC hosts only, restbase requires a longer conversation as detailed in https://phabricator.wikimedia.org/T271140#11" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1212110 (https://phabricator.wikimedia.org/T253173) (owner: 10Ayounsi) [10:18:59] FIRING: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:19:18] (03Merged) 10jenkins-bot: re-add pc* to the clusters with no AAAA [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1212110 (https://phabricator.wikimedia.org/T253173) (owner: 10Ayounsi) [10:21:28] (03PS1) 10Bartosz Wójtowicz: ml-services: Update image_version for revise-tone-task-generator. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1212536 (https://phabricator.wikimedia.org/T408538) [10:24:02] (03CR) 10Awight: "@cwhite@wikimedia.org FYI, the job is running for the next 10 hours or so, you could use this opportunity to test and deploy the Prometheu" [puppet] - 10https://gerrit.wikimedia.org/r/1207174 (https://phabricator.wikimedia.org/T402613) (owner: 10Awight) [10:24:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P86078 and previous config saved to /var/cache/conftool/dbconfig/20251128-102412-marostegui.json [10:26:03] !log ayounsi@cumin1003 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [10:26:35] !log ayounsi@cumin1003 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [10:28:59] RESOLVED: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:08] (03PS1) 10Kevin Bazira: ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212544 (https://phabricator.wikimedia.org/T410906) [10:35:56] (03PS3) 10Muehlenhoff: Add reuse Partman config for EFIfied DBs [puppet] - 10https://gerrit.wikimedia.org/r/1212528 (https://phabricator.wikimedia.org/T410400) [10:36:51] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:37:41] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [10:38:10] (03CR) 10Blake: [C:03+2] alerting: Add an alert for when Kafka brokers need a rolling restart. [alerts] - 10https://gerrit.wikimedia.org/r/1212113 (owner: 10Blake) [10:38:30] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] START helmfile.d/admin 'apply'. 
[10:39:03] !log brouberol@deploy2002 helmfile [dse-k8s-codfw] DONE helmfile.d/admin 'apply'. [10:39:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218', diff saved to https://phabricator.wikimedia.org/P86079 and previous config saved to /var/cache/conftool/dbconfig/20251128-103920-marostegui.json [10:39:44] (03Merged) 10jenkins-bot: alerting: Add an alert for when Kafka brokers need a rolling restart. [alerts] - 10https://gerrit.wikimedia.org/r/1212113 (owner: 10Blake) [10:43:34] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11415240 (10MoritzMuehlenhoff) [10:44:25] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11415242 (10MoritzMuehlenhoff) [10:44:58] (03CR) 10Muehlenhoff: [C:03+2] Add reuse Partman config for EFIfied DBs [puppet] - 10https://gerrit.wikimedia.org/r/1212528 (https://phabricator.wikimedia.org/T410400) (owner: 10Muehlenhoff) [10:50:11] (03CR) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes (037 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) (owner: 10Matthieulec) [10:53:32] (03CR) 10Dpogorzelski: [C:03+1] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212544 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [10:54:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1218 (T410531)', diff saved to https://phabricator.wikimedia.org/P86080 and previous config saved to /var/cache/conftool/dbconfig/20251128-105427-marostegui.json [10:54:34] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [10:54:37] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is 
about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [10:54:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1219.eqiad.wmnet with reason: Maintenance [10:54:52] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1219 (T410531)', diff saved to https://phabricator.wikimedia.org/P86081 and previous config saved to /var/cache/conftool/dbconfig/20251128-105451-marostegui.json [10:57:40] !log jynus@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: Upgrade and reboot [10:58:59] FIRING: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:00:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T410531)', diff saved to https://phabricator.wikimedia.org/P86082 and previous config saved to /var/cache/conftool/dbconfig/20251128-110052-marostegui.json [11:00:59] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:03:59] RESOLVED: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:12:41] (03CR) 10Kevin Bazira: [C:03+2] ml-services: update llm model-server image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212544 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [11:14:42] (03Merged) 10jenkins-bot: ml-services: update llm model-server image [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1212544 (https://phabricator.wikimedia.org/T410906) (owner: 10Kevin Bazira) [11:15:44] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'llm' for release 'main' . [11:16:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P86083 and previous config saved to /var/cache/conftool/dbconfig/20251128-111600-marostegui.json [11:20:49] (03PS13) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) [11:24:42] (03CR) 10Sergio Gimeno: [C:03+1] "The dependency is now merged and this can deployed" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1207133 (https://phabricator.wikimedia.org/T407431) (owner: 10Cyndywikime) [11:28:59] FIRING: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:29:37] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [11:31:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219', diff saved to https://phabricator.wikimedia.org/P86084 and previous config saved to /var/cache/conftool/dbconfig/20251128-113107-marostegui.json [11:31:18] (03Abandoned) 10Ladsgroup: admin: Add neslihanturan to restricted [puppet] - 10https://gerrit.wikimedia.org/r/1196894 (https://phabricator.wikimedia.org/T406590) (owner: 10Ladsgroup) [11:33:59] RESOLVED: 
JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:42:58] (03PS1) 10Volans: wmcs infra-tracing: optimize Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) [11:43:19] (03PS1) 10Daniel Kinzler: api gateway: add CDCN headersw to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 [11:43:27] (03CR) 10CI reject: [V:04-1] wmcs infra-tracing: optimize Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:43:54] (03PS2) 10Daniel Kinzler: api gateway: add CDN headers to access log [deployment-charts] - 10https://gerrit.wikimedia.org/r/1212560 [11:43:59] FIRING: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:44:11] (03PS2) 10Volans: wmcs infra-tracing: optimize Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) [11:45:52] (03CR) 10CI reject: [V:04-1] wmcs infra-tracing: optimize Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:46:16] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1219 (T410531)', diff saved to https://phabricator.wikimedia.org/P86085 and previous config saved to /var/cache/conftool/dbconfig/20251128-114615-marostegui.json [11:46:21] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:46:33] !log 
marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1232.eqiad.wmnet with reason: Maintenance [11:46:41] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1232 (T410531)', diff saved to https://phabricator.wikimedia.org/P86086 and previous config saved to /var/cache/conftool/dbconfig/20251128-114640-marostegui.json [11:47:12] (03PS3) 10Volans: wmcs infra-tracing: optimize Loki indexing [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) [11:50:02] (03CR) 10Volans: "Python script tested on a toolsbseta's nfs worker." [puppet] - 10https://gerrit.wikimedia.org/r/1212559 (https://phabricator.wikimedia.org/T399313) (owner: 10Volans) [11:52:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T410531)', diff saved to https://phabricator.wikimedia.org/P86087 and previous config saved to /var/cache/conftool/dbconfig/20251128-115238-marostegui.json [11:52:45] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [11:53:40] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T410589)', diff saved to https://phabricator.wikimedia.org/P86088 and previous config saved to /var/cache/conftool/dbconfig/20251128-115340-ladsgroup.json [11:53:46] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [11:58:19] (03PS4) 10Blake: alertmanager: Add a Phabricator receiver for serviceops. [puppet] - 10https://gerrit.wikimedia.org/r/1212554 (https://phabricator.wikimedia.org/T410552) [12:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251128T0800) [12:00:05] jelto, arnoldokoth, and mutante: #bothumor My software never has bugs. It just develops random features. Rise for GitLab version upgrades. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20251128T1200). [12:01:02] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11415521 (10MLechvien-WMF) The new --rack argument will be mutually exclusive with the hosts query. The following changes have also been fa... [12:06:29] (03PS1) 10Muehlenhoff: Test reuse workflow on db1169 [puppet] - 10https://gerrit.wikimedia.org/r/1212563 (https://phabricator.wikimedia.org/T410400) [12:07:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P86089 and previous config saved to /var/cache/conftool/dbconfig/20251128-120746-marostegui.json [12:08:01] (03CR) 10Alexandros Kosiaris: [C:03+1] alertmanager: Add a Phabricator receiver for serviceops. [puppet] - 10https://gerrit.wikimedia.org/r/1212554 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [12:08:47] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P86090 and previous config saved to /var/cache/conftool/dbconfig/20251128-120847-ladsgroup.json [12:10:00] (03CR) 10Blake: [C:03+2] alertmanager: Add a Phabricator receiver for serviceops. 
[puppet] - 10https://gerrit.wikimedia.org/r/1212554 (https://phabricator.wikimedia.org/T410552) (owner: 10Blake) [12:15:17] (03CR) 10Daniel Kinzler: api-gateway: Rest-gateway Read `ratelimit_class` and `user_id` from JWT (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1192579 (https://phabricator.wikimedia.org/T405578) (owner: 10Pmiazga) [12:22:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P86091 and previous config saved to /var/cache/conftool/dbconfig/20251128-122253-marostegui.json [12:23:55] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P86092 and previous config saved to /var/cache/conftool/dbconfig/20251128-122354-ladsgroup.json [12:24:55] (03CR) 10Muehlenhoff: [C:03+2] Test reuse workflow on db1169 [puppet] - 10https://gerrit.wikimedia.org/r/1212563 (https://phabricator.wikimedia.org/T410400) (owner: 10Muehlenhoff) [12:27:44] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11415594 (10Jclark-ctr) a:03Jclark-ctr service request 219355025 [12:29:59] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:30:32] (03PS9) 10Daniel Kinzler: rest-gateway: assign ratelimit class by network range [deployment-charts] - 10https://gerrit.wikimedia.org/r/1206956 (https://phabricator.wikimedia.org/T410273) [12:34:05] 10ops-eqiad, 06SRE, 06DC-Ops, 06Data-Platform-SRE (2025.11.07 - 2025.11.28): Degraded RAID on an-worker1191 - https://phabricator.wikimedia.org/T411209#11415631 (10Jclark-ctr) @BTullis parts should arrive 
Monday. they are shipping 2x drives [12:34:58] (03PS1) 10Majavah: Add former Toki Pona language codes [dns] - 10https://gerrit.wikimedia.org/r/1212577 (https://phabricator.wikimedia.org/T404507) [12:36:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [12:38:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T410531)', diff saved to https://phabricator.wikimedia.org/P86093 and previous config saved to /var/cache/conftool/dbconfig/20251128-123801-marostegui.json [12:38:08] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:38:18] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1234.eqiad.wmnet with reason: Maintenance [12:38:26] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1234 (T410531)', diff saved to https://phabricator.wikimedia.org/P86094 and previous config saved to /var/cache/conftool/dbconfig/20251128-123825-marostegui.json [12:38:35] (03PS1) 10Majavah: mediawiki: Add redirects for old Toki Pona aliases [puppet] - 10https://gerrit.wikimedia.org/r/1212578 (https://phabricator.wikimedia.org/T404507) [12:39:02] !log ladsgroup@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T410589)', diff saved to https://phabricator.wikimedia.org/P86095 and previous config saved to /var/cache/conftool/dbconfig/20251128-123902-ladsgroup.json [12:39:08] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11415666 (10MLechvien-WMF) With test-cookbook in dryrun @Raine and I tested following 6 test cases: - In cluster wikikube-eqiad, action c... 
[12:39:08] T410589: Optimize all core tables, late 2025 - https://phabricator.wikimedia.org/T410589 [12:39:19] !log ladsgroup@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1239.eqiad.wmnet with reason: Maintenance [12:39:59] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [12:43:02] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:44:21] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T410531)', diff saved to https://phabricator.wikimedia.org/P86096 and previous config saved to /var/cache/conftool/dbconfig/20251128-124420-marostegui.json [12:44:27] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [12:45:18] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11415702 (10MoritzMuehlenhoff) [12:47:19] (03PS14) 10Matthieulec: sre.k8s.pool-depool-node: Adding a --rack flag for more intuitive operations, and more validations to avoid mistakes [cookbooks] - 10https://gerrit.wikimedia.org/r/1212089 (https://phabricator.wikimedia.org/T410537) [12:48:02] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:49:47] (03PS1) 10Dpogorzelski: dpogorzelski: add yk backend prod key [puppet] - 10https://gerrit.wikimedia.org/r/1212581 [12:53:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [12:54:59] FIRING: [10x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:13] (03CR) 10Muehlenhoff: [C:03+1] "Looks good and verified out of band" [puppet] - 10https://gerrit.wikimedia.org/r/1212581 (owner: 10Dpogorzelski) [12:57:34] (03CR) 10Dpogorzelski: [C:03+2] dpogorzelski: add yk backend prod key [puppet] - 10https://gerrit.wikimedia.org/r/1212581 (owner: 10Dpogorzelski) [12:59:29] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P86097 and previous config saved to /var/cache/conftool/dbconfig/20251128-125928-marostegui.json [12:59:59] FIRING: [18x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:03] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11415797 (10Jclark-ctr) Supermicro has been slow to respond to case updates, usually taking 2–3 
days between replies, while I have been responding the same day. They finally agreed to ship the replacement pa... [13:02:30] (03CR) 10Kamila Součková: [C:03+1] mediawiki: Add redirects for old Toki Pona aliases [puppet] - 10https://gerrit.wikimedia.org/r/1212578 (https://phabricator.wikimedia.org/T404507) (owner: 10Majavah) [13:03:27] 10SRE-tools, 06Infrastructure-Foundations, 06serviceops, 13Patch-For-Review: Add a --rack flag to sre.k8s.pool-depool-node - https://phabricator.wikimedia.org/T410537#11415802 (10MLechvien-WMF) One more test case: a rack that does not exist in the cluster: `test-cookbook -c 1212089 --dry-run sre.k8s.pool-d... [13:03:49] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11415804 (10MoritzMuehlenhoff) >>! In T410743#11415797, @Jclark-ctr wrote: > Supermicro has been slow to respond to case updates, usually taking 2–3 days between replies, while I have been responding the sam... [13:04:41] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1039 - https://phabricator.wikimedia.org/T410743#11415810 (10Jclark-ctr) I have CC Willy on email chain additionally [13:04:59] FIRING: [18x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:05:35] (03CR) 10Kamila Součková: [C:03+1] Add former Toki Pona language codes [dns] - 10https://gerrit.wikimedia.org/r/1212577 (https://phabricator.wikimedia.org/T404507) (owner: 10Majavah) [13:08:05] 06SRE, 10Math: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - https://phabricator.wikimedia.org/T344329#11415822 (10SLyngshede-WMF) [13:08:43] 06SRE, 10Math: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - 
https://phabricator.wikimedia.org/T344329#11415824 (10SLyngshede-WMF) I'm removing Traffic, we're not going to find this two years later. [13:11:21] (03PS1) 10Mszwarc: Fix mw-userlink class being added too broadly [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212584 (https://phabricator.wikimedia.org/T392775) [13:12:27] arnoldokoth, moritzm, jnuche: per discussion in _security we’ll probably want to do an emergency deploy to revert https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1212204 (to fix the errors mentioned at T410696#11415793) [13:12:28] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [13:12:51] I can do the deploy if needed but I wouldn’t mind if someone else does it, I’m supposed to be preparing for interviewing someone ^^ [13:13:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2007:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:14:36] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P86098 and previous config saved to /var/cache/conftool/dbconfig/20251128-131435-marostegui.json [13:17:36] Lucas_WMDE: responded in _security. 
Ok with emergency deploy [13:18:11] (03PS1) 10Ladsgroup: Revert "Deploy 2025 Global Readers Survey (non-enwiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212585 [13:18:19] I deploy it [13:18:31] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Deploy 2025 Global Readers Survey (non-enwiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212585 (https://phabricator.wikimedia.org/T410696) (owner: 10Ladsgroup) [13:18:50] ack, I just quickly added the Bug: to the commit message [13:18:54] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Revert "Deploy 2025 Global Readers Survey (non-enwiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212585 (https://phabricator.wikimedia.org/T410696) (owner: 10Ladsgroup) [13:19:24] (I was holding off for a bit due to https://phabricator.wikimedia.org/T410696#11415851, but on the other hand if the coverage is 0 it sounds like reverting the config change is safe to do anyway) [13:20:08] Available to help if anything unexpected happens. 
[13:20:17] same [13:20:29] Amir1: FWIW, https://de.wikipedia.org/wiki/MediaWiki:Reader-demographics-2025-de-survey-link exists now and the -de errors appear to have stopped in logstash [13:20:50] (but it’s still fine to continue deploying IMHO) [13:21:09] Yeah, it's a test survey it seems, it's better to revert IMHO [13:21:27] https://phabricator.wikimedia.org/T410696#11415851 so it won't break any user feature [13:21:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212585 (https://phabricator.wikimedia.org/T410696) (owner: 10Ladsgroup) [13:22:01] yup [13:22:46] (03Merged) 10jenkins-bot: Revert "Deploy 2025 Global Readers Survey (non-enwiki)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212585 (https://phabricator.wikimedia.org/T410696) (owner: 10Ladsgroup) [13:23:08] !log ladsgroup@deploy2002 Started scap sync-world: Backport for [[gerrit:1212585|Revert "Deploy 2025 Global Readers Survey (non-enwiki)" (T410696)]] [13:23:14] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [13:25:11] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1212585|Revert "Deploy 2025 Global Readers Survey (non-enwiki)" (T410696)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there. 
[13:25:31] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [13:25:50] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS trixie [13:26:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11415871 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1032.eqiad.wmnet with OS trixie [13:26:22] Amir1: also, if you used SpiderPig then we could all spy on the progress of your scap 👉👈 🥺 [13:29:04] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:29:04] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-query_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:29:19] Lucas_WMDE: I really should, just too lazy [13:29:38] !log ladsgroup@deploy2002 Finished scap sync-world: Backport for [[gerrit:1212585|Revert "Deploy 2025 Global Readers Survey (non-enwiki)" (T410696)]] (duration: 06m 30s) [13:29:43] T410696: Deploy enwiki edition of 2025 GRS - https://phabricator.wikimedia.org/T410696 [13:29:44] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T410531)', diff saved to https://phabricator.wikimedia.org/P86099 and previous config saved to /var/cache/conftool/dbconfig/20251128-132943-marostegui.json [13:29:49] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:29:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage 
[13:29:59] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1235.eqiad.wmnet with reason: Maintenance [13:30:04] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:04] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [13:30:07] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1235 (T410531)', diff saved to https://phabricator.wikimedia.org/P86100 and previous config saved to /var/cache/conftool/dbconfig/20251128-133006-marostegui.json [13:30:25] logstash is looking much better to me now, yay [13:31:00] https://grafana.wikimedia.org/d/190fe8e9-70fc-498b-83f7-fd237f6c53ae/mediawiki-square-one-mw-web?orgId=1&from=now-1h&to=now&timezone=utc&var-site=eqiad&var-site=codfw&var-deployment=mw-web&var-percentile=50&var-cron=.%2A&viewPanel=panel-19 [13:32:29] thanks for deploying! [13:36:12] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T410531)', diff saved to https://phabricator.wikimedia.org/P86101 and previous config saved to /var/cache/conftool/dbconfig/20251128-133610-marostegui.json [13:36:18] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [13:36:27] (deployment on fridays?) 
[13:36:42] yes, emergency deployment [13:37:30] tappof: FYI, the pybal error above was Thanos getting OOM-killed on titan1001 [13:38:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:40:18] 06SRE, 10Math: Determine the cause of x8 increase in requests to math endpoints between july 6 and August 3 2023 - https://phabricator.wikimedia.org/T344329#11415922 (10Physikerwelt) 05Open→03Declined [13:42:03] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86102 and previous config saved to /var/cache/conftool/dbconfig/20251128-134202-marostegui.json [13:42:10] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [13:42:11] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [13:43:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:47:08] (03CR) 10Pppery: [C:03+1] mediawiki: Add redirects for old Toki Pona aliases [puppet] - 10https://gerrit.wikimedia.org/r/1212578 (https://phabricator.wikimedia.org/T404507) (owner: 10Majavah) [13:48:02] FIRING: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:49:50] (03CR) 10Daniel Kinzler: api-gateway chart: add values-rest-staging.yaml (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1211656 (owner: 10Daniel Kinzler) [13:51:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P86103 and previous config saved to /var/cache/conftool/dbconfig/20251128-135119-marostegui.json [13:51:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [13:52:20] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS bookworm [13:52:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11415948 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm [13:53:02] RESOLVED: [3x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2008:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [13:56:08] !log hashar@deploy2002 Started deploy [integration/docroot@607a959]: build: Updating eslint-config-wikimedia to 0.32.2 [13:56:19] !log hashar@deploy2002 Finished deploy [integration/docroot@607a959]: build: Updating eslint-config-wikimedia to 0.32.2 
(duration: 00m 11s) [13:57:10] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86104 and previous config saved to /var/cache/conftool/dbconfig/20251128-135710-marostegui.json [14:04:59] RESOLVED: JobUnavailable: Reduced availability for job thanos-query-frontend in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:06:27] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P86105 and previous config saved to /var/cache/conftool/dbconfig/20251128-140626-marostegui.json [14:12:18] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P86106 and previous config saved to /var/cache/conftool/dbconfig/20251128-141217-marostegui.json [14:19:54] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416033 (10Jclark-ctr) @RKemper When i look up the file in the failed it still list wdqs servers not in regex format http://apt.wikimedia.org/autoinstall/presee... [14:20:36] (03CR) 10Cathal Mooney: "Nice! 
Thanks so much for the patch, looks good to me let's see what Luca thinks he is more versed in Python than us netops :)" [software/homer] - 10https://gerrit.wikimedia.org/r/1212243 (owner: 10E75ti) [14:21:34] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T410531)', diff saved to https://phabricator.wikimedia.org/P86107 and previous config saved to /var/cache/conftool/dbconfig/20251128-142133-marostegui.json [14:21:39] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1239.eqiad.wmnet with reason: Maintenance [14:21:39] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:26:08] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1240.eqiad.wmnet with reason: Maintenance [14:27:18] (03CR) 10Elukey: [C:03+1] capirca: python 3.12 deprecates datetime.utcnow() [software/homer] - 10https://gerrit.wikimedia.org/r/1212243 (owner: 10E75ti) [14:27:25] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86108 and previous config saved to /var/cache/conftool/dbconfig/20251128-142725-marostegui.json [14:27:33] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [14:27:33] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [14:27:41] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2169.codfw.wmnet with reason: Maintenance [14:27:49] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86109 and previous config saved to /var/cache/conftool/dbconfig/20251128-142748-marostegui.json [14:30:22] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 4:00:00 on db1251.eqiad.wmnet with reason: Maintenance [14:30:30] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db1251 (T410531)', diff saved to https://phabricator.wikimedia.org/P86110 and previous config saved to /var/cache/conftool/dbconfig/20251128-143029-marostegui.json [14:30:35] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:31:22] 06SRE, 10envoy, 06serviceops: Upgrade Envoy to v1.35.6 - https://phabricator.wikimedia.org/T410975#11416093 (10hashar) I have updated the [[ https://integration.wikimedia.org/ci/job/helm-lint/ | helm-lint jenkins job ]]. [14:36:31] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T410531)', diff saved to https://phabricator.wikimedia.org/P86111 and previous config saved to /var/cache/conftool/dbconfig/20251128-143631-marostegui.json [14:36:37] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [14:46:10] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q2): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273 (10tappof) 03NEW [14:51:39] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P86112 and previous config saved to /var/cache/conftool/dbconfig/20251128-145138-marostegui.json [14:51:49] (03PS1) 10CDobbins: sre.loadbalancer: patch to fix reboot action [cookbooks] - 10https://gerrit.wikimedia.org/r/1212596 (https://phabricator.wikimedia.org/T395240) [14:54:31] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212584 (https://phabricator.wikimedia.org/T392775) (owner: 10Mszwarc) [14:54:59] FIRING: CertAlmostExpired: 
Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [14:55:16] 10SRE-SLO, 10Observability-Metrics, 06SRE Observability (FY2025/2026-Q2): Thanos (store|query-frontend) memcached cache in bad status - https://phabricator.wikimedia.org/T411273#11416219 (10tappof) [15:06:47] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251', diff saved to https://phabricator.wikimedia.org/P86113 and previous config saved to /var/cache/conftool/dbconfig/20251128-150646-marostegui.json [15:09:59] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:21:54] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db1251 (T410531)', diff saved to https://phabricator.wikimedia.org/P86114 and previous config saved to /var/cache/conftool/dbconfig/20251128-152153-marostegui.json [15:22:00] T410531: Drop rc_type from recentchanges in wmf production - https://phabricator.wikimedia.org/T410531 [15:22:10] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [15:29:59] (03PS1) 10Blake: alerting: Update severity of KafkaRollingRestartRequired to Task. 
[alerts] - 10https://gerrit.wikimedia.org/r/1212599 (https://phabricator.wikimedia.org/T410552) [15:29:59] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:34:59] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:29] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.12 point update - https://phabricator.wikimedia.org/T403852#11416280 (10MoritzMuehlenhoff) [16:17:01] !log Added 100 GB to /srv LV on titan1001/1002/2002 (T410152) [16:17:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:07] T410152: Disk space saturation (/srv) on Titan hosts - https://phabricator.wikimedia.org/T410152 [16:30:10] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [16:42:11] (03PS1) 10Jclark-ctr: update preseed file for wdqs1028-wdqs1032 [puppet] - 10https://gerrit.wikimedia.org/r/1212609 (https://phabricator.wikimedia.org/T410406) [16:43:36] (03CR) 10Jclark-ctr: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1212609 (https://phabricator.wikimedia.org/T410406) (owner: 10Jclark-ctr) [16:44:13] (03CR) 10Jclark-ctr: [C:03+2] update preseed file for wdqs1028-wdqs1032 [puppet] - 10https://gerrit.wikimedia.org/r/1212609 (https://phabricator.wikimedia.org/T410406) (owner: 10Jclark-ctr) [16:51:28] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/1212609 (https://phabricator.wikimedia.org/T410406) (owner: 10Jclark-ctr) [16:51:30] (03CR) 10Cathal Mooney: [C:03+2] update preseed file for wdqs1028-wdqs1032 [puppet] - 10https://gerrit.wikimedia.org/r/1212609 (https://phabricator.wikimedia.org/T410406) (owner: 10Jclark-ctr) [17:04:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS trixie [17:04:04] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1029.eqiad.wmnet with OS bookworm [17:04:19] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416402 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS trixie [17:04:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1029.eqiad.wmnet with OS bookworm [17:04:59] FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:13:02] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1030.eqiad.wmnet with OS bookworm [17:13:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking 
request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416439 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1030.eqiad.wmnet with OS bookworm [17:16:40] (03PS1) 10Bartosz Dziewoński: Api: Initialise reference variable [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212611 (https://phabricator.wikimedia.org/T411075) [17:16:52] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, December 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [core] (wmf/1.46.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1212611 (https://phabricator.wikimedia.org/T411075) (owner: 10Bartosz Dziewoński) [17:18:00] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1032.eqiad.wmnet with OS bookworm [17:18:20] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1031.eqiad.wmnet with OS bookworm [17:18:25] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1032.eqiad.wmnet with OS bookworm [17:18:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 4 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416447 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1031.eqiad.wmnet with OS bookworm [17:22:57] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [17:28:58] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1029.eqiad.wmnet with reason: host reimage [17:29:59] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory 
request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage
[17:32:29] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[17:33:58] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416478 (Jclark-ctr) I am unsure why it was failing to see the names in preseed.yaml but after separating them they seem to be imaging now some still seem to fail...
[17:37:44] !log jclark@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[17:39:14] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1030.eqiad.wmnet with reason: host reimage
[17:43:32] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wdqs1031.eqiad.wmnet with reason: host reimage
[17:48:27] !log jclark@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[17:49:27] !log jclark@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1003"
[17:49:28] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1029.eqiad.wmnet with OS bookworm
[17:49:49] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416511 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1029.eqiad.wmnet with OS bookworm completed: - wdqs1029...
[17:58:35] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1030.eqiad.wmnet with OS bookworm
[17:58:49] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416516 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1030.eqiad.wmnet with OS bookworm completed: - wdqs1030...
[18:02:50] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wdqs1031.eqiad.wmnet with OS bookworm
[18:03:10] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416520 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1031.eqiad.wmnet with OS bookworm completed: - wdqs1031...
[18:04:29] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416521 (Jclark-ctr) wdqs1029 , wdqs1030, wdqs1031. have finished. wdqs1028 and wdqs1032 are having issues still http://apt.wikimedia.org/autoinstall/presee...
[18:11:44] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:14:15] !log jclark@cumin1003 START - Cookbook sre.hosts.provision for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:14:42] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1028.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:16:20] !log jclark@cumin1003 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wdqs1032.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[18:20:59] !log jclark@cumin1003 START - Cookbook sre.hosts.reimage for host wdqs1028.eqiad.wmnet with OS bookworm
[18:21:14] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416525 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm
[18:38:15] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1032.eqiad.wmnet with OS bookworm
[18:38:35] ops-eqiad, SRE, DC-Ops, Wikidata, and 3 others: Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1032.eqiad.wmnet with OS bookworm executed with errors:...
[18:42:56] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86116 and previous config saved to /var/cache/conftool/dbconfig/20251128-184256-marostegui.json
[18:43:03] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163
[18:43:04] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164
[18:44:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[18:49:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh
[18:54:59] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[18:57:43] (PS1) Mforns: analytics::refinery::job::data_purge: Add drop-ja3n-ua-hourly job [puppet] - https://gerrit.wikimedia.org/r/1212626 (https://phabricator.wikimedia.org/T409584)
[18:58:04] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86117 and previous config saved to /var/cache/conftool/dbconfig/20251128-185803-marostegui.json
[19:13:12] !log
marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169', diff saved to https://phabricator.wikimedia.org/P86118 and previous config saved to /var/cache/conftool/dbconfig/20251128-191311-marostegui.json [19:28:19] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2169 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86119 and previous config saved to /var/cache/conftool/dbconfig/20251128-192818-marostegui.json [19:28:26] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [19:28:26] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [19:28:35] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2180.codfw.wmnet with reason: Maintenance [19:28:43] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2180 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86120 and previous config saved to /var/cache/conftool/dbconfig/20251128-192843-marostegui.json [19:30:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [19:37:20] FIRING: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:41:15] !log jclark@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs1028.eqiad.wmnet with OS bookworm [19:41:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Wikidata, and 3 others: 
Racking request for wdqs10(2[8-9]|3[0-2]) - https://phabricator.wikimedia.org/T410406#11416624 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1003 for host wdqs1028.eqiad.wmnet with OS bookworm executed with errors:... [19:42:20] RESOLVED: CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@eqiad to dnsdisc) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [20:30:10] FIRING: HelmReleaseBadStatus: Helm release mw-script/utk6lsuw on k8s@codfw in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-script - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [20:30:46] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86121 and previous config saved to /var/cache/conftool/dbconfig/20251128-203045-marostegui.json [20:30:53] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [20:30:53] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [20:39:51] FIRING: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... 
[20:39:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [20:40:10] FIRING: [2x] ConfdResourceFailed: confd resource _srv_config-master_pybal_eqiad_druid-public-coordinator.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [20:40:28] * denisse looking [20:41:58] hi [20:44:51] RESOLVED: TransitPeeringTransportOutboundSaturation: Transit, peering or transport outbound traffic above 90% capacity - cr2-eqiad:xe-3/2/1 (Transport: cr1-esams:xe-0/0/7 (Colt, ... [20:44:51] 445419311 80ms 10Gbps wave) {#2013}) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#Primary_outbound_port_utilization_over_90% - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr2-eqiad:9804 - https://alerts.wikimedia.org/?q=alertname%3DTransitPeeringTransportOutboundSaturation [20:45:53] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86122 and previous config saved to /var/cache/conftool/dbconfig/20251128-204552-marostegui.json [21:01:01] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P86123 and previous config saved to /var/cache/conftool/dbconfig/20251128-210100-marostegui.json [21:05:10] FIRING: [6x] ProbeDown: Service wdqs2008:443 has failed probes (http_wdqs_main_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:08:07] (03PS1) 10Arlolra: Deploy Parsoid Read Views to 19 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1212670 (https://phabricator.wikimedia.org/T411283) [21:16:08] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86124 and previous config saved to /var/cache/conftool/dbconfig/20251128-211608-marostegui.json [21:16:15] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [21:16:16] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [21:16:25] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2193.codfw.wmnet with reason: Maintenance [21:16:32] !log marostegui@cumin1003 dbctl commit (dc=all): 'Depooling db2193 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86125 and previous config saved to /var/cache/conftool/dbconfig/20251128-211632-marostegui.json [21:30:10] FIRING: [6x] CalicoHighMemoryUsage: Calico container calico-node-2rrk2:calico-node is consistently using three times its memory request - https://wikitech.wikimedia.org/wiki/Calico#Resource_Usage - https://alerts.wikimedia.org/?q=alertname%3DCalicoHighMemoryUsage [22:09:05] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86126 and previous config saved to /var/cache/conftool/dbconfig/20251128-220904-marostegui.json [22:09:12] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:09:13] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:24:13] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to 
https://phabricator.wikimedia.org/P86127 and previous config saved to /var/cache/conftool/dbconfig/20251128-222412-marostegui.json [22:39:20] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193', diff saved to https://phabricator.wikimedia.org/P86128 and previous config saved to /var/cache/conftool/dbconfig/20251128-223920-marostegui.json [22:54:28] !log marostegui@cumin1003 dbctl commit (dc=all): 'Repooling after maintenance db2193 (T411163 T411164)', diff saved to https://phabricator.wikimedia.org/P86129 and previous config saved to /var/cache/conftool/dbconfig/20251128-225427-marostegui.json [22:54:36] T411163: Drop ar_sha1 from archive table in wmf production - https://phabricator.wikimedia.org/T411163 [22:54:37] T411164: Drop rev_sha1 from revision table in wmf production - https://phabricator.wikimedia.org/T411164 [22:54:44] !log marostegui@cumin1003 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2197.codfw.wmnet with reason: Maintenance [22:55:10] FIRING: CertAlmostExpired: Certificate for service data-gateway-staging:30443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#data-gateway-staging:30443 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:21:16] PROBLEM - snapshot of s2 in eqiad on backupmon1001 is CRITICAL: Last snapshot for s2 at eqiad (db1225) taken on 2025-11-28 22:23:32 is 869 GiB, but the previous one was 1037 GiB, a change of -16.3 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [23:30:10] FIRING: [4x] PuppetCertificateAboutToExpire: Puppet CA certificate default-staging-certificate.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire