[00:01:26] RECOVERY - Check systemd state on phab2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:07:35] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (phaultfinder) [01:09:46] SRE-swift-storage, Commons, Wikimedia-production-error: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (tstarling) ms-fe1009 and ms-fe2009 account for almost all errors, according to the swift_proxy_server_er... [01:29:06] (CR) Aaron Schulz: [C: +1] Temporarily disable xenon/excimer for switch maintenance [mediawiki-config] - https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165) (owner: Tim Starling) [01:50:59] SRE, Infrastructure-Foundations, vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (Stevemunene) [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T0200) [02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:07:39] (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.1 [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/900722 (https://phabricator.wikimedia.org/T330207) [02:07:45] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.1 [core] (wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/900722 (https://phabricator.wikimedia.org/T330207) (owner: TrainBranchBot) [02:22:57] (Merged) jenkins-bot: Branch commit for wmf/1.41.0-wmf.1 [core]
(wmf/1.41.0-wmf.1) - https://gerrit.wikimedia.org/r/900722 (https://phabricator.wikimedia.org/T330207) (owner: TrainBranchBot) [02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:05] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T0300) [03:01:25] (PS1) TrainBranchBot: testwikis wikis to 1.41.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/901327 (https://phabricator.wikimedia.org/T330207) [03:01:27] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.41.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/901327 (https://phabricator.wikimedia.org/T330207) (owner: TrainBranchBot) [03:02:12] (Merged) jenkins-bot: testwikis wikis to 1.41.0-wmf.1 [mediawiki-config] - https://gerrit.wikimedia.org/r/901327 (https://phabricator.wikimedia.org/T330207) (owner: TrainBranchBot) [03:02:38] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.1 refs T330207 [03:02:44] T330207: 1.41.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T330207 [03:55:16] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.1 refs T330207 (duration: 52m 38s) [03:55:22] T330207: 1.41.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T330207 [03:57:36] !log mwpresync@deploy2002 Pruned MediaWiki: 1.40.0-wmf.26 (duration: 02m 18s) [04:31:38] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 5 (krb2002, ...), Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:50:03] (PS4) Legoktm: Add to verify Mastodon account on mediawiki.org [mediawiki-config]
- https://gerrit.wikimedia.org/r/896837 [04:50:16] (CR) Legoktm: Add to verify Mastodon account on mediawiki.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/896837 (owner: Legoktm) [05:03:52] (PS14) KartikMistry: WIP: Add new self hosted machinetranslation service (MinT) [deployment-charts] - https://gerrit.wikimedia.org/r/897634 (https://phabricator.wikimedia.org/T331505) [05:50:22] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Tgr) p:Triage→High [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T0600) [06:00:05] kormat, marostegui, and Amir1: Time to snap out of that daydream and deploy Primary database switchover. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T0600). [06:02:46] SRE, ops-eqiad, DBA: Degraded RAID on db1154 - https://phabricator.wikimedia.org/T332649 (Marostegui) p:Triage→Medium @wiki_willy I don't think this host is in under guarantee, however it is an important host for us. Any chances we can get a new (or spare) disk for it? I think it is meant to... [06:12:05] SRE, ops-codfw, DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (Marostegui) Data check finished successfully.
The host is ready to be repooled once we've figured out why this happened [06:34:05] (PS8) Winston Sung: SiteMatrix config: Add actual (non-deprecated) language code for deprecated language codes [mediawiki-config] - https://gerrit.wikimedia.org/r/884494 (https://phabricator.wikimedia.org/T172035) [06:49:01] SRE, Traffic-Icebox, WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186 (Tgr) >>! In T74186#8103651, @tstarling wrote: > Note that I'm baking this bug into the ATS config in [[https://gerrit.wikimedia.org/r/c/op... [06:49:34] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 13150 [06:50:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 13150 [06:54:10] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Tgr) All three legs are [[https://gerrit.wikimedia.org/r/plugins/gitiles/o... [07:00:04] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T0700). [07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:03:43] yeah, nothing to deploy apparently [07:09:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:14:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:16:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2010:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:19:04] (PS1) Gergő Tisza: multi-dc: Use primary for OAuth for both URL forms [puppet] - https://gerrit.wikimedia.org/r/901333 (https://phabricator.wikimedia.org/T313578) [07:21:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2010:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:24:20] (PS12) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [07:24:55] (CR) Elukey: services: add the first lift wing stream to change-prop (1 comment) [deployment-charts] -
https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: Elukey) [07:29:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:32:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (4) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:34:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:35:45] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:37:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (5) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly -
https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:37:45] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (4) Blazegraph instance wdqs1005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:37:56] (CR) Elukey: [C: +2] ml-services: new revert-risk multilingual model and image [deployment-charts] - https://gerrit.wikimedia.org/r/901308 (https://phabricator.wikimedia.org/T332392) (owner: AikoChou) [07:38:54] (CR) Elukey: [C: +2] install_server: update netboot config for kafka-main nodes [puppet] - https://gerrit.wikimedia.org/r/901239 (https://phabricator.wikimedia.org/T332013) (owner: Elukey) [07:40:30] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:44:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:49:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1009:9194 is burning free allocators at a very high rate -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:54:37] (PS1) DCausse: blazegraph monitor free allocators burn rate only on wdqs & wcqs [alerts] - https://gerrit.wikimedia.org/r/901506 [07:54:59] (PS2) DCausse: blazegraph: monitor free allocators burn rate only on wdqs & wcqs [alerts] - https://gerrit.wikimedia.org/r/901506 [07:55:44] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:55:54] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:15] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1011:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:56:16] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:30] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:56:50] PROBLEM - Check systemd state on cephosd1003 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1003.service
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:56] SRE-Sprint-Week-Sustainability-March2023, Data-Persistence (work done), serviceops, Datacenter-Switchover, Sustainability (Incident Followup): Globalize mwconfig ReadOnly - https://phabricator.wikimedia.org/T330304 (Joe) I think there might be valid reasons to have one datacenter read-only gl... [07:58:30] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (5) Blazegraph instance wdqs1013:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [07:59:08] (CR) DCausse: [C: +2] blazegraph: monitor free allocators burn rate only on wdqs & wcqs [alerts] - https://gerrit.wikimedia.org/r/901506 (owner: DCausse) [08:00:28] (Merged) jenkins-bot: blazegraph: monitor free allocators burn rate only on wdqs & wcqs [alerts] - https://gerrit.wikimedia.org/r/901506 (owner: DCausse) [08:00:30] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1011:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:00:34] SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (Joe) [08:02:45] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1013:9194 is burning free allocators at a very high rate -
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:04:05] SRE-Sprint-Week-Sustainability-March2023, ChangeProp, envoy, serviceops, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for Envoy - https://phabricator.wikimedia.org/T304799 (Joe) I'm not sure this is really actionable without any number attached. We alr... [08:07:07] SRE-Sprint-Week-Sustainability-March2023, ChangeProp, Platform Engineering, Sustainability (Incident Followup): Investigate shorter-lived persistent connections for change-propagation - https://phabricator.wikimedia.org/T304799 (Joe) [08:07:30] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:12:48] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:13:12] PROBLEM - Check systemd state on cephosd1004 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1004.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:28] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:13:39]
SRE-Sprint-Week-Sustainability-March2023, serviceops, Kubernetes, Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (Joe) @akosiaris anything left to do for this task? I would assume you... [08:18:00] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:18:27] SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) I see...
[08:18:32] PROBLEM - Check systemd state on cephosd1005 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1005.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:10] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:18] (PS1) Hashar: systemd::timer::job: space up sections in email body [puppet] - https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) [08:19:49] (CR) Hashar: "I tend to like newlines spacing between sections to ease my brain processing :]" [puppet] - https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) (owner: Hashar) [08:20:02] (CR) Hashar: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) (owner: Hashar) [08:21:54] (CR) CI reject: [V: -1] systemd::timer::job: space up sections in email body [puppet] - https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) (owner: Hashar) [08:22:55] SRE-Sprint-Week-Sustainability-March2023, ChangeProp, serviceops, Sustainability (Incident Followup): Ensure Changeprop is disabled when the databases are in read only mode - https://phabricator.wikimedia.org/T281240 (Joe) [08:23:47] SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Joe) [08:24:26] SRE-Sprint-Week-Sustainability-March2023, WMF-JobQueue, serviceops, Patch-For-Review, Sustainability (Incident Followup): Have some dedicated jobrunners that aren't active videoscalers - https://phabricator.wikimedia.org/T279100 (Joe) Open→Resolved a:Joe [08:26:27]
SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (elukey) Requested the creation of `operations/debs/cqlsh4 ` in https://www.mediawiki.org/wiki/Gerrit/New_repositories/Requests [08:28:20] (CR) Elukey: [C: +2] profile::cache::purge: move purged to a new CA bundle [puppet] - https://gerrit.wikimedia.org/r/901118 (https://phabricator.wikimedia.org/T319372) (owner: Elukey) [08:29:51] SRE-Sprint-Week-Sustainability-March2023, TimedMediaHandler-Transcode, WMF-JobQueue, serviceops, Sustainability (Incident Followup): Add rate limiting to the jobqueue vidoscalers to prevent overloads - https://phabricator.wikimedia.org/T278945 (Joe) Given videoscaling happens asynchronously o... [08:31:47] !log move purged daemons on cp nodes to a new CA bundle (to allow accepting kafka clients using PKI tls certs) - T319372 [08:31:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:53] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [08:34:42] SRE-Sprint-Week-Sustainability-March2023, conftool, serviceops-radar, Sustainability (Incident Followup): depool / confctl commands should print warnings or errors if too many nodes from that service are already depooled - https://phabricator.wikimedia.org/T245059 (Joe) p:Medium→Low This...
[08:35:33] (PS2) Hashar: systemd::timer::job: space up sections in email body [puppet] - https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) [08:39:16] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:40] (CR) Hashar: [C: +1] Allow E_DEPRECATED logs to be shown on php-fpm in doc machines (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900369 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [08:41:36] SRE-Sprint-Week-Sustainability-March2023, Platform Engineering Roadmap Decision Making, Traffic, serviceops, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Joe) Open→Declined The task was more or less refused by the owners of the subs... [08:43:52] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:44:13] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (elukey) @MoritzMuehlenhoff given how simple this use case is, I'd just avoid to keep track of the whole cassandra upstream branch in the new repo, to just have one main branch with the debian con... [08:44:58] PROBLEM - Check systemd state on cephosd1001 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1001.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:32] PROBLEM - Check systemd state on cephosd1002 is CRITICAL: CRITICAL - degraded: The following units failed: ceph-mgr@cephosd1002.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:03:39] (CR) Hashar: [C: +1] "I think we will want to turn ECS logging to have the access logs searchable.
In `modules/profile/files/doc/httpd-doc.wikimedia.org.conf`:" [puppet] - https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney) [09:04:45] SRE-Sprint-Week-Sustainability-March2023, Thumbor, serviceops, Sustainability (Incident Followup): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (Joe) [09:05:41] SRE-Sprint-Week-Sustainability-March2023, Thumbor, serviceops, Sustainability (Incident Followup): Reverse proxy supporting XFF-based per-IP concurrency limit and request queueing - https://phabricator.wikimedia.org/T252749 (Joe) Open→Declined While this task is definitely too big for spr... [09:05:51] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC [09:05:52] (CR) Hashar: "Puppet compiler for basic hosts is a noop: https://puppet-compiler.wmflabs.org/output/901536/1671/" [puppet] - https://gerrit.wikimedia.org/r/901536 (https://phabricator.wikimedia.org/T330120) (owner: Hashar) [09:06:06] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cephosd[1001-1005].eqiad.wmnet with reason: Systemd units failing, puppet tries to bring them up periodically, spam on IRC [09:07:19] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (phaultfinder) [09:08:35] SRE-Sprint-Week-Sustainability-March2023, observability, serviceops, Sustainability (Incident Followup), User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (Joe) I think that with the new structure we've put in place for mcrouter we don't...
[09:08:45] SRE-Sprint-Week-Sustainability-March2023, observability, serviceops, Sustainability (Incident Followup), User-jijiki: add monitoring of sustained memcached TKO rates - https://phabricator.wikimedia.org/T253384 (Joe) Open→Declined [09:08:51] SRE, serviceops, Patch-For-Review, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Joe) [09:09:30] RECOVERY - Check systemd state on cephosd1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:52] SRE-Sprint-Week-Sustainability-March2023, Deployments, serviceops-radar, Release-Engineering-Team (Radar), and 2 others: Remove provisioning for 'mwscript', 'foreachwikiindblist' etc from deployment host - https://phabricator.wikimedia.org/T253822 (Joe) [09:11:36] SRE-Sprint-Week-Sustainability-March2023, serviceops, Sustainability (Incident Followup): mcrouter memcached flapping in gutter pool - https://phabricator.wikimedia.org/T255511 (Joe) Open→Resolved a:Joe I think this task was completed. Feel free to reopen if that's not the case. [09:11:38] SRE, serviceops, Performance-Team (Radar), Sustainability (Incident Followup): Upgrade and improve our application object caching service (memcached) - https://phabricator.wikimedia.org/T244852 (Joe) [09:12:18] ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (phaultfinder) [09:13:28] SRE-Sprint-Week-Sustainability-March2023, ChangeProp, serviceops, Kubernetes, Sustainability (Incident Followup): Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (Joe) [09:15:28] SRE, Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (MoritzMuehlenhoff) >>!
In T310980#8712952, @elukey wrote: > @MoritzMuehlenhoff given how simple this use case is, I'd just avoid to keep track of the whole cassandra upstream branch in the new re... [09:16:19] SRE-Sprint-Week-Sustainability-March2023, DynamicPageList (Wikimedia), serviceops-radar, Patch-For-Review, and 7 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (Joe) [09:17:00] (PS2) Vgutierrez: haproxy: Allow specifying maxconn per backend [puppet] - https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) [09:18:12] RECOVERY - Check systemd state on cephosd1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:18:35] SRE-Sprint-Week-Sustainability-March2023, Phabricator, serviceops-collab, serviceops-radar, and 2 others: Phabricator: Unable to view tasks in DB read-only mode - https://phabricator.wikimedia.org/T313879 (Joe) [09:18:53] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40225/console" [puppet] - https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: Vgutierrez) [09:19:02] SRE, Traffic-Icebox: Create a second text-lb IP address for test purposes - https://phabricator.wikimedia.org/T237492 (Aklapper) a:BBlack→None @BBlack: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023....
[09:20:33] SRE-Sprint-Week-Sustainability-March2023, observability, serviceops, Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (Joe) [09:20:47] SRE, MediaWiki-extensions-CodeReview, Platform Engineering, serviceops-radar: Make an HTML dump of the output of the CodeReview extension on MediaWiki.org - https://phabricator.wikimedia.org/T205361 (Aklapper) a:Legoktm→None @Legoktm: Removing task assignee as this open task has been assi... [09:21:51] (PS3) Vgutierrez: haproxy: Allow specifying maxconn per backend [puppet] - https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) [09:22:13] SRE, Phabricator, Traffic: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Aklapper) a:mmodell→None @mmodell: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assi... [09:23:04] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40226/console" [puppet] - https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: Vgutierrez) [09:23:09] SRE-OnFire, Beta-Cluster-Infrastructure: Add basic alerting to the Beta Cluster - https://phabricator.wikimedia.org/T315695 (Joe) Incident followup tags should be reserved for production issues.
[09:24:45] (03PS4) 10Vgutierrez: haproxy: Allow specifying maxconn per backend [puppet] - 10https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) [09:24:48] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Cassandra, 10Sustainability (Incident Followup): Document best-practice for hinted-handoff - https://phabricator.wikimedia.org/T315517 (10Volans) [09:24:54] (03CR) 10Vgutierrez: haproxy: Allow specifying maxconn per backend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [09:25:23] !log phedenskog@deploy2002 Started deploy [performance/navtiming@d2b97ad]: (no justification provided) [09:25:29] !log phedenskog@deploy2002 Finished deploy [performance/navtiming@d2b97ad]: (no justification provided) (duration: 00m 06s) [09:30:08] (03CR) 10Jaime Nuche: [C: 04-1] devtools common.yaml: Set profile::mediawiki::scap_client::is_master to false (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901309 (owner: 10Ahmon Dancy) [09:31:36] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10Volans) [09:31:42] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Orchestrator, and 2 others: Puppet host certs do not contain Subject Alt Name entries - https://phabricator.wikimedia.org/T273637 (10Aklapper) a:05jbond→03None @jbond: Removing task assignee as this open task has been assigned for more than two years - S... [09:31:48] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: Review puppetmaster SSL configuration - https://phabricator.wikimedia.org/T268040 (10Aklapper) a:05jbond→03None @jbond: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee o... 
[09:31:58] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10User-jbond: in puppet 6 some core types have been moved to external modules. check and confirm our exposure - https://phabricator.wikimedia.org/T265143 (10Aklapper) a:05jbond→03None @jbond: Removing task assignee as this open task has been assigned for... [09:32:04] 10Puppet, 10Infrastructure-Foundations, 10Patch-Needs-Improvement, 10User-jbond: Refactor puppet-merge - https://phabricator.wikimedia.org/T254249 (10Aklapper) a:05jbond→03None @jbond: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assign... [09:32:24] (03CR) 10Tim Starling: [C: 03+1] "Note for deployment: restart varnish after puppet finishes." [puppet] - 10https://gerrit.wikimedia.org/r/901333 (https://phabricator.wikimedia.org/T313578) (owner: 10Gergő Tisza) [09:32:44] (03CR) 10Tim Starling: [C: 03+1] multi-dc: Use primary for OAuth for both URL forms (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901333 (https://phabricator.wikimedia.org/T313578) (owner: 10Gergő Tisza) [09:34:03] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/900635 (owner: 10Volans) [09:34:12] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Sustainability (Incident Followup): create a sampled log of POST data - https://phabricator.wikimedia.org/T309186 (10Volans) [09:34:17] 10SRE-OnFire, 10Beta-Cluster-Infrastructure, 10Release-Engineering-Team, 10User-Ryasmeen, 10Wikimedia-Incident: Beta cluster Error: 502, Next Hop Connection Failed - https://phabricator.wikimedia.org/T315350 (10taavi) [09:34:48] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/900636 (owner: 10Volans) [09:34:52] 10SRE, 10MW-on-K8s, 10Shellbox, 10serviceops, 10Platform Team Workboards (Purple): Make Shellbox actually do streaming - https://phabricator.wikimedia.org/T268427 (10Aklapper) a:05tstarling→03None 
@tstarling: Removing task assignee as this open task has been assigned for more than two years - See the... [09:34:58] 10SRE-Sprint-Week-Sustainability-March2023, 10PyBal, 10Traffic, 10Traffic-Icebox, 10Sustainability (Incident Followup): Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10Vgutierrez) [09:35:06] (03CR) 10Jbond: [C: 03+1] tox: make config compatible with tox 4.x [software/spicerack] - 10https://gerrit.wikimedia.org/r/900637 (owner: 10Volans) [09:35:08] 10SRE-Sprint-Week-Sustainability-March2023, 10PyBal, 10Traffic, 10Traffic-Icebox, 10Sustainability (Incident Followup): Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (10Vgutierrez) [09:35:30] 10SRE, 10SRE-Sprint-Week-Sustainability-March2023, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10Joe) [09:35:38] 10SRE, 10PyBal, 10Traffic-Icebox: pybal's "can-depool" logic only takes downServers into account - https://phabricator.wikimedia.org/T184715 (10Vgutierrez) [09:37:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8711755, @Papaul wrote: > @cmooney Please see first batch proposal. We can mov... [09:38:34] 10SRE, 10Observability-Alerting: Two close pages for idle workers api + appserver didn't auto-resolve on recovery - https://phabricator.wikimedia.org/T266570 (10Aklapper) a:05herron→03None @herron: Removing task assignee as this open task has been assigned for more than two years - See the email sent to ta... 
[09:38:48] 10SRE, 10Traffic: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (10Aklapper) a:05ssingh→03None @ssingh: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on Feb... [09:39:30] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, attempt to reimage [09:39:43] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, attempt to reimage [09:40:04] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Patch-For-Review, 10Sustainability (Incident Followup): Modernize etcd tlsproxy certificate management - https://phabricator.wikimedia.org/T307382 (10jbond) >>! In T307382#8708705, @Joe wrote: > I think there is a larger topic of moving etcd to us... [09:43:03] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1005.eqiad.wmnet with OS bullseye [09:43:17] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501 (10Volans) [09:43:27] 10SRE-OnFire (FY2021/2022-Q3), 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence (work done), 10Platform Engineering, and 2 others: Incident: 2022-03-10 MediaWiki availability affected due to a database query processing slowdown affecting most of the r... 
- https://phabricator.wikimedia.org/T303499 [09:43:47] 10SRE: move tunnelencabulator's repo to a Wikimedia-owned space - https://phabricator.wikimedia.org/T266783 (10Aklapper) a:05CDanis→03None @CDanis: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023. Please assign th... [09:44:00] 10SRE, 10wmf-sre-laptop: distribute tunnelencabulator in wmf-sre-laptop - https://phabricator.wikimedia.org/T266784 (10Aklapper) a:05CDanis→03None @CDanis: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023. Please... [09:44:16] 10SRE, 10Traffic, 10Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (10Vgutierrez) [09:44:33] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability (Incident Followup): Puppet doesn't restart ferm on failure - https://phabricator.wikimedia.org/T206951 (10Vgutierrez) 05Open→03Resolved a:03jbond This is actually already fixed by https://gerrit.wikimedia.org/r/c/operations/puppet... [09:44:51] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10Volans) Is there any concrete actionable here? [09:46:40] 10SRE: Updated java security policy in OpenJDK 8 u265 - https://phabricator.wikimedia.org/T261196 (10Aklapper) a:05MoritzMuehlenhoff→03None @MoritzMuehlenhoff: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on February 22nd, 2023. Plea...
[09:46:42] 10Puppet, 10Infrastructure Security, 10Infrastructure-Foundations, 10User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10Aklapper) a:05MoritzMuehlenhoff→03None @MoritzMuehlenhoff: Removing task assignee as this open task has been assign... [09:47:17] 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10cmooney) [09:50:57] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Cassandra, 10Data-Persistence, 10Sustainability (Incident Followup): Document best-practice for hinted-handoff - https://phabricator.wikimedia.org/T315517 (10Volans) [09:51:22] 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8713630, @aborrero wrote: >>>! In T327919#8711755, @Papaul wrote: >> @cmooney P... [09:52:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [09:53:48] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Observability-Alerting: Migrate Foundations Prometheus alerts to AlertManager - https://phabricator.wikimedia.org/T294564 (10jbond) [09:56:46] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10envoy, 10serviceops, 10Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (10Vgutierrez)... 
[09:56:52] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (10Stevemunene) a:03Stevemunene [09:58:26] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 10Data-Persistence, 10Sustainability (Incident Followup): 2022-08-24 swift incident (tracking) - https://phabricator.wikimedia.org/T317358 (10MatthewVernon) "Write up an incident report", perhaps? [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1000) [10:00:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:00:50] (03PS1) 10Btullis: Omit anaconda-wmf packages from bullseye onwards [puppet] - 10https://gerrit.wikimedia.org/r/901543 (https://phabricator.wikimedia.org/T329363) [10:02:25] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40227/console" [puppet] - 10https://gerrit.wikimedia.org/r/901543 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [10:03:52] (03PS1) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:04:39] (03CR) 10JMeybohm: "Naming nit. 
LGTM otherwise" [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 (owner: 10Hnowlan) [10:05:52] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:08:34] (03PS2) 10Hnowlan: changeprop: allow setting strategy [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 [10:08:40] (03CR) 10Hnowlan: changeprop: allow setting strategy (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/901246 (owner: 10Hnowlan) [10:09:29] (03CR) 10Btullis: [C: 03+1] spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [10:10:23] (03CR) 10AikoChou: "Thank you for working on this, Luca!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [10:10:31] (03CR) 10AikoChou: [C: 03+1] services: add the first lift wing stream to change-prop [deployment-charts] - 10https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) (owner: 10Elukey) [10:12:13] (03PS2) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:12:46] RECOVERY - Check systemd state on cephosd1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:13:05] (03CR) 10Nicolas Fraison: [C: 03+2] spark-operator: enable spark operator mutation webhook [deployment-charts] - 10https://gerrit.wikimedia.org/r/897895 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [10:13:10] (03PS1) 10Jbond: P:monitoring: add docs and fix liniting errors [puppet] - 10https://gerrit.wikimedia.org/r/901546 (https://phabricator.wikimedia.org/T294564) [10:13:18] (03CR) 10Nicolas Fraison: [C: 03+2] spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 
10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [10:13:27] (03CR) 10CI reject: [V: 04-1] spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [10:13:31] (03PS7) 10Nicolas Fraison: spark: Allow communication from spark pods to HDFS/Hive [deployment-charts] - 10https://gerrit.wikimedia.org/r/899630 (https://phabricator.wikimedia.org/T331859) [10:13:32] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi I'm boldly resolving t... [10:14:12] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:14:15] (03PS2) 10Giuseppe Lavagetto: graphite::alerts: add alert on mediawiki account creation failures [puppet] - 10https://gerrit.wikimedia.org/r/901233 (https://phabricator.wikimedia.org/T146090) [10:14:22] !log joal@deploy2002 Started deploy [analytics/refinery@0bb61e9]: Regular analytics weekly train [analytics/refinery@0bb61e9] [10:16:00] (03CR) 10Jbond: [C: 03+2] P:monitoring: add docs and fix liniting errors [puppet] - 10https://gerrit.wikimedia.org/r/901546 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [10:17:03] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40229/console" [puppet] - 10https://gerrit.wikimedia.org/r/901233 (https://phabricator.wikimedia.org/T146090) (owner: 10Giuseppe Lavagetto) [10:17:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:17:51] (03PS3) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:19:07] (03CR) 10Giuseppe Lavagetto: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40230/console" [puppet] - 10https://gerrit.wikimedia.org/r/901233 (https://phabricator.wikimedia.org/T146090) (owner: 10Giuseppe Lavagetto) [10:19:45] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:20:29] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [10:20:35] (03PS4) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:20:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [10:21:53] (03PS1) 10Elukey: profile::kafka::mirror: default to use pki migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) [10:21:59] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] graphite::alerts: add alert on mediawiki account creation failures [puppet] - 10https://gerrit.wikimedia.org/r/901233 (https://phabricator.wikimedia.org/T146090) (owner: 10Giuseppe Lavagetto) [10:22:10] !log joal@deploy2002 Finished deploy [analytics/refinery@0bb61e9]: Regular analytics weekly train [analytics/refinery@0bb61e9] (duration: 07m 48s) [10:22:17] 
(KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [10:22:21] (03CR) 10Giuseppe Lavagetto: [V: 03+1 C: 03+2] graphite::alerts: add alert on mediawiki account creation failures (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901233 (https://phabricator.wikimedia.org/T146090) (owner: 10Giuseppe Lavagetto) [10:22:27] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:22:33] !log joal@deploy2002 Started deploy [analytics/refinery@0bb61e9] (thin): Regular analytics weekly train THIN [analytics/refinery@0bb61e9] [10:22:43] !log joal@deploy2002 Finished deploy [analytics/refinery@0bb61e9] (thin): Regular analytics weekly train THIN [analytics/refinery@0bb61e9] (duration: 00m 09s) [10:22:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WCQS_Streaming_Updater in eqiad (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [10:22:56] !log joal@deploy2002 Started deploy [analytics/refinery@0bb61e9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0bb61e9] [10:23:57] (03PS2) 10Kosta Harlan: GrowthExperiments: Enable Leveling Up features on pilot wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901144 (https://phabricator.wikimedia.org/T330358) [10:24:18] (03CR) 10Volans: [C: 03+2] setup.py: force dnspython from Bullseye [software/spicerack] - 10https://gerrit.wikimedia.org/r/900635 (owner: 10Volans) 
[10:24:26] !log joal@deploy2002 Finished deploy [analytics/refinery@0bb61e9] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@0bb61e9] (duration: 01m 30s) [10:26:10] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10MediaWiki-libs-Rdbms, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Fix mediawiki heartbeat model, change pt-heartbeat model to not use super-user, avoid SPOF and switch automatically to the... - https://phabricator.wikimedia.org/T172497 [10:26:30] (03PS2) 10Elukey: profile::kafka::mirror: default to use pki migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) [10:26:48] (03CR) 10Volans: [C: 03+2] service: improve check_dns_state validation check [software/spicerack] - 10https://gerrit.wikimedia.org/r/900636 (owner: 10Volans) [10:26:53] (03CR) 10Volans: [C: 03+2] tox: make config compatible with tox 4.x [software/spicerack] - 10https://gerrit.wikimedia.org/r/900637 (owner: 10Volans) [10:28:34] (03Merged) 10jenkins-bot: setup.py: force dnspython from Bullseye [software/spicerack] - 10https://gerrit.wikimedia.org/r/900635 (owner: 10Volans) [10:29:00] (03CR) 10Effie Mouzeli: [C: 03+1] maps: remove tilerator and cassandra (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [10:30:34] (03Merged) 10jenkins-bot: service: improve check_dns_state validation check [software/spicerack] - 10https://gerrit.wikimedia.org/r/900636 (owner: 10Volans) [10:30:48] (03Merged) 10jenkins-bot: tox: make config compatible with tox 4.x [software/spicerack] - 10https://gerrit.wikimedia.org/r/900637 (owner: 10Volans) [10:31:39] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10Sustainability (Incident Followup): LVS should handle losing a NIC on eqiad and codfw - https://phabricator.wikimedia.org/T286924 (10Volans) [10:31:42] (03CR) 10Vgutierrez: [C: 03+1] traffic: remove EdgeTrafficDrop 
[alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:32:03] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [10:32:25] (03CR) 10Filippo Giunchedi: [C: 03+2] traffic: remove EdgeTrafficDrop [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [10:32:29] (03PS3) 10Filippo Giunchedi: traffic: remove EdgeTrafficDrop [alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) [10:32:58] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Sustainability (Incident Followup): Improve automatic query killer under high load - https://phabricator.wikimedia.org/T293532 (10Volans) [10:33:35] (03CR) 10Vgutierrez: [C: 03+2] haproxy: Allow specifying maxconn per backend [puppet] - 10https://gerrit.wikimedia.org/r/901238 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [10:33:52] (03PS3) 10Elukey: profile::kafka::mirror: default to use pki migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) [10:34:27] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Patch-For-Review, 10Sustainability (Incident Followup): followups to unactionable NELHigh pages due to Telecom Italia outage, 2023-02-05 - https://phabricator.wikimedia.org/T328941 (10Volans) a:03Volans [10:35:36] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40233/console" [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [10:35:42] (03PS5) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:36:17] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic, 10Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - 
https://phabricator.wikimedia.org/T330682 (10Vgutierrez) a:03Vgutierrez [10:37:03] 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops, 10Sustainability (Incident Followup): High failure rate of account creation should trigger an alarm / page people - https://phabricator.wikimedia.org/T146090 (10Joe) 05Open→03Resolved [10:37:15] 10SRE, 10observability, 10Epic, 10Release-Engineering-Team (Radar), 10Sustainability (Incident Followup): Tracking: Monitoring and alerts for "business" metrics - https://phabricator.wikimedia.org/T140942 (10Joe) [10:37:27] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE-swift-storage, 10Commons, 10Data-Persistence, and 7 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (10Volans) [10:37:45] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:39:14] (03PS6) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:39:32] (03PS4) 10Elukey: profile::kafka::mirror: default to use pki migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) [10:39:34] (03PS1) 10Elukey: role::kafka::jumbo::broker: enable PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) [10:41:26] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:42:11] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:42:18] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Volans) a:03Volans [10:43:30] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[10:43:42] (03PS5) 10Elukey: profile::kafka::mirror: default to use pki migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) [10:43:44] (03PS2) 10Elukey: role::kafka::jumbo::broker: enable PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) [10:44:52] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40234/console" [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [10:47:15] (03PS7) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [10:47:21] (03PS1) 10Filippo Giunchedi: prometheus: remove k8s cache not updating alert [puppet] - 10https://gerrit.wikimedia.org/r/901550 (https://phabricator.wikimedia.org/T327792) [10:48:48] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40235/console" [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [10:49:26] (03CR) 10Elukey: role::kafka::jumbo::broker: enable PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [10:49:34] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:52:41] 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops, 10Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (10Joe) a:03Joe This task is so sparse, and so much time has passed, that I'm not sure what the point is h... 
[10:53:58] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:58:18] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [10:58:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PATCH nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:00:53] (03PS1) 10Elukey: role::kafka::main: deploy PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) [11:02:03] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40236/console" [puppet] - 10https://gerrit.wikimedia.org/r/901551 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [11:04:24] !log joal@deploy2002 Started deploy [airflow-dags/analytics@42e862b]: Regular analytics weekly train [airflow-dags/analytics@42e862b] [11:04:35] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@42e862b]: Regular analytics weekly train [airflow-dags/analytics@42e862b] (duration: 00m 11s) [11:05:40] (03PS1) 10Volans: HaproxyUnavailable: add link to runbook [alerts] - 10https://gerrit.wikimedia.org/r/901552 (https://phabricator.wikimedia.org/T310933) [11:05:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:06:00] !log Kill mediawiki_denormalize oozie job 
[11:06:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:21] !log Unpause mediawiki_history_denormalize airflow job [11:07:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:29] (03PS1) 10Samwilson: Remove WikiEditor's Realtime Preview config vars [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) [11:08:38] !log Kill mediacounts_load oozie job [11:08:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:18] !log Unpause mediacounts_load airflow job with start_date set to 2023-03-21T10:00 [11:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:58] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:12:17] (03PS1) 10Nicolas Fraison: dse-k8s: authorize pods to connect to pki.discovery.wmnet:8443 [puppet] - 10https://gerrit.wikimedia.org/r/901554 [11:18:00] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Emergency response to logstash being backlogged - https://phabricator.wikimedia.org/T233735 (10Volans) [11:18:52] 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Metrics, 10Traffic-Icebox, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Volans) [11:19:09] 10SRE, 10observability, 10Patch-For-Review, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 (10BTullis) Five years on, I'm looking forward to suggesting that we add `atop` back to the 
fleet (or at least parts of it), now that the `-R` option has been removed by de... [11:20:53] !log aikochou@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [11:21:31] 10SRE-Sprint-Week-Sustainability-March2023, 10Traffic-Icebox, 10observability, 10Sustainability (Incident Followup): Per-country Frontend Traffic dashboards - https://phabricator.wikimedia.org/T286554 (10Volans) 05Open→03Resolved a:03Volans Boldly resolving as this need is currently satisfied by the... [11:22:25] 10SRE-Sprint-Week-Sustainability-March2023, 10Platform Engineering, 10serviceops-radar, 10Sustainability (Incident Followup): Adopt SLIs / SLOs for sessionstore - https://phabricator.wikimedia.org/T256629 (10Volans) [11:23:16] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [11:23:45] 10SRE-Sprint-Week-Sustainability-March2023, 10MediaWiki-General, 10SRE Observability, 10Wikimedia-Logstash, and 2 others: MediaWiki log spam during row D blip / rack D2 unavailable - https://phabricator.wikimedia.org/T233739 (10Volans) [11:25:13] (03PS1) 10Jbond: hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) [11:26:01] (03PS1) 10Hnowlan: imagemagick: rename self.exif to self.exif_dict [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901558 (https://phabricator.wikimedia.org/T331995) [11:26:25] (03CR) 10CI reject: [V: 04-1] hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [11:26:35] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache-htcacheclean.service,wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:27:29] (03PS1) 10Btullis: Add 
the python-is-python3 package to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) [11:29:02] (03PS2) 10Btullis: Add the python-is-python3 package to bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) [11:29:43] (03PS2) 10Jbond: hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) [11:31:42] (03CR) 10CI reject: [V: 04-1] hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [11:35:09] (03CR) 10Btullis: [C: 03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/901554 (owner: 10Nicolas Fraison) [11:37:11] RECOVERY - Check systemd state on cephosd1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:35] (03PS1) 10Nicolas Fraison: hadoop: Authorize access from dse k8s pods to hdfs and hive-metastore [puppet] - 10https://gerrit.wikimedia.org/r/901561 (https://phabricator.wikimedia.org/T331859) [11:37:39] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Platform Engineering, 10Sustainability (Incident Followup): Improve slow read query handling - https://phabricator.wikimedia.org/T293530 (10Volans) [11:39:11] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Platform Engineering, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Alert when auto-increment fields on any MW-related databases reach a threshold - https://phabricator.wikimedia.org/T291332 (10Volans) [11:39:19] (03PS2) 10Nicolas Fraison: hadoop: Authorize access from dse k8s pods to hdfs and hive-metastore test [puppet] - 10https://gerrit.wikimedia.org/r/901561 (https://phabricator.wikimedia.org/T331859) [11:39:21] (03PS1) 10Nicolas Fraison: hadoop: Authorize access from dse k8s pods to 
hdfs and hive-metastore prod [puppet] - 10https://gerrit.wikimedia.org/r/901562 (https://phabricator.wikimedia.org/T331859) [11:40:25] (03PS1) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [11:40:54] (03PS2) 10Btullis: Omit anaconda-wmf packages from bullseye onwards [puppet] - 10https://gerrit.wikimedia.org/r/901543 (https://phabricator.wikimedia.org/T329363) [11:41:36] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [11:42:40] 10SRE-Sprint-Week-Sustainability-March2023, 10PoolCounter, 10serviceops, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Add monitoring of poolcounter service - https://phabricator.wikimedia.org/T83729 (10Joe) 05Open→03Resolved [11:46:05] 10SRE-Sprint-Week-Sustainability-March2023, 10DBA, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366 (10Volans) [11:46:18] (03CR) 10Btullis: [C: 03+1] "Looks good to me, thanks elukey." [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) (owner: 10Elukey) [11:47:18] 10SRE, 10observability, 10Upstream: atop on stretch overloading a host - https://phabricator.wikimedia.org/T192551 (10jcrespo) @BTullis While your suggestions seems reasonable, please note that the main reason why that was removed was the unwillingness of upstream to support our use cases. I personally have... [11:50:11] (03CR) 10Btullis: [C: 03+1] "Excellent! Thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [11:50:41] (03CR) 10Muehlenhoff: Add the python-is-python3 package to bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [11:55:40] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (10MoritzMuehlenhoff) Looks good, best to use ganeti group C. [12:02:12] (03PS4) 10Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) [12:02:53] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/901543 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [12:03:56] 10Puppet, 10Infrastructure Security, 10Infrastructure-Foundations, 10User-jbond: Restrict GIDs for system users to 499 as the upper boundary - https://phabricator.wikimedia.org/T235162 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [12:06:45] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10eoghan) a:03eoghan [12:07:19] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Kubernetes, 10Sustainability (Incident Followup): Investigate whether running >1 replicas of calico-typha is feasible and prudent - https://phabricator.wikimedia.org/T292077 (10akosiaris) 05Open→03Resolved a:03akosiaris Nope, resolving it. [12:08:45] 10SRE, 10SRE-Sprint-Week-Sustainability-March2023, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10akosiaris) Which team is on paper the owner of requestctl? 
[12:09:31] 10SRE, 10SRE-Sprint-Week-Sustainability-March2023, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10akosiaris) Adding @KOfori as well, he might have an answer. [12:09:56] (03PS3) 10Jbond: hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) [12:10:17] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Volans) a:03Volans [12:10:42] (03CR) 10Jbond: "would also like to add tests" [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [12:11:12] (03CR) 10CI reject: [V: 04-1] hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [12:11:24] (03PS3) 10Btullis: Add the python-is-python3 package to hadoop:common on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) [12:11:26] 10SRE-Sprint-Week-Sustainability-March2023, 10serviceops, 10Sustainability (Incident Followup): Relax nodeAffinity of sessionstore - https://phabricator.wikimedia.org/T325139 (10akosiaris) Note that we also have taints on the dedicated to sessionstore nodes (albeit marked as kask, to avoid having other thing... 
[12:11:42] (03CR) 10Btullis: [C: 03+2] Omit anaconda-wmf packages from bullseye onwards [puppet] - 10https://gerrit.wikimedia.org/r/901543 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [12:13:51] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40237/console" [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [12:20:17] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:24:16] (03PS8) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:26:04] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:27:32] (03PS9) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:29:07] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/901562 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [12:29:23] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:29:40] (03CR) 10Btullis: [C: 03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/901561 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [12:30:20] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/901552 (https://phabricator.wikimedia.org/T310933) (owner: 10Volans) [12:31:56] (03PS10) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:32:56] 10SRE, 10SRE-Sprint-Week-Sustainability-March2023, 10conftool, 10Sustainability (Incident Followup): Make it easier to create a new requestctl object - https://phabricator.wikimedia.org/T310009 (10KOfori) That will be Traffic. 
[12:33:48] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:36:48] (03PS1) 10EoghanGaffney: Relax nodeAffinity of sessionstore pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/901572 (https://phabricator.wikimedia.org/T325139) [12:36:52] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/901550 (https://phabricator.wikimedia.org/T327792) (owner: 10Filippo Giunchedi) [12:37:00] (03PS11) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:38:11] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove k8s cache not updating alert [puppet] - 10https://gerrit.wikimedia.org/r/901550 (https://phabricator.wikimedia.org/T327792) (owner: 10Filippo Giunchedi) [12:38:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/901554 (owner: 10Nicolas Fraison) [12:38:55] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:40:05] (03PS12) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:41:55] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:44:14] (03PS13) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:46:16] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:48:50] (03CR) 10Kamila Součková: [C: 03+1] "LGTM, good job catching the bug. 
(Wouldn't it be nice to have type annotations one day?)" [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901558 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan) [12:49:04] Hi team - I have deployed Airflow code, new jobs work, but some old jobs are failing :( [12:49:46] woops - excuse me - wrong chan [12:53:38] (03PS14) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [12:55:35] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [12:58:33] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::kafka::mirror: default to use pki migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901547 (https://phabricator.wikimedia.org/T319372) (owner: 10Elukey) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1300). [13:00:05] No Gerrit patches in the queue for this window AFAICS. 
[13:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1300) [13:01:05] (03PS1) 10Hashar: zuul: fix up service enable and ensure [puppet] - 10https://gerrit.wikimedia.org/r/901576 [13:01:18] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (owner: 10Hashar) [13:01:24] (03CR) 10CI reject: [V: 04-1] zuul: fix up service enable and ensure [puppet] - 10https://gerrit.wikimedia.org/r/901576 (owner: 10Hashar) [13:03:48] if anyone has a Microsoft Windows, I would love a test of `composer typos` from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/899520 :] [13:04:36] (03PS2) 10Hashar: zuul: fix up service enable and ensure [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) [13:05:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] sre: add redis memory full alert [alerts] - 10https://gerrit.wikimedia.org/r/901141 (https://phabricator.wikimedia.org/T110169) (owner: 10Giuseppe Lavagetto) [13:05:42] (03CR) 10Hashar: [C: 04-1] "The zuul merger and zuul server have some breakage in Puppet. The merger and server use different parameters for ensure/enable. 
I am refa" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [13:05:47] !log move kafka mirror maker instances to PKI migration settings (new truststores) - T319372 [13:05:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:53] T319372: Move Kafka main to the new intermediate PKI CA - https://phabricator.wikimedia.org/T319372 [13:05:55] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [13:06:46] (03Merged) 10jenkins-bot: sre: add redis memory full alert [alerts] - 10https://gerrit.wikimedia.org/r/901141 (https://phabricator.wikimedia.org/T110169) (owner: 10Giuseppe Lavagetto) [13:07:51] (03CR) 10Nicolas Fraison: [C: 03+2] dse-k8s: authorize pods to connect to pki.discovery.wmnet:8443 [puppet] - 10https://gerrit.wikimedia.org/r/901554 (owner: 10Nicolas Fraison) [13:08:57] (03PS3) 10Elukey: role::kafka::jumbo::broker: enable PKI migration settings [puppet] - 10https://gerrit.wikimedia.org/r/901549 (https://phabricator.wikimedia.org/T296064) [13:09:45] 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops, 10Sustainability (Incident Followup): Add alerting for Memcached timeout errors - https://phabricator.wikimedia.org/T278946 (10Joe) 05Open→03Resolved We already added such an alert (porting it from check_prometheus) that is also... 
[13:11:09] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [13:11:22] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on kafka-main1005.eqiad.wmnet with reason: Stop kafka, update idrac/bios/nic-firmware [13:13:12] RECOVERY - Check systemd state on cephosd1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:14:21] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40238/console" [puppet] - 10https://gerrit.wikimedia.org/r/901561 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [13:15:48] (03CR) 10Nicolas Fraison: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40240/console" [puppet] - 10https://gerrit.wikimedia.org/r/901562 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [13:16:15] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] hadoop: Authorize access from dse k8s pods to hdfs and hive-metastore test [puppet] - 10https://gerrit.wikimedia.org/r/901561 (https://phabricator.wikimedia.org/T331859) (owner: 10Nicolas Fraison) [13:16:49] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1005.eqiad.wmnet [13:17:43] 10SRE, 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10Papaul) @Marostegui physical inspection, everything look good I just verified the power cords on both sides (PDU's and server) to make sure that it is all the way plugged in. 
[13:18:55] (03CR) 10Jameel Kaisar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar) [13:19:26] 10SRE, 10ops-codfw, 10DBA: Unexplained reboot of es2029.codfw.wmnet - https://phabricator.wikimedia.org/T332603 (10Marostegui) p:05High→03Medium Thanks @Papaul given that...I am going to leave the server depooled until Monday and will repool back. That way we can give it a few days to make sure it is all... [13:20:34] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10fgiunchedi) cc {T302639} and @jbond since this came up in discussion [13:21:07] (RedisMemoryFull) firing: Redis memory full on gitlab2002:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=codfw%20prometheus/ops&var-job=redis_gitlab&var-instance=gitlab2002:9121&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:21:07] (RedisMemoryFull) firing: (2) Redis memory full on rdb2007:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:21:19] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:21:22] 10SRE, 10observability: Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084 (10fgiunchedi) cc {{T302639}} and @jbond since this came up in discussion [13:21:57] <_joe_> the redis memory alerts are expected. 
Some are ok, some are not [13:23:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:25:36] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:26:07] (RedisMemoryFull) firing: (4) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:26:07] (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [13:28:19] (03PS1) 10Nicolas Fraison: spark: add webhook rights to ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901581 (https://phabricator.wikimedia.org/T331858) [13:28:39] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[13:29:09] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main1005.eqiad.wmnet [13:33:24] (03Abandoned) 10Jbond: hardware: add Memory correctable errors (EDAC) alert [alerts] - 10https://gerrit.wikimedia.org/r/901557 (https://phabricator.wikimedia.org/T294564) (owner: 10Jbond) [13:33:31] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1005.eqiad.wmnet [13:33:47] (03PS1) 10Jbond: P:contacts: fixup lint [puppet] - 10https://gerrit.wikimedia.org/r/901585 [13:33:51] (03PS1) 10Jbond: P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 [13:34:07] (03CR) 10Nicolas Fraison: [C: 03+2] spark: add webhook rights to ClusterRoleBinding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901581 (https://phabricator.wikimedia.org/T331858) (owner: 10Nicolas Fraison) [13:35:21] 10SRE, 10ChangeProp, 10EventStreams, 10Image-Suggestion-API, and 4 others: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (10Jdforrester-WMF) [13:35:33] (03PS2) 10Jbond: P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 [13:35:44] (03CR) 10CI reject: [V: 04-1] P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [13:36:15] (03PS4) 10Btullis: Add the python-is-python3 package to hadoop:common on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) [13:37:01] (03CR) 10CI reject: [V: 04-1] P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [13:37:16] (03CR) 10Btullis: [C: 04-1] "I'm not sure that this is working as intended anyway. Even with python-is-python3 manually added, something is still adding python2.7." 
[puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [13:37:16] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup): Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084 (10Volans) FYI I'm working on T253810 [13:38:49] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:38:54] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:39:52] (03PS3) 10Jbond: P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 [13:40:40] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [13:41:31] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40243/console" [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [13:42:15] (03PS2) 10Jbond: P:contacts: fixup lint [puppet] - 10https://gerrit.wikimedia.org/r/901585 [13:42:21] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. [13:42:23] (03PS4) 10Jbond: P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 [13:42:27] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'sync'. [13:42:35] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'sync'. 
[13:45:32] (03PS1) 10Urbanecm: Growth: Disable GEPersonalizedPraiseEnabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901588 (https://phabricator.wikimedia.org/T322443) [13:45:33] (03CR) 10Jbond: [C: 03+2] P:contacts: fixup lint [puppet] - 10https://gerrit.wikimedia.org/r/901585 (owner: 10Jbond) [13:45:50] (03PS2) 10Urbanecm: Growth: Disable GEPersonalizedPraiseEnabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901588 (https://phabricator.wikimedia.org/T322443) [13:46:01] (03PS5) 10Jbond: P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 [13:47:56] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40244/console" [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [13:48:32] (03CR) 10Jbond: "for pcc check the full diff e.g. https://puppet-compiler.wmflabs.org/output/901586/40244/an-airflow1004.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [13:48:36] (03CR) 10TheDJ: [C: 04-1] "The changed default is only deployed next week, so this should probably wait a week ?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901553 (https://phabricator.wikimedia.org/T327515) (owner: 10Samwilson) [13:50:22] 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10SRE Observability: How should we monitor for faulty memory modules? 
- https://phabricator.wikimedia.org/T302639 (10jbond) 05Open→03In progress p:05Triage→03Medium a:05jhathaway→03jbond [13:51:06] (03PS27) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [13:56:35] (03PS3) 10Hashar: zuul: fix up service enable and ensure [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) [13:56:42] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [13:58:10] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:58:27] (03CR) 10Kosta Harlan: [C: 03+1] Growth: Disable GEPersonalizedPraiseEnabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901588 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm) [13:59:08] * urbanecm steals last two minutes of the window to sync a no-op patch [13:59:25] (03PS1) 10Nicolas Fraison: spark: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/901590 [13:59:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901588 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm) [13:59:54] (03PS4) 10Hashar: zuul: fix up service enable and ensure [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) [14:00:07] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:00:33] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:00:57] (03Merged) 10jenkins-bot: Growth: Disable GEPersonalizedPraiseEnabled everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901588 (https://phabricator.wikimedia.org/T322443) (owner: 10Urbanecm) [14:02:53] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:901588|Growth: Disable GEPersonalizedPraiseEnabled everywhere (T322443)]] [14:02:58] T322443: Personalized praise: new mentor dashboard module - https://phabricator.wikimedia.org/T322443 [14:05:57] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts kafka-main1005.eqiad.wmnet [14:05:59] (03PS28) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) [14:06:05] (03CR) 10JHathaway: [C: 03+1] maps: remove tilerator and cassandra (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [14:06:34] (03CR) 10Nicolas Fraison: [C: 03+2] spark: bump chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/901590 (owner: 10Nicolas Fraison) [14:08:12] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:08:31] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:10:22] !log elukey@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts kafka-main1005.eqiad.wmnet [14:10:46] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:901588|Growth: Disable GEPersonalizedPraiseEnabled everywhere (T322443)]] (duration: 07m 53s) [14:10:51] T322443: Personalized praise: new mentor dashboard module - https://phabricator.wikimedia.org/T322443 [14:11:42] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:11:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [14:14:37] !log jnuche@deploy2002 Installing scap version "latest" for 587 hosts [14:14:59] (03CR) 10Hashar: [C: 03+1] "Puppet compiler https://puppet-compiler.wmflabs.org/output/901576/1675/" [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar) [14:15:34] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:15:53] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
[14:15:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:17:49] (03PS5) 10Hashar: zuul: fix up service enable and ensure [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) [14:17:51] (03PS4) 10Hashar: site: add contint2002 to ci::master role [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [14:17:57] !log jnuche@deploy2002 Installing scap version "latest" for 587 hosts [14:18:18] (03PS1) 10Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) [14:18:30] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [14:20:19] (03CR) 10Muehlenhoff: "chexk experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [14:20:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [14:20:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:21:00] PROBLEM - Host db1150 is DOWN: PING CRITICAL - Packet loss = 100% [14:21:17] ^ known? [14:21:24] doubt it [14:21:39] should we depool it? 
[14:22:37] it's not a master, we can depool it [14:22:39] marostegui: heads up [14:23:00] I think it is not mediawiki [14:23:05] that's backups [14:23:06] related [14:23:07] I think it is a backup source [14:23:08] ok [14:23:09] let me double check [14:23:15] yeah, it is [14:23:15] it's s4/s5 [14:23:18] jynus: do you want me to handle it? [14:23:30] no I can [14:23:33] no, I can [14:23:37] I'll create the task [14:23:55] but if it is the 2nd time a host loses power, it could become a trend [14:23:58] That could be something to add, how to know if a db is a backup host or not [14:24:24] claime: I agree, I had a dashboard but wasn't accepted [14:24:37] so I have to use memory [14:24:40] https://phabricator.wikimedia.org/T332708 [14:24:52] jynus: but they are in different DCs [14:25:09] still, it is very random [14:25:16] let me check the logs [14:25:36] RAM error [14:25:46] I was already on the console [14:26:10] claime: if you have handy, mind pasting it on the task? [14:26:16] yeah [14:26:16] if not we can do it [14:26:47] done [14:26:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:27:03] uff, classic :( [14:27:07] let me add dc ops [14:27:21] 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 unexpectedly down - https://phabricator.wikimedia.org/T332708 (10Marostegui) [14:27:38] jynus: We probably need to get them involved to test the DIMM/main board [14:27:46] I can't even have a terminal big enough to go before these errors [14:27:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts kafka-main1005.eqiad.wmnet [14:28:10] All DIMM_A8 [14:28:28] claime: yeah, it can be the dimm itself or the mainboard, that's why we need DCOps to help [14:28:34] 10SRE, 
10Machine-Learning-Team, 10serviceops-radar, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10calbon) 05Open→03Resolved [14:28:47] 10SRE, 10Machine-Learning-Team, 10serviceops-radar, 10artificial-intelligence: New Service Request 'open_nsfw' - https://phabricator.wikimedia.org/T250110 (10calbon) This work is superceded by the NSFW model made by @Htriedman and deployed by @AikoChou to experiment model server. Closing ticket. [14:28:51] Usually they swap DIMMs and wait for the error to reproduce [14:29:13] I am checking if I have to depool the host [14:29:14] * marostegui standing by to see what jynus needs help with [14:29:23] from backups [14:29:40] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [14:29:41] e.g. if there was activity at the time [14:29:52] jynus: ok, I am going to ping DCOps on the task [14:29:57] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [14:30:08] claime: are you still logged in? [14:30:11] yes [14:30:29] 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops, 10Sustainability (Incident Followup), and 2 others: Monitor rdb hosts for memory/disk usage (redis_lock, aka redis_misc) - https://phabricator.wikimedia.org/T110169 (10Joe) 05Open→03Resolved [14:30:33] can you force a soft power reset? 
I don't see it coming in [14:30:39] ack [14:30:40] *up [14:30:53] normally there are memory errors on boot [14:31:49] 10SRE-Sprint-Week-Sustainability-March2023, 10TimedMediaHandler-Transcode, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Add rate limiting to the jobqueue vidoscalers to prevent overloads - https://phabricator.wikimedia.org/T278945 (10Joe) a:03Joe [14:31:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:32:15] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup: db1150 unexpectedly down - https://phabricator.wikimedia.org/T332708 (10Marostegui) @wiki_willy we'd need help with the above. We probably need to get the DIMM swapped with another DIMM to see if the problem is the DIMM itself or the main board. This h... [14:32:25] 10SRE-Sprint-Week-Sustainability-March2023, 10WMF-JobQueue, 10serviceops, 10Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (10Joe) a:03Joe [14:32:55] it is indeed the chosen host for s4 and s5 for both dumps and snapshots [14:33:12] I have a backup for s4 backups [14:33:19] not so sure for s5 [14:34:06] claime: is it booting?
[14:34:17] I can't find the right command :/ [14:34:26] jynus: We can re-provision db1133 if you need it [14:34:27] let me take over, I can show you later [14:34:34] (03CR) 10Volans: [C: 03+2] HaproxyUnavailable: add link to runbook [alerts] - 10https://gerrit.wikimedia.org/r/901552 (https://phabricator.wikimedia.org/T310933) (owner: 10Volans) [14:34:35] ack [14:34:55] I have enough hosts, don't worry, marostegui, it was just about time investment :-D [14:35:02] (03PS2) 10Bking: rdf-streaming-updater: use correct release and app [deployment-charts] - 10https://gerrit.wikimedia.org/r/901240 (https://phabricator.wikimedia.org/T328675) [14:35:23] I always recover those hosts from backups anyway [14:35:43] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [14:35:47] (03Merged) 10jenkins-bot: HaproxyUnavailable: add link to runbook [alerts] - 10https://gerrit.wikimedia.org/r/901552 (https://phabricator.wikimedia.org/T310933) (owner: 10Volans) [14:36:28] Ah, serveraction powercycle, found it... [14:36:30] jynus: Cool, I will stand by now. If you need me, speak up [14:36:43] claime: yep [14:36:46] (03PS1) 10Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [14:36:59] https://wikitech.wikimedia.org/wiki/SRE/Dc-operations/Platform-specific_documentation/Dell_Documentation [14:37:02] Adding to on-call checklist :P [14:37:13] good one [14:37:33] which one? 
[14:37:54] Adding to on-call checklist :P -> that [14:37:55] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1005.eqiad.wmnet with OS bullseye [14:38:02] (03CR) 10CI reject: [V: 04-1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - 10https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: 10Cathal Mooney) [14:38:04] !log disabling puppet on maps* before merging 760619 [14:38:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:10] but which one, the link I shared? [14:38:11] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1005.eqiad.wmnet with OS bullseye [14:38:22] jynus: In general [14:38:38] 10SRE-tools, 10Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (10RobH) 05Resolved→03Open Please note we've run into this issue again today: During the work of sprint week in reimaging hosts, Luca was using the sr...
[14:39:11] (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/867673/1676/" [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn) [14:39:13] (03PS2) 10Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) [14:39:37] not sure I am getting any response with soft powercycle, waiting a bit just in case, otherwise doing a hard reset [14:39:53] or alternatively, the host is fried [14:40:01] (03PS10) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [14:40:03] jynus: Unfortunately, I have seen that before with memory issues, needing the crash cart to make it boot up [14:40:08] Not always, but sometimes [14:40:28] I am waiting a bit, just in case it is just a serial output artifact [14:40:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (10Jclark-ctr) @cmooney Racks e5-7 f5-7 have been cabled and racked do you want to use same ticket for those Switches?
[14:40:57] jouncebot: nowandnext [14:40:57] No deployments scheduled for the next 1 hour(s) and 19 minute(s) [14:40:57] In 1 hour(s) and 19 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1600) [14:41:22] going to disable notifications for host, this will take time [14:41:30] claime: you can resolve ongoing notifications if any [14:41:34] ack [14:41:34] *incidents [14:42:04] It didn't trigger a page [14:42:21] (03PS11) 10Hnowlan: maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) [14:42:47] (03CR) 10Hnowlan: maps: remove tilerator and cassandra (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [14:43:03] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) >>! In T332650#8712706, @Tgr wrote: > A... [14:44:09] (03PS1) 10Jcrespo: monitoring: Disable notifications for db1150 after crash [puppet] - 10https://gerrit.wikimedia.org/r/901601 (https://phabricator.wikimedia.org/T332708) [14:44:30] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1154 - https://phabricator.wikimedia.org/T332649 (10Cmjohnson) a:03Cmjohnson A new SSD has been requested from Dell. You have successfully submitted request SR164648098. 
[14:44:36] (03CR) 10CI reject: [V: 04-1] monitoring: Disable notifications for db1150 after crash [puppet] - 10https://gerrit.wikimedia.org/r/901601 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo) [14:45:43] (03CR) 10Hnowlan: [C: 03+2] maps: remove tilerator and cassandra [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [14:45:47] (03PS3) 10Samtar: InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900828 (https://phabricator.wikimedia.org/T332521) [14:46:02] (03PS2) 10Jcrespo: monitoring: Disable notifications for db1150 after crash [puppet] - 10https://gerrit.wikimedia.org/r/901601 (https://phabricator.wikimedia.org/T332708) [14:46:20] (03PS1) 10Giuseppe Lavagetto: changeprop-jobqueue: reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) [14:46:51] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10jcrespo) [14:46:54] (03PS2) 10Giuseppe Lavagetto: changeprop-jobqueue: reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) [14:47:18] (03CR) 10Marostegui: [C: 03+1] monitoring: Disable notifications for db1150 after crash [puppet] - 10https://gerrit.wikimedia.org/r/901601 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo) [14:47:32] 10SRE, 10ops-eqiad, 10DBA: Degraded RAID on db1154 - https://phabricator.wikimedia.org/T332649 (10Marostegui) Thank you! 
[14:47:37] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=maps1005.eqiad.wmnet [14:48:16] (03CR) 10Jcrespo: [C: 03+2] monitoring: Disable notifications for db1150 after crash [puppet] - 10https://gerrit.wikimedia.org/r/901601 (https://phabricator.wikimedia.org/T332708) (owner: 10Jcrespo) [14:48:31] (03CR) 10Hnowlan: [C: 03+2] imagemagick: rename self.exif to self.exif_dict [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901558 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan) [14:48:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:49:44] did a hard reset, still no activity [14:49:50] :-( [14:49:58] (03PS1) 10Btullis: Add the spark3 shuffle service jars to the yarn resourcemanager [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) [14:49:58] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=kartotherian,name=maps1005.eqiad.wmnet [14:51:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps1005.eqiad.wmnet [14:52:29] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [14:52:55] (03Merged) 10jenkins-bot: imagemagick: rename self.exif to self.exif_dict [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/901558 (https://phabricator.wikimedia.org/T331995) (owner: 10Hnowlan) [14:52:58] (03PS2) 10Btullis: Add the spark3 shuffle service jars to the yarn resourcemanager [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) [14:53:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST 
pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:54:29] (03PS1) 10DCausse: rdf-streaming-updater: add jarURI [deployment-charts] - 10https://gerrit.wikimedia.org/r/901608 (https://phabricator.wikimedia.org/T328675) [14:54:55] (03PS1) 10Jbond: wmflib::argparse: allow specifying a different separator [puppet] - 10https://gerrit.wikimedia.org/r/901609 [14:54:57] (03PS1) 10Jbond: ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) [14:55:13] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [14:55:38] (03PS3) 10Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) [14:55:46] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [14:55:52] (03PS2) 10Ahmon Dancy: devtools hiera: Move profile::mediawiki::scap_client::is_master: true to devtools-1004.yaml [puppet] - 10https://gerrit.wikimedia.org/r/901309 [14:55:59] (03CR) 10Ahmon Dancy: devtools hiera: Move profile::mediawiki::scap_client::is_master: true to devtools-1004.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901309 (owner: 10Ahmon Dancy) [14:56:20] (03CR) 10CI reject: [V: 04-1] devtools hiera: Move profile::mediawiki::scap_client::is_master: true to devtools-1004.yaml [puppet] - 10https://gerrit.wikimedia.org/r/901309 (owner: 10Ahmon Dancy) [14:56:27] jynus: It's dead, Jim. 
[14:57:32] (03PS3) 10Ahmon Dancy: devtools hiera: Fix profile::mediawiki::scap_client::is_master setting [puppet] - 10https://gerrit.wikimedia.org/r/901309 [14:57:39] (03PS3) 10Btullis: Add the spark3 shuffle service jars to the yarn resourcemanager [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) [14:59:14] RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:59:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901289 (https://phabricator.wikimedia.org/T309609) (owner: 10Samtar) [14:59:35] (03PS2) 10Jbond: ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) [14:59:55] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1005.eqiad.wmnet [15:00:13] (03Merged) 10jenkins-bot: wgAbuseFilterConditionLimit: Set default condition limit to 2000 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901289 (https://phabricator.wikimedia.org/T309609) (owner: 10Samtar) [15:00:39] !log samtar@deploy2002 Started scap: Backport for [[gerrit:901289|wgAbuseFilterConditionLimit: Set default condition limit to 2000 (T309609)]] [15:00:45] T309609: Increase $wgAbuseFilterConditionLimit - https://phabricator.wikimedia.org/T309609 [15:01:59] (03CR) 10BCornwall: [C: 03+1] "Thanks!" 
[alerts] - 10https://gerrit.wikimedia.org/r/900626 (https://phabricator.wikimedia.org/T309182) (owner: 10Filippo Giunchedi) [15:02:09] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad [15:02:16] !log samtar@deploy2002 samtar: Backport for [[gerrit:901289|wgAbuseFilterConditionLimit: Set default condition limit to 2000 (T309609)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [15:02:27] (03PS4) 10Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) [15:02:28] !log elukey@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1005.eqiad.wmnet with OS bullseye [15:02:56] (03PS1) 10Elukey: install_server: fix reuse-raid10-8dev recipe [puppet] - 10https://gerrit.wikimedia.org/r/901611 (https://phabricator.wikimedia.org/T332013) [15:02:58] (03CR) 10Btullis: "Some nits but they're all in the comments/docs. +1 apart from those." [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [15:03:35] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10jcrespo) Sadly it doesn't powercycle from the management interface, so requiring "manual" power drain and reboot when possible from #DC-OPS. 
[15:04:32] (syncing) [15:06:05] (03PS1) 10Hashar: doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) [15:06:07] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: add jarURI [deployment-charts] - 10https://gerrit.wikimedia.org/r/901608 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:06:55] (03PS3) 10Jbond: ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) [15:07:07] (03CR) 10Elukey: [C: 03+2] install_server: fix reuse-raid10-8dev recipe [puppet] - 10https://gerrit.wikimedia.org/r/901611 (https://phabricator.wikimedia.org/T332013) (owner: 10Elukey) [15:08:00] (03CR) 10CI reject: [V: 04-1] doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [15:08:32] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10Cmjohnson) a:03Cmjohnson DIMM has been ordered through Dell [15:09:00] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [15:09:05] 10SRE, 10ops-eqiad: ps1-d1-eqiad and ps1-d6-eqiad down - https://phabricator.wikimedia.org/T332641 (10Cmjohnson) Acknowledged, will investigate and update task. [15:09:09] 10SRE, 10ops-eqiad: ps1-d1-eqiad and ps1-d6-eqiad down - https://phabricator.wikimedia.org/T332641 (10Cmjohnson) a:03Cmjohnson [15:09:42] 10SRE, 10ops-eqiad, 10DBA, 10Data-Persistence-Backup, 10Patch-For-Review: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (10jcrespo) The host will be left unused and with notifications disabled so it can be serviced at any time (no rush). Thank you. 
[15:10:12] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:901289|wgAbuseFilterConditionLimit: Set default condition limit to 2000 (T309609)]] (duration: 09m 32s) [15:10:18] T309609: Increase $wgAbuseFilterConditionLimit - https://phabricator.wikimedia.org/T309609 [15:10:38] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-main1005.eqiad.wmnet with OS bullseye [15:10:45] 10SRE, 10SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10odimitrijevic) Approved [15:11:00] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10Joe) a:05a... [15:11:35] (03Merged) 10jenkins-bot: rdf-streaming-updater: add jarURI [deployment-charts] - 10https://gerrit.wikimedia.org/r/901608 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:12:00] (03PS5) 10Muehlenhoff: Make Python2 removal on Bullseye configurable [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) [15:13:17] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [15:16:04] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900828 (https://phabricator.wikimedia.org/T332521) (owner: 10Samtar) [15:16:34] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:16:49] (03Merged) 10jenkins-bot: InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900828 (https://phabricator.wikimedia.org/T332521) (owner: 10Samtar) [15:17:09] 
!log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:17:12] !log samtar@deploy2002 Started scap: Backport for [[gerrit:900828|InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions (T332521)]] [15:17:17] T332521: Set 'blockautopromote', 'block' and 'rangeblock' global AbuseFilter actions as locally disabled - https://phabricator.wikimedia.org/T332521 [15:17:26] (03PS1) 10Muehlenhoff: Add htriedman to analytics-platform-eng-admins [puppet] - 10https://gerrit.wikimedia.org/r/901614 (https://phabricator.wikimedia.org/T331647) [15:18:34] (03PS4) 10Jbond: ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) [15:19:00] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10MoritzMuehlenhoff) @Htriedman This also needs approval by your manager on task, then we're good to merge https://gerrit.wikimedia.org/r/901614 [15:19:09] !log samtar@deploy2002 samtar: Backport for [[gerrit:900828|InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions (T332521)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [15:19:47] (03CR) 10Volans: [C: 03+1] "LGTM, check pcc too ;)" [puppet] - 10https://gerrit.wikimedia.org/r/901609 (owner: 10Jbond) [15:21:05] (syncing) [15:21:10] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40252/console" [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [15:21:19] (03PS5) 10Btullis: Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) [15:21:42] (03CR) 10CI reject: [V: 04-1] Allow hive on 
bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [15:21:58] (03PS2) 10Hashar: doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) [15:22:31] (03CR) 10CI reject: [V: 04-1] doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [15:22:39] (03CR) 10Volans: [C: 03+1] "LGTM, nit inline" [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [15:22:56] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:23:47] (03PS1) 10Vgutierrez: hiera: Test maxconn per backend in cp4044 and cp4052 [puppet] - 10https://gerrit.wikimedia.org/r/901616 (https://phabricator.wikimedia.org/T310609) [15:24:27] (03PS3) 10Hashar: doc: upgrade php from 7.3 to 7.4 [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) [15:24:29] (03CR) 10Muehlenhoff: Allow hive on bullseye to install and use the correct packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [15:25:10] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40253/console" [puppet] - 10https://gerrit.wikimedia.org/r/901616 (https://phabricator.wikimedia.org/T310609) (owner: 10Vgutierrez) [15:25:25] (03PS1) 10DCausse: rdf-streaming-updater: jarURI should use local:// not file:// [deployment-charts] - 10https://gerrit.wikimedia.org/r/901617 (https://phabricator.wikimedia.org/T328675) [15:25:32] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: jarURI should use local:// not file:// [deployment-charts] - 
10https://gerrit.wikimedia.org/r/901617 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:26:01] (03CR) 10Hnowlan: [C: 03+1] changeprop-jobqueue: reduce concurrency of video transcoding [deployment-charts] - 10https://gerrit.wikimedia.org/r/901602 (https://phabricator.wikimedia.org/T278945) (owner: 10Giuseppe Lavagetto) [15:26:24] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:900828|InitialiseSettings: Set wgAbuseFilterLocallyDisabledGlobalActions (T332521)]] (duration: 09m 11s) [15:26:28] (03PS2) 10DCausse: rdf-streaming-updater: jarURI should use local:// not file:// [deployment-charts] - 10https://gerrit.wikimedia.org/r/901617 (https://phabricator.wikimedia.org/T328675) [15:26:29] T332521: Set 'blockautopromote', 'block' and 'rangeblock' global AbuseFilter actions as locally disabled - https://phabricator.wikimedia.org/T332521 [15:26:41] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1005.eqiad.wmnet with reason: host reimage [15:27:03] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Htriedman) @Jcross asking for approval from you — I need these rights in order to deploy DP scripts that will run on a schedule on airflow [15:27:29] (03PS6) 10Btullis: Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) [15:27:45] (03CR) 10Btullis: Allow hive on bullseye to install and use the correct packages (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [15:27:49] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [15:28:28] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: jarURI should use local:// not file:// [deployment-charts] - 
10https://gerrit.wikimedia.org/r/901617 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:30:24] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks. I've added the overrides for the test cluster in here now: https://gerrit.wikimedia.org/r/c/operations/puppet/+/901559" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [15:31:12] (03PS5) 10Jbond: ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) [15:31:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:31:21] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [15:32:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1005.eqiad.wmnet with reason: host reimage [15:32:25] (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/901612/1682/" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar) [15:32:35] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [15:33:20] (03PS1) 10Nicolas Fraison: spark: udapte networkpolicy to authorize kubernetes-api to contact webhook service [deployment-charts] - 
10https://gerrit.wikimedia.org/r/901618 [15:33:40] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40254/console" [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [15:33:42] (03Merged) 10jenkins-bot: rdf-streaming-updater: jarURI should use local:// not file:// [deployment-charts] - 10https://gerrit.wikimedia.org/r/901617 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:33:49] (03CR) 10Jbond: [C: 03+2] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [15:34:03] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:34:27] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:34:54] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:35:35] (03PS2) 10Jbond: wmflib::argparse: allow specifying a different separator [puppet] - 10https://gerrit.wikimedia.org/r/901609 [15:35:53] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) @MoritzMuehlenhoff - alerting that my manager is back so he can sign-off should you need to contact him. Should he contact you? let me know. @Aklapp... 
[15:37:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40256/console" [puppet] - 10https://gerrit.wikimedia.org/r/901609 (owner: 10Jbond) [15:37:51] (03CR) 10Jbond: [V: 03+1 C: 03+2] wmflib::argparse: allow specifying a different separator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901609 (owner: 10Jbond) [15:38:01] (03PS6) 10Jbond: ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) [15:38:06] (03CR) 10Jbond: [V: 03+2] ipmi_exporter: add managed config file [puppet] - 10https://gerrit.wikimedia.org/r/901610 (https://phabricator.wikimedia.org/T302639) (owner: 10Jbond) [15:38:16] (03CR) 10Jaime Nuche: "As an interesting note, devtools hosts were being configured correctly before because the only `profile::mediawiki::scap_client` was deplo" [puppet] - 10https://gerrit.wikimedia.org/r/901309 (owner: 10Ahmon Dancy) [15:39:16] (03CR) 10Jaime Nuche: [C: 03+1] "From an offline conversation, this patch is probably going to be reworked a bit. But LGTM in current state." 
[puppet] - 10https://gerrit.wikimedia.org/r/901309 (owner: 10Ahmon Dancy) [15:39:25] (03PS2) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [15:39:27] (03PS1) 10DCausse: rdf-streaming-updater: fix docker image URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/901619 (https://phabricator.wikimedia.org/T328675) [15:40:00] (03PS4) 10Btullis: Use the spark3 shuffle jars to yarn on the test cluster [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) [15:40:20] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: fix docker image URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/901619 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:40:37] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:42:13] PROBLEM - Check systemd state on cp6002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:15] PROBLEM - Check systemd state on db2169 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:19] PROBLEM - Check systemd state on db2184 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:23] PROBLEM - Check systemd state on an-presto1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:23] PROBLEM - Check systemd state on ms-be2040 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:25] PROBLEM - Check systemd state on kafka-jumbo1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:27] fix incoming [15:42:31] PROBLEM - Check systemd state on mw2337 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:33] PROBLEM - Check systemd state on lvs4008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:45] PROBLEM - Check systemd state on elastic1070 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:45] PROBLEM - Check systemd state on mw2326 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:47] PROBLEM - Check systemd state on elastic2081 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:47] PROBLEM - Check systemd state on an-worker1082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:47] PROBLEM - Check systemd state on kafka-jumbo1004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:49] !log stop puppet from deploying this further [15:42:49] PROBLEM - Check systemd state on ganeti2026 is CRITICAL: CRITICAL - 
degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:49] PROBLEM - Check systemd state on kafka-main2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:51] PROBLEM - Check systemd state on ml-cache1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:53] PROBLEM - Check systemd state on an-worker1129 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:57] PROBLEM - Check systemd state on aqs1016 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:42:57] PROBLEM - Check systemd state on an-tool1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:01] PROBLEM - Check systemd state on cp5031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:01] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:05] PROBLEM - Check systemd state on parse1019 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:07] (03PS3) 10Effie 
Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [15:43:07] PROBLEM - Check systemd state on sretest1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:13] PROBLEM - Check systemd state on elastic2074 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:15] PROBLEM - Check systemd state on mw1451 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:15] PROBLEM - Check systemd state on kafka-jumbo1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:17] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [15:43:25] PROBLEM - Check systemd state on mw1430 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:27] PROBLEM - Check systemd state on mw2449 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:27] PROBLEM - Check systemd state on elastic1099 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ipmi-exporter.service,wmf_auto_restart_prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:29] PROBLEM - Check systemd state on mw2338 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:33] PROBLEM - Check systemd state on cp1077 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:39] PROBLEM - Check systemd state on elastic1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:43] jbond: ^ ? [15:43:47] PROBLEM - Check systemd state on db2132 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:47] PROBLEM - Check systemd state on mw1470 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:49] PROBLEM - Check systemd state on cp5025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:51] bblack: yes fixing sorry [15:43:53] PROBLEM - Check systemd state on ms-be2048 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:53] PROBLEM - Check systemd state on mc2044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:53] PROBLEM - Check systemd state on mw1410 is CRITICAL: CRITICAL - 
degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:54] ok [15:43:55] PROBLEM - Check systemd state on db1117 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:43:57] PROBLEM - Check systemd state on parse2018 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:01] PROBLEM - Check systemd state on analytics1064 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:03] PROBLEM - Check systemd state on mw2339 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:05] PROBLEM - Check systemd state on mw2427 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:07] PROBLEM - Check systemd state on cp3050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:11] PROBLEM - Check systemd state on db1177 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:17] PROBLEM - Check systemd state on ms-be1044 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:23] PROBLEM - Check systemd state on dumpsdata1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:23] PROBLEM - Check systemd state on dbprov1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:25] PROBLEM - Check systemd state on parse1012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:25] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [15:44:27] PROBLEM - Check systemd state on aqs1020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:29] PROBLEM - Check systemd state on mw2423 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:31] PROBLEM - Check systemd state on db1182 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:35] PROBLEM - Check systemd state on cp6008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:35] PROBLEM - Check systemd state on db1104 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:35] PROBLEM - Check systemd state on parse1023 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:43] PROBLEM - Check systemd state on db2167 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:43] PROBLEM - Check systemd state on mw1471 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:43] PROBLEM - Check systemd state on cp5030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:43] PROBLEM - Check systemd state on cp4042 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:47] PROBLEM - Check systemd state on db1187 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:53] PROBLEM - Check systemd state on ms-backup1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:54] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: (1) VM for MySQL Orchestrator - https://phabricator.wikimedia.org/T332718 (10jhathaway) [15:44:57] PROBLEM - Check systemd state on parse2006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:59] PROBLEM - Check systemd state on dns1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:44:59] PROBLEM - Check systemd state on an-worker1128 is 
CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:00] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:01] PROBLEM - Check systemd state on mw2313 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:01] PROBLEM - Check systemd state on parse1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:03] PROBLEM - Check systemd state on clouddb1016 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:06] 10SRE, 10Infrastructure-Foundations, 10vm-requests: Site: (1) VM for MySQL Orchestrator - https://phabricator.wikimedia.org/T332718 (10jhathaway) a:03jhathaway [15:45:09] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:11] PROBLEM - Check systemd state on db1136 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:11] PROBLEM - Check systemd state on elastic1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:15] PROBLEM - Check systemd state on db2112 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:15] PROBLEM - Check systemd state on db2115 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:17] PROBLEM - Check systemd state on ml-serve1008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:23] PROBLEM - Check systemd state on ms-be2062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:23] (03PS1) 10Jbond: ipmi-blackbox exporter fix [puppet] - 10https://gerrit.wikimedia.org/r/901621 [15:45:25] PROBLEM - Check systemd state on logstash2036 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:29] PROBLEM - Check systemd state on db2134 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:31] PROBLEM - Check systemd state on logstash1034 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:32] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM for MySQL Orchestrator - https://phabricator.wikimedia.org/T332718 (10jhathaway) [15:45:33] PROBLEM - Check systemd state on parse2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:33] PROBLEM - Check systemd state on elastic2065 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:33] PROBLEM - Check systemd state on lvs4009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:33] PROBLEM - Check systemd state on mw2279 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:33] PROBLEM - Check systemd state on mw2358 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:35] PROBLEM - Check systemd state on cp3059 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:35] PROBLEM - Check systemd state on mw2389 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:38] (03CR) 10Jbond: [V: 03+2 C: 03+2] ipmi-blackbox exporter fix [puppet] - 10https://gerrit.wikimedia.org/r/901621 (owner: 10Jbond) [15:45:39] PROBLEM - Check systemd state on wdqs1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:41] PROBLEM - Check systemd state on cp3065 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:43] PROBLEM - Check systemd state on mw2309 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:45] PROBLEM - Check systemd state on elastic1089 is CRITICAL: CRITICAL - 
degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:49] PROBLEM - Check systemd state on thanos-fe1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:51] PROBLEM - Check systemd state on dns6002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:53] PROBLEM - Check systemd state on elastic2043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:55] PROBLEM - Check systemd state on mw2408 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:55] PROBLEM - Check systemd state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:55] PROBLEM - Check systemd state on kubernetes2010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:57] PROBLEM - Check systemd state on mw1362 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:57] PROBLEM - Check systemd state on dumpsdata1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:57] PROBLEM - Check systemd state on cp5017 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:59] PROBLEM - Check systemd state on mw2432 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:59] PROBLEM - Check systemd state on wdqs2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:03] PROBLEM - Check systemd state on restbase1019 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:03] PROBLEM - Check systemd state on db2104 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:03] PROBLEM - Check systemd state on db2165 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:07] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:07] PROBLEM - Check systemd state on aqs1017 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:09] PROBLEM - Check systemd state on elastic1083 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:11] 10SRE, 10Infrastructure-Foundations, 10vm-requests: eqiad: (1) VM for MySQL Orchestrator - https://phabricator.wikimedia.org/T332718 (10jhathaway) 
dborch1002.wikimedia.org [15:46:15] PROBLEM - Check systemd state on mw1436 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:18] (03Merged) 10jenkins-bot: rdf-streaming-updater: fix docker image URL [deployment-charts] - 10https://gerrit.wikimedia.org/r/901619 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse) [15:46:21] PROBLEM - Check systemd state on cp5027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:21] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10observability, 10serviceops-radar, 10Sustainability (Incident Followup): Monitor request throughput on etcd/confd hosts to prevent incidents of software requiring config reload too often - https://phabricator.wikimedia.org/T322400 (10Joe) p:05T... [15:46:25] PROBLEM - Check systemd state on ms-be1057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:27] PROBLEM - Check systemd state on dumpsdata1003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:31] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:31] PROBLEM - Check systemd state on logstash2027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:33] PROBLEM - Check systemd state on mw2381 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:35] PROBLEM - Check systemd state on clouddb1014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:37] PROBLEM - Check systemd state on db2103 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:37] PROBLEM - Check systemd state on mw1474 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:37] PROBLEM - Check systemd state on db2176 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:39] PROBLEM - Check systemd state on restbase2014 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:41] PROBLEM - Check systemd state on mw1379 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:47] PROBLEM - Check systemd state on db2133 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:50] (03PS1) 10Giuseppe Lavagetto: etcd: add alert for high traffic volumes [alerts] - 10https://gerrit.wikimedia.org/r/901622 (https://phabricator.wikimedia.org/T322400) [15:46:53] PROBLEM - Check systemd state on elastic2060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:55] PROBLEM - Check systemd 
state on mw1433 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:55] PROBLEM - Check systemd state on mc1050 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:57] PROBLEM - Check systemd state on mc1049 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:59] PROBLEM - Check systemd state on logstash2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:59] PROBLEM - Check systemd state on mw2293 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:59] PROBLEM - Check systemd state on analytics1069 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:46:59] PROBLEM - Check systemd state on parse1021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:03] PROBLEM - Check systemd state on ganeti1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:03] PROBLEM - Check systemd state on ganeti2030 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:05] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: 
The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:05] PROBLEM - Check systemd state on restbase2025 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:05] PROBLEM - Check systemd state on mw2299 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:05] PROBLEM - Check systemd state on ms-be2066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:13] PROBLEM - Check systemd state on an-worker1135 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:15] PROBLEM - Check systemd state on graphite2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:17] PROBLEM - Check systemd state on analytics1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:19] PROBLEM - Check systemd state on mw2297 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:21] PROBLEM - Check systemd state on wdqs2007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:21] PROBLEM - Check systemd state on ms-be1040 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:21] PROBLEM - Check systemd state on elastic2079 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:23] PROBLEM - Check systemd state on rdb1011 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:23] PROBLEM - Check systemd state on cloudelastic1006 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:26] !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:47:31] PROBLEM - Check systemd state on restbase1027 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:31] PROBLEM - Check systemd state on ms-fe2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:31] PROBLEM - Check systemd state on an-worker1148 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:33] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:35] !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mediawiki-page-content-change-enrichment: apply [15:47:37] PROBLEM - Check systemd state on an-master1001 is CRITICAL: CRITICAL - degraded: The 
following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:41] PROBLEM - Check systemd state on kubernetes2008 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:49] PROBLEM - Check systemd state on mw1494 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:49] PROBLEM - Check systemd state on ms-be1043 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:49] PROBLEM - Check systemd state on dbproxy1021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:53] PROBLEM - Check systemd state on cp3053 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:53] PROBLEM - Check systemd state on cp3057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:53] PROBLEM - Check systemd state on an-db1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:59] PROBLEM - Check systemd state on parse2012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:01] PROBLEM - Check systemd state on db1103 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:03] PROBLEM - Check systemd state on elastic2082 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:05] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:13] PROBLEM - Check systemd state on netmon2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:13] PROBLEM - Check systemd state on kubernetes1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:13] PROBLEM - Check systemd state on kubernetes1018 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:13] PROBLEM - Check systemd state on mc1041 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:17] PROBLEM - Check systemd state on elastic1066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:17] PROBLEM - Check systemd state on mw1459 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:17] PROBLEM - Check systemd state on mw1456 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:23] PROBLEM - Check systemd state on restbase1022 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:23] PROBLEM - Check systemd state on mw1361 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:23] PROBLEM - Check systemd state on an-worker1106 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:33] PROBLEM - Check systemd state on kafka-main2002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:35] PROBLEM - Check systemd state on mw2329 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:35] PROBLEM - Check systemd state on gitlab-runner2003 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:37] PROBLEM - Check systemd state on mw2353 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:39] PROBLEM - Check systemd state on mw1352 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:47] PROBLEM - Check systemd state on an-worker1142 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:51] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:48:53] PROBLEM - Check systemd state on restbase1021 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:15] RECOVERY - Check systemd state on sretest1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:23] RECOVERY - Check systemd state on cp4042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:27] RECOVERY - Check systemd state on cp3057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:27] RECOVERY - Check systemd state on cp3053 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:30] (PS1) AOkoth: eventgate: add EventgateErrorsLoggingExternal alert [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [15:49:31] PROBLEM - Check systemd state on aqs1012 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:31] PROBLEM - Check systemd state on analytics1066 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:31] PROBLEM - Check systemd state on backup1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:31] PROBLEM - Check systemd state on an-worker1126 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:33] RECOVERY - Check systemd state on cp5027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:35] PROBLEM - Check systemd state on cp2032 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:39] RECOVERY - Check systemd state on mw1430 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:39] PROBLEM - Check systemd state on db1122 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:39] PROBLEM - Check systemd state on db1113 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:43] PROBLEM - Check systemd state on db2124 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:49] PROBLEM - Check systemd state on lvs3005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:51] PROBLEM - Check systemd state on ms-be2056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:55] RECOVERY - Check systemd state on mw1459 is OK: OK - running: The system is 
fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:55] RECOVERY - Check systemd state on mw1456 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:57] PROBLEM - Check systemd state on puppetmaster1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:49:57] PROBLEM - Check systemd state on restbase2013 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:07] RECOVERY - Check systemd state on mw1433 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:09] RECOVERY - Check systemd state on cp5025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:13] RECOVERY - Check systemd state on analytics1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:13] RECOVERY - Check systemd state on parse1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:23] RECOVERY - Check systemd state on cp3059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:27] RECOVERY - Check systemd state on cp3050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:29] RECOVERY - Check systemd state on analytics1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:29] RECOVERY - Check systemd state on cp3065 is OK: OK - running: The 
system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:37] (CR) CI reject: [V: -1] eventgate: add EventgateErrorsLoggingExternal alert [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth) [15:50:39] RECOVERY - Check systemd state on dumpsdata1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:39] RECOVERY - Check systemd state on dns6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:43] RECOVERY - Check systemd state on aqs1020 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:43] RECOVERY - Check systemd state on aqs1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:43] RECOVERY - Check systemd state on ms-fe2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:44] RECOVERY - Check systemd state on kubernetes2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:44] RECOVERY - Check systemd state on dumpsdata1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:51] RECOVERY - Check systemd state on cp5031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:51] RECOVERY - Check systemd state on parse1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:51] RECOVERY - Check systemd state on parse1023 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:53] RECOVERY - Check systemd state on kubernetes2008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:55] RECOVERY - Check systemd state on aqs1017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:01] RECOVERY - Check systemd state on cp5030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:01] RECOVERY - Check systemd state on mw1436 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:01] RECOVERY - Check systemd state on mw1451 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:09] RECOVERY - Check systemd state on aqs1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:09] PROBLEM - Check systemd state on an-worker1130 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:11] (PS1) Jcrespo: dbbackups: Setup db1145 as a backup source replacement for db1150 [puppet] - https://gerrit.wikimedia.org/r/901624 (https://phabricator.wikimedia.org/T332708) [15:51:15] PROBLEM - Check systemd state on db1102 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:17] RECOVERY - Check systemd state on dumpsdata1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:17] PROBLEM - Check systemd state on db1185 is CRITICAL: CRITICAL - degraded: 
The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:23] PROBLEM - Check systemd state on ganeti1016 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:24] (CR) Dzahn: [C: +2] devtools hiera: Fix profile::mediawiki::scap_client::is_master setting [puppet] - https://gerrit.wikimedia.org/r/901309 (owner: Ahmon Dancy) [15:51:25] PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:26] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host dborch1002.wikimedia.org [15:51:27] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [15:51:29] RECOVERY - Check systemd state on kubernetes1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:31] RECOVERY - Check systemd state on mc1041 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:37] RECOVERY - Check systemd state on db1136 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:41] (PS2) Nicolas Fraison: spark: udapte networkpolicy to authorize kubernetes-api to contact webhook service [deployment-charts] - https://gerrit.wikimedia.org/r/901618 [15:51:41] RECOVERY - Check systemd state on db2115 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:45] RECOVERY - Check systemd state on mw1470 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:47] RECOVERY - 
Check systemd state on elastic2060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:47] RECOVERY - Check systemd state on mc1050 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:49] RECOVERY - Check systemd state on mc1049 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:49] RECOVERY - Check systemd state on an-presto1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:55] RECOVERY - Check systemd state on ganeti1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:51:55] RECOVERY - Check systemd state on ganeti2030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:01] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1005.eqiad.wmnet with OS bullseye [15:52:03] RECOVERY - Check systemd state on mw2427 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:07] RECOVERY - Check systemd state on wdqs1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:09] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:13] RECOVERY - Check systemd state on wdqs2007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:17] RECOVERY - Check systemd state on ml-cache1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:19] RECOVERY - Check systemd state on ganeti2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:19] RECOVERY - Check systemd state on elastic2043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:23] RECOVERY - Check systemd state on mw2408 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:27] RECOVERY - Check systemd state on mw2423 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:27] RECOVERY - Check systemd state on mw2432 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:29] RECOVERY - Check systemd state on wdqs2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:29] RECOVERY - Check systemd state on db1182 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:29] RECOVERY - Check systemd state on an-master1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:30] !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:52:30] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache dborch1002.wikimedia.org on all recursors [15:52:33] RECOVERY - Check systemd state on db2104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:33] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dborch1002.wikimedia.org on all recursors [15:52:36] !log 
jhathaway@cumin1001 START - Cookbook sre.dns.netbox [15:52:41] RECOVERY - Check systemd state on mw1471 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:45] RECOVERY - Check systemd state on mw1494 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:47] RECOVERY - Check systemd state on db1187 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:51] RECOVERY - Check systemd state on an-db1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:53] RECOVERY - Check systemd state on backup1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:59] PROBLEM - Check systemd state on db2150 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:59] PROBLEM - Check systemd state on db2168 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:03] RECOVERY - Check systemd state on db1185 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:07] RECOVERY - Check systemd state on db2124 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:07] RECOVERY - Check systemd state on ganeti1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:11] (PS2) AOkoth: eventgate: add EventgateErrorsLoggingExternal alert [alerts] - https://gerrit.wikimedia.org/r/901623 
(https://phabricator.wikimedia.org/T309009) [15:53:13] PROBLEM - Check systemd state on parse2010 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:15] RECOVERY - Check systemd state on mw1474 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:15] RECOVERY - Check systemd state on db2103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:15] RECOVERY - Check systemd state on kubernetes1018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:17] RECOVERY - Check systemd state on restbase2014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:19] RECOVERY - Check systemd state on elastic1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:23] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:23] RECOVERY - Check systemd state on restbase2013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:25] RECOVERY - Check systemd state on mw1361 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:27] RECOVERY - Check systemd state on db2112 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:37] RECOVERY - Check systemd state on logstash2002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:37] RECOVERY - Check systemd state on mw2293 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:39] RECOVERY - Check systemd state on mw1352 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:41] RECOVERY - Check systemd state on mw2299 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:41] RECOVERY - Check systemd state on db2134 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:45] RECOVERY - Check systemd state on elastic2065 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:45] RECOVERY - Check systemd state on mw2279 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:48] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:53:48] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache dborch1002.wikimedia.org on all recursors [15:53:51] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dborch1002.wikimedia.org on all recursors [15:53:53] RECOVERY - Check systemd state on db1177 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:55] RECOVERY - Check systemd state on elastic1070 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:55] RECOVERY - Check systemd state on mw2309 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:55] RECOVERY - Check systemd 
state on mw2297 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:57] !log jhathaway@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host dborch1002.wikimedia.org [15:53:57] RECOVERY - Check systemd state on elastic2079 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:59] RECOVERY - Check systemd state on elastic2081 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:53:59] RECOVERY - Check systemd state on rdb1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:07] RECOVERY - Check systemd state on mw1362 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:09] RECOVERY - Check systemd state on mw1372 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:09] RECOVERY - Check systemd state on an-tool1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:13] RECOVERY - Check systemd state on an-worker1097 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:14] (CR) CI reject: [V: -1] eventgate: add EventgateErrorsLoggingExternal alert [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth) [15:54:17] RECOVERY - Check systemd state on db2165 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:21] RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:21] (CR) Hashar: "I have asked internally to the security team since I believe they have experience with CORS headers." [puppet] - https://gerrit.wikimedia.org/r/900663 (https://phabricator.wikimedia.org/T214068) (owner: Hashar) [15:54:27] RECOVERY - Check systemd state on db2167 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:27] RECOVERY - Check systemd state on elastic2074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:29] PROBLEM - Check systemd state on cp2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:29] RECOVERY - Check systemd state on dbproxy1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:37] PROBLEM - Check systemd state on db2182 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:39] PROBLEM - Check systemd state on kafka-logging1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:43] RECOVERY - Check systemd state on db2150 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:43] RECOVERY - Check systemd state on cp2032 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:43] RECOVERY - Check systemd state on ms-backup1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:45] RECOVERY - Check systemd state on db2168 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:47] PROBLEM - Check systemd state on mw2402 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:49] RECOVERY - Check systemd state on elastic2082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:51] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:53] RECOVERY - Check systemd state on ml-serve1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:53] RECOVERY - Check systemd state on mw2449 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:55] RECOVERY - Check systemd state on mw2313 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:55] RECOVERY - Check systemd state on mw2338 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:54:59] RECOVERY - Check systemd state on netmon2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:03] RECOVERY - Check systemd state on mw1379 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:09] RECOVERY - Check systemd state on an-worker1106 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:11] RECOVERY - Check systemd state on db2133 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:17] RECOVERY - Check systemd state on db2132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:21] RECOVERY - Check systemd state on mw2329 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:21] RECOVERY - Check systemd state on gitlab-runner2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:23] RECOVERY - Check systemd state on logstash2036 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:25] RECOVERY - Check systemd state on mw2353 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:27] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:27] RECOVERY - Check systemd state on restbase2025 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:27] RECOVERY - Check systemd state on logstash1034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:29] RECOVERY - Check systemd state on mw2337 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:31] RECOVERY - Check systemd state on mw2339 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[15:55:31] RECOVERY - Check systemd state on mw2358 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:37] RECOVERY - Check systemd state on graphite2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:39] RECOVERY - Check systemd state on restbase1021 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:45] RECOVERY - Check systemd state on mw2326 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:47] RECOVERY - Check systemd state on an-worker1082 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:53] RECOVERY - Check systemd state on dbprov1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:57] RECOVERY - Check systemd state on restbase1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:57] RECOVERY - Check systemd state on an-worker1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:05] RECOVERY - Check systemd state on restbase1019 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:07] RECOVERY - Check systemd state on db1104 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:09] RECOVERY - Check systemd state on cp6008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:13] RECOVERY - Check systemd state on elastic1083 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:19] RECOVERY - Check systemd state on cp2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:19] RECOVERY - Check systemd state on kafka-jumbo1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:19] RECOVERY - Check systemd state on ms-be1043 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:21] PROBLEM - Check systemd state on kafka-main1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:25] PROBLEM - Check systemd state on ms-be1060 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-ipmi-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:27] RECOVERY - Check systemd state on db2182 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:29] RECOVERY - Check systemd state on an-worker1130 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:29] RECOVERY - Check systemd state on an-worker1126 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:31] RECOVERY - Check systemd state on kafka-logging1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:37] RECOVERY - Check systemd state on db1103 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:37] RECOVERY - Check systemd 
state on db1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:39] RECOVERY - Check systemd state on db1122 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:39] RECOVERY - Check systemd state on db1113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:41] RECOVERY - Check systemd state on dns1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:41] RECOVERY - Check systemd state on parse2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:41] SRE-tools, Infrastructure-Foundations: cookbooks: sre.hosts.reboot-single update to support disabled puppet - https://phabricator.wikimedia.org/T325153 (elukey) Adding more context - I needed to stop gracefully kafka on the node and I've disabled puppet to avoid getting the daemon back in running state.... 
[15:56:45] RECOVERY - Check systemd state on an-worker1128 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:45] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:45] RECOVERY - Check systemd state on parse1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:45] RECOVERY - Check systemd state on logstash2027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:45] RECOVERY - Check systemd state on logstash1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:47] RECOVERY - Check systemd state on mw2381 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:49] RECOVERY - Check systemd state on parse2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:49] RECOVERY - Check systemd state on clouddb1014 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:50] RECOVERY - Check systemd state on clouddb1016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:50] RECOVERY - Check systemd state on lvs3005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:53] RECOVERY - Check systemd state on db2176 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:53] RECOVERY - Check systemd state on ms-be2056 is OK: OK 
- running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:55] RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:57] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover, Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (jsn.sherman) @Tgr It looks like we're not directly r... [15:56:57] RECOVERY - Check systemd state on elastic1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:59] RECOVERY - Check systemd state on elastic1102 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:56:59] RECOVERY - Check systemd state on restbase1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:02] !log running from cumin1001: transfer.py --type=decompress dbprov1003.eqiad.wmnet:/srv/backups/snapshots/latest/snapshot.s5.2023-03-20--04-00-30.tar.gz db1145.eqiad.wmnet:/srv/sqldata.s5 [15:57:02] (PS3) Nicolas Fraison: spark: udapte networkpolicy to authorize kubernetes-api to contact webhook service [deployment-charts] - https://gerrit.wikimedia.org/r/901618 (https://phabricator.wikimedia.org/T331858) [15:57:03] RECOVERY - Check systemd state on cp6002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:03] RECOVERY - Check systemd state on db2169 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[15:57:09] RECOVERY - Check systemd state on db2184 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:09] RECOVERY - Check systemd state on ms-be2062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:11] RECOVERY - Check systemd state on kafka-main2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:11] RECOVERY - Check systemd state on mw1410 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:13] RECOVERY - Check systemd state on kafka-jumbo1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:13] RECOVERY - Check systemd state on ms-be2048 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:13] RECOVERY - Check systemd state on ms-be2040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:13] RECOVERY - Check systemd state on mc2044 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:13] RECOVERY - Check systemd state on db1117 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:17] RECOVERY - Check systemd state on parse2018 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:19] RECOVERY - Check systemd state on ms-be2066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:19] RECOVERY - Check systemd state on parse2008 is OK: OK - running: 
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:21] RECOVERY - Check systemd state on analytics1064 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:24] SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) >>! In... [15:57:25] RECOVERY - Check systemd state on mw2389 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:25] RECOVERY - Check systemd state on lvs4009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:25] RECOVERY - Check systemd state on lvs4008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:27] RECOVERY - Check systemd state on an-worker1135 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:27] RECOVERY - Check systemd state on an-worker1142 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:35] RECOVERY - Check systemd state on elastic1089 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:37] RECOVERY - Check systemd state on ms-be1040 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:37] RECOVERY - Check systemd state on ms-be1044 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:39] RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:39] RECOVERY - Check systemd state on kafka-jumbo1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:39] RECOVERY - Check systemd state on thanos-fe1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:43] RECOVERY - Check systemd state on kafka-main2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:47] RECOVERY - Check systemd state on an-worker1129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:47] SRE, SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (MoritzMuehlenhoff) >>! In T331482#8715365, @FNavas-foundation wrote: > @MoritzMuehlenhoff - alerting that my manager is back so he can sign-off should you need to conta... 
[15:57:47] RECOVERY - Check systemd state on parse1012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:49] RECOVERY - Check systemd state on an-worker1148 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:53] RECOVERY - Check systemd state on cp5017 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:13] RECOVERY - Check systemd state on kafka-main1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:21] RECOVERY - Check systemd state on analytics1066 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:27] RECOVERY - Check systemd state on parse2012 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:31] RECOVERY - Check systemd state on mw2402 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:58:42] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: Btullis) [16:00:05] jbond and rzl: OwO what's this, a deployment window?? Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1600). nyaa~ [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:52] SRE, ops-eqiad, DBA, Data-Persistence-Backup, Patch-For-Review: db1150 crashed: DIMM_A8 memory issues - https://phabricator.wikimedia.org/T332708 (jcrespo) FYI: Logged but forgot to add the ticket number: ` jynus: running from cumin1001: transfer.py --type=decompress dbprov1003.eqiad.wmnet:/s... 
[16:01:55] (PS2) Cathal Mooney: Alertmanager alerts/rules for irc.wikimedia.org [alerts] - https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) [16:03:07] (CR) CI reject: [V: -1] Alertmanager alerts/rules for irc.wikimedia.org [alerts] - https://gerrit.wikimedia.org/r/901599 (https://phabricator.wikimedia.org/T327793) (owner: Cathal Mooney) [16:03:20] * Lucas_WMDE blames TheresNoTime for that jouncebot message [16:03:52] >:D [16:04:48] SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Vgutierrez)... [16:05:39] (CR) Giuseppe Lavagetto: [C: +1] hiera: Test maxconn per backend in cp4044 and cp4052 [puppet] - https://gerrit.wikimedia.org/r/901616 (https://phabricator.wikimedia.org/T310609) (owner: Vgutierrez) [16:06:35] (CR) Ahmon Dancy: [C: +1] "LGTM!" 
[puppet] - https://gerrit.wikimedia.org/r/896318 (owner: Majavah) [16:07:39] SRE, Infrastructure-Foundations, vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (Stevemunene) [16:08:51] SRE, Infrastructure-Foundations, vm-requests: eqiad: 1 VM request for hadoop-test-client - https://phabricator.wikimedia.org/T332656 (Stevemunene) Make vm with `sudo cookbook sre.ganeti.makevm --vcpus 2 --memory 4 --disk 150 --network analytics --cluster eqiad --group C an-test-client1002` [16:10:13] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.makevm for new host an-test-client1002.eqiad.wmnet [16:10:15] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [16:11:39] (CR) Jcrespo: [C: +2] dbbackups: Setup db1145 as a backup source replacement for db1150 [puppet] - https://gerrit.wikimedia.org/r/901624 (https://phabricator.wikimedia.org/T332708) (owner: Jcrespo) [16:11:45] RECOVERY - Check systemd state on ms-be1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:05] !log stevemunene@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-test-client1002.eqiad.wmnet - stevemunene@cumin1001" [16:13:27] RECOVERY - Check systemd state on ms-be1060 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:53] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM an-test-client1002.eqiad.wmnet - stevemunene@cumin1001" [16:14:53] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:14:53] !log stevemunene@cumin1001 START - Cookbook sre.dns.wipe-cache an-test-client1002.eqiad.wmnet on all recursors [16:14:57] 
!log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-test-client1002.eqiad.wmnet on all recursors [16:15:40] (PS3) AOkoth: eventgate: add EventgateErrorsLoggingExternal alert [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [16:19:31] (CR) Alexandros Kosiaris: [C: +1] "LGTM, let's sync up tomorrow to deploy and do a test." [deployment-charts] - https://gerrit.wikimedia.org/r/901572 (https://phabricator.wikimedia.org/T325139) (owner: EoghanGaffney) [16:22:13] (CR) Vgutierrez: [V: +1 C: +2] hiera: Test maxconn per backend in cp4044 and cp4052 [puppet] - https://gerrit.wikimedia.org/r/901616 (https://phabricator.wikimedia.org/T310609) (owner: Vgutierrez) [16:25:47] (CR) Cwhite: [C: +2] logstash: sample high-volume rdbms lib logging [puppet] - https://gerrit.wikimedia.org/r/900718 (https://phabricator.wikimedia.org/T332228) (owner: Cwhite) [16:26:22] (CR) Btullis: [C: +1] "Looks good to me." [deployment-charts] - https://gerrit.wikimedia.org/r/901618 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison) [16:27:07] (CR) Nicolas Fraison: [C: +2] spark: udapte networkpolicy to authorize kubernetes-api to contact webhook service [deployment-charts] - https://gerrit.wikimedia.org/r/901618 (https://phabricator.wikimedia.org/T331858) (owner: Nicolas Fraison) [16:27:52] (PS6) Jbond: P:contacts: add role owner metric [puppet] - https://gerrit.wikimedia.org/r/901586 [16:28:37] !log upload prometheus-ipmi-exporter_1.6.1 to bullseye [16:28:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:29:43] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40257/console" [puppet] - https://gerrit.wikimedia.org/r/901586 (owner: Jbond) [16:30:04] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[16:30:23] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [16:30:24] (CR) Jbond: "updated see:" [puppet] - https://gerrit.wikimedia.org/r/901586 (owner: Jbond) [16:30:32] (CR) Filippo Giunchedi: "Looks good overall! See inline" [alerts] - https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: AOkoth) [16:32:02] (PS2) BCornwall: Import Host Overview dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/901301 (https://phabricator.wikimedia.org/T331656) [16:33:09] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye [16:35:39] (CR) Filippo Giunchedi: "Nice, thank you! See inline" [puppet] - https://gerrit.wikimedia.org/r/901586 (owner: Jbond) [16:36:15] (CR) Jbond: [C: +1] "lgtm once approved" [puppet] - https://gerrit.wikimedia.org/r/901614 (https://phabricator.wikimedia.org/T331647) (owner: Muehlenhoff) [16:36:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:39:16] (PS3) BCornwall: Import Host Overview dashboard [grafana-grizzly] - https://gerrit.wikimedia.org/r/901301 (https://phabricator.wikimedia.org/T331656) [16:39:57] (PS1) Ahmon Dancy: Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) [16:40:00] (PS1) Nicolas Fraison: spark: typo webhook service [deployment-charts] - https://gerrit.wikimedia.org/r/901668 [16:40:11] (CR) CI reject: [V: -1] Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - https://gerrit.wikimedia.org/r/901667 
(https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy) [16:40:57] (PS2) Ahmon Dancy: Add setting to make /srv/mediawiki -> /srv/mediawiki-staging on deploy servers [puppet] - https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) [16:41:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:42:40] (PS1) Btullis: Upload the spark3-assemly file to HDFS on the test cluster [puppet] - https://gerrit.wikimedia.org/r/901670 (https://phabricator.wikimedia.org/T295072) [16:43:04] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.makevm for new host dborch1002.wikimedia.org [16:43:05] !log jhathaway@cumin1001 START - Cookbook sre.dns.netbox [16:43:06] (CR) Dzahn: "upgrading of doc hosts is a task in the scope of the current SRE sprint week" [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar) [16:43:19] (CR) BCornwall: "Dashboard/000000377 changes detected:" [grafana-grizzly] - https://gerrit.wikimedia.org/r/901301 (https://phabricator.wikimedia.org/T331656) (owner: BCornwall) [16:43:27] (PS7) Jbond: P:contacts: add role owner metric [puppet] - https://gerrit.wikimedia.org/r/901586 [16:43:48] (PS13) Elukey: services: add the first lift wing stream to change-prop [deployment-charts] - https://gerrit.wikimedia.org/r/886918 (https://phabricator.wikimedia.org/T328576) [16:43:50] (PS1) Elukey: ml-services: add autoscaling settings for enwiki drafttopic [deployment-charts] - https://gerrit.wikimedia.org/r/901671 (https://phabricator.wikimedia.org/T328576) [16:43:56] SRE, Traffic: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 
(ssingh) a:ssingh [16:44:49] SRE, Traffic: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (ssingh) a:ssingh [16:45:07] !log jhathaway@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dborch1002.wikimedia.org - jhathaway@cumin1001" [16:45:53] (CR) Ahmon Dancy: "PCC results: https://puppet-compiler.wmflabs.org/output/901667/40258/" [puppet] - https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: Ahmon Dancy) [16:46:15] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM dborch1002.wikimedia.org - jhathaway@cumin1001" [16:46:15] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:46:15] !log jhathaway@cumin1001 START - Cookbook sre.dns.wipe-cache dborch1002.wikimedia.org on all recursors [16:46:18] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) dborch1002.wikimedia.org on all recursors [16:46:38] (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40260/console" [puppet] - https://gerrit.wikimedia.org/r/901586 (owner: Jbond) [16:46:52] (CR) Ahmon Dancy: "I'll make a separate commit to enable the setting." 
[puppet] - 10https://gerrit.wikimedia.org/r/901667 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [16:47:26] (03CR) 10Jbond: "updated: https://puppet-compiler.wmflabs.org/output/901586/40260/an-airflow1004.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [16:48:45] (03CR) 10Dzahn: [C: 03+2] miscweb: switch annual and bienvenida microsites to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901318 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [16:49:19] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM! Neat" [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [16:49:30] (03CR) 10Dzahn: [C: 03+2] miscweb: switch tendril and dbtree microsites to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901319 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [16:49:51] (03PS2) 10Dzahn: miscweb: switch tendril and dbtree microsites to miscweb2003 [puppet] - 10https://gerrit.wikimedia.org/r/901319 (https://phabricator.wikimedia.org/T331896) [16:50:59] !log copy /usr/bin/prometheus-ipmi-exporter from bullseye to buster [16:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:48] 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10RBrounley_WMF) Hello all - apologies for delay. Back from holiday today. Approved! He's doing analysis for a feature we're working on for Breaking News detection. [16:52:35] (03CR) 10Btullis: [C: 04-1] "File locations mentioned in the script are incorrect. Investigating." 
[puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [16:53:48] !log sudo cumin -b 4 -s 40 'C:role::cache::text' 'run-puppet-agent' [16:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:58] (03PS2) 10Volans: es_exporter: add NEL metrics by country [puppet] - 10https://gerrit.wikimedia.org/r/901220 (https://phabricator.wikimedia.org/T328941) [16:53:59] (03PS1) 10Volans: ipmi: add configuration for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/901674 (https://phabricator.wikimedia.org/T253810) [16:54:41] (03PS2) 10Volans: ipmi: add configuration for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/901674 (https://phabricator.wikimedia.org/T253810) [16:54:52] (03CR) 10Volans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/901674 (https://phabricator.wikimedia.org/T253810) (owner: 10Volans) [16:55:08] (03PS1) 10Elukey: services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 [16:55:46] (03CR) 10Jbond: [C: 03+2] P:contacts: add role owner metric [puppet] - 10https://gerrit.wikimedia.org/r/901586 (owner: 10Jbond) [16:57:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/901674 (https://phabricator.wikimedia.org/T253810) (owner: 10Volans) [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1700) [17:00:28] (03CR) 10Volans: [C: 03+2] ipmi: add configuration for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/901674 (https://phabricator.wikimedia.org/T253810) (owner: 10Volans) [17:00:30] (03PS2) 10Elukey: services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 [17:00:36] (03PS3) 10Volans: ipmi: add configuration for ipmiseld [puppet] - 10https://gerrit.wikimedia.org/r/901674 
(https://phabricator.wikimedia.org/T253810) [17:01:50] (03CR) 10Elukey: "Example of internal call:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 (owner: 10Elukey) [17:02:52] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:32] (03CR) 10Klausman: [C: 03+1] services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 (owner: 10Elukey) [17:04:35] (03PS1) 10Ahmon Dancy: DNM: Remove deploy1002/deploy2002 from mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/901676 [17:04:59] (03CR) 10AOkoth: eventgate: add EventgateErrorsLoggingExternal alert (033 comments) [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:05:01] (03CR) 10Nicolas Fraison: [C: 03+2] spark: typo webhook service [deployment-charts] - 10https://gerrit.wikimedia.org/r/901668 (owner: 10Nicolas Fraison) [17:05:34] (03PS4) 10AOkoth: eventgate: add EventgateErrorsLoggingExternal alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [17:07:01] (03PS15) 10Slyngshede: Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 [17:07:02] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [17:07:21] !log nfraison@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [17:07:56] jbond: ok to merge yours? 
P:contacts: add role owner metric (261fd8fa97) [17:08:23] (03CR) 10Filippo Giunchedi: "See inline, please also add a test" [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:08:35] (03PS1) 10Dzahn: miscweb/iegreview: set custom log, don't log into "other_vhosts" file [puppet] - 10https://gerrit.wikimedia.org/r/901677 [17:09:07] (03CR) 10CI reject: [V: 04-1] Squid logformat to ECS [puppet] - 10https://gerrit.wikimedia.org/r/901544 (owner: 10Slyngshede) [17:10:43] (03PS1) 10Dzahn: miscweb/annualreport: use non-generic custom log file name [puppet] - 10https://gerrit.wikimedia.org/r/901678 [17:12:18] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:22] (03PS1) 10Giuseppe Lavagetto: mesh.configuration: add support for custom error pages [deployment-charts] - 10https://gerrit.wikimedia.org/r/901679 (https://phabricator.wikimedia.org/T287983) [17:13:23] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Volans) 05Resolved→03Open [17:13:36] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Volans) 05Open→03Resolved Summary of today's work: 1) `ipmi_exporter` run comm... 
[17:14:19] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Volans) a:05Volans→03None [17:15:10] (03PS2) 10Ahmon Dancy: Experiment: Remove deploy1002/deploy2002 from mediawiki-installation dsh group [puppet] - 10https://gerrit.wikimedia.org/r/901676 (https://phabricator.wikimedia.org/T329857) [17:16:42] (03CR) 10Ahmon Dancy: "Test plan: Merge this. Run puppet on deploy1002/deploy2002. Greg for deploy1002 and deploy2002 in /etc/dsh/group/mediawiki-installation. " [puppet] - 10https://gerrit.wikimedia.org/r/901676 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [17:16:52] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup): Report problems found in server's IPMI SEL - https://phabricator.wikimedia.org/T197084 (10Volans) [17:17:05] 10SRE-Sprint-Week-Sustainability-March2023, 10SRE Observability, 10observability, 10Sustainability (Incident Followup), 10User-MoritzMuehlenhoff: Alert on ECC warnings in SEL - https://phabricator.wikimedia.org/T253810 (10Volans) [17:17:07] (03PS1) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) [17:17:09] (03CR) 10Hnowlan: [C: 03+1] services: tweak lift wing endpoints to allow wikidata-specific endpoints [deployment-charts] - 10https://gerrit.wikimedia.org/r/901675 (owner: 10Elukey) [17:17:11] (03PS2) 10Hashar: build: add local typos check [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) [17:17:13] (03PS1) 10Hashar: (DO NOT SUBMIT) test typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901681 (https://phabricator.wikimedia.org/T332121) [17:17:56] (03CR) 10Hashar: "Child change 
Idf128c23a6fccb5d211a9c4d48745e47288c82cb introduces a typo and should fail as a result" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:17:56] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:18:00] (03CR) 10CI reject: [V: 04-1] (DO NOT SUBMIT) test typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901681 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:20:15] (03CR) 10Jbond: [C: 04-1] "-1 untill software is updated" [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [17:20:56] (03CR) 10Ahmon Dancy: "PCC: https://puppet-compiler.wmflabs.org/output/901676/40261/" [puppet] - 10https://gerrit.wikimedia.org/r/901676 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [17:23:05] 10SRE, 10Commons, 10MediaWiki-File-management, 10StructuredDataOnCommons, and 3 others: Frequent "Error: 429, Too Many Requests" errors on pages with many (>50) thumbnails - https://phabricator.wikimedia.org/T266155 (10doctaxon) @TheDJ thanks a lot [17:23:12] (03PS2) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) [17:25:16] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host an-test-client1002.eqiad.wmnet [17:27:02] (03PS1) 10Stevemunene: Add hadoop-test-client [puppet] - 10https://gerrit.wikimedia.org/r/901682 (https://phabricator.wikimedia.org/T332656) [17:27:12] (03PS4) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [17:27:28] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL 
- degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:22] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [17:29:07] (03PS5) 10Effie Mouzeli: maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) [17:30:23] (03CR) 10CI reject: [V: 04-1] maps: add alerting for OSM sync [alerts] - 10https://gerrit.wikimedia.org/r/901563 (https://phabricator.wikimedia.org/T285328) (owner: 10Effie Mouzeli) [17:32:43] (03PS3) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) [17:32:54] (03CR) 10Btullis: [C: 03+1] Add hadoop-test-client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901682 (https://phabricator.wikimedia.org/T332656) (owner: 10Stevemunene) [17:33:13] (03Abandoned) 10Hashar: (DO NOT SUBMIT) test typos [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901681 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:33:24] (03CR) 10Hashar: "That fails as expected on the child change https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/901681 ;)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:33:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40265/console" [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [17:35:04] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:37:16] (03PS5) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) [17:38:28] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:51] (03CR) 10AOkoth: eventgate: add EventgateLoggingExternalErrors alert (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/901623 (https://phabricator.wikimedia.org/T309009) (owner: 10AOkoth) [17:38:54] (03PS4) 10Jbond: prometheus::ipmi_exporter: update config to use inbuilt sudo option [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) [17:39:52] !log joal@deploy2002 Started deploy [airflow-dags/analytics@e7b1d0b]: Fix analytics HDFSArchiver tasks [airflow-dags/analytics@e7b1d0b] [17:40:02] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40266/console" [puppet] - 10https://gerrit.wikimedia.org/r/901680 (https://phabricator.wikimedia.org/T253810) (owner: 10Jbond) [17:40:04] !log joal@deploy2002 Finished deploy [airflow-dags/analytics@e7b1d0b]: Fix analytics HDFSArchiver tasks [airflow-dags/analytics@e7b1d0b] (duration: 00m 11s) [17:40:44] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:41:24] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01028 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [17:45:10] hopefully fix incoming [17:45:20] (03PS1) 10Jbond: P:contact: Add fix for unowned roles [puppet] - 10https://gerrit.wikimedia.org/r/901683 [17:47:53] (03PS1)
10BCornwall: Import confd dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901684 (https://phabricator.wikimedia.org/T331656) [17:47:54] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:49] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host dborch1002.wikimedia.org [17:49:19] (03CR) 10BCornwall: "Diff:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901684 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [17:50:43] (03CR) 10Bartosz Dziewoński: [C: 03+1] "I can confirm that `composer run typos` works as expected for me on Windows." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/899520 (https://phabricator.wikimedia.org/T332121) (owner: 10Hashar) [17:51:49] (03PS2) 10Jbond: P:contact: Add fix for unowned roles [puppet] - 10https://gerrit.wikimedia.org/r/901683 [17:51:58] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:53:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40268/console" [puppet] - 10https://gerrit.wikimedia.org/r/901683 (owner: 10Jbond) [17:53:30] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:contact: Add fix for unowned roles [puppet] - 10https://gerrit.wikimedia.org/r/901683 (owner: 10Jbond) [17:56:58] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s-dse@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-dse - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [17:58:26] RECOVERY - Widespread puppet agent failures- no
resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004895 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [18:00:04] dancy and brennen: That opportune time is upon us again. Time for a MediaWiki train - Utc-7 Version deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T1800). [18:01:16] (03PS2) 10BCornwall: Import confd dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901684 (https://phabricator.wikimedia.org/T331656) [18:01:18] (03PS1) 10BCornwall: Import application servers RED dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901685 (https://phabricator.wikimedia.org/T331656) [18:01:24] g [18:01:44] l [18:01:54] :( [18:02:02] ; have fun [18:02:09] brett: thanks, needed it :P [18:02:24] *rampage* [18:03:24] (03CR) 10Herron: [C: 03+1] "thanks for the diff!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901301 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [18:05:46] (03CR) 10BCornwall: "Dashboard/RIA1lzDZk changes detected:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901685 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [18:08:40] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:09:20] (03CR) 10Herron: [C: 03+1] Import confd dashboard (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901684 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [18:15:24] (03CR) 10Stevemunene: Add hadoop-test-client (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901682 (https://phabricator.wikimedia.org/T332656) (owner: 10Stevemunene) [18:16:14] 10SRE, 10ops-eqiad: ps1-d1-eqiad and ps1-d6-eqiad down - https://phabricator.wikimedia.org/T332641 (10Jclark-ctr) @Cmjohnson i changed the cable out just now but 
have to step out of the data center. i can continue to look at it later if it's still bad [18:16:25] (03PS1) 10BCornwall: Import Kafka dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901687 (https://phabricator.wikimedia.org/T331656) [18:17:20] (03CR) 10Btullis: [C: 04-1] "Oh, we don't have this jar available because our spark distribution is only installed a pyspark." [puppet] - 10https://gerrit.wikimedia.org/r/901604 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [18:17:33] (03CR) 10BCornwall: "Dashboard/000000027 changes detected:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901687 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [18:18:08] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:19:08] (03PS2) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) [18:21:23] (03PS3) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) [18:22:51] o/ [18:22:56] Let's do this thing [18:23:30] ooh a .1.
how fun [18:24:22] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Import Host Overview dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901301 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [18:28:27] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901689 (https://phabricator.wikimedia.org/T330207) [18:28:29] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901689 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot) [18:29:15] (03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.1 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901689 (https://phabricator.wikimedia.org/T330207) (owner: 10TrainBranchBot) [18:33:28] (03CR) 10Stevemunene: [C: 03+2] Add hadoop-test-client [puppet] - 10https://gerrit.wikimedia.org/r/901682 (https://phabricator.wikimedia.org/T332656) (owner: 10Stevemunene) [18:36:23] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.1 refs T330207 [18:36:29] T330207: 1.41.0-wmf.1 deployment blockers - https://phabricator.wikimedia.org/T330207 [18:38:42] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye [18:38:56] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:44:13] (03PS1) 10JHathaway: Add a dborch vm for testing the bullseye upgrade [puppet] - 10https://gerrit.wikimedia.org/r/901692 (https://phabricator.wikimedia.org/T298959) [18:45:46] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Import confd dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901684 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [18:47:01] (03CR) 10JHathaway: [C: 03+2] Add a dborch vm for testing the bullseye upgrade [puppet] - 
10https://gerrit.wikimedia.org/r/901692 (https://phabricator.wikimedia.org/T298959) (owner: 10JHathaway) [18:48:26] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:48:34] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (10Jcross) Approved [18:51:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:52:34] !log jhathaway@cumin1001 START - Cookbook sre.ganeti.reimage for host dborch1002.wikimedia.org with OS bullseye [18:53:07] (03CR) 10Jaime Nuche: [C: 03+1] "LGTM. Seems we'll require a follow-up patch to remove class scap::l10nupdate" [puppet] - 10https://gerrit.wikimedia.org/r/896318 (owner: 10Majavah) [18:54:00] (03PS1) 10BCornwall: mail: Remove ID [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901694 (https://phabricator.wikimedia.org/T332445) [18:54:40] (03CR) 10BCornwall: "Before:" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901694 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [18:56:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [18:57:50] RECOVERY - Host ps1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 2.33 ms [18:59:26] PROBLEM - ps1-d1-eqiad-infeed-load-tower-A-phase-Y on ps1-d1-eqiad is CRITICAL: CRITICAL - Plugin timed out while executing system 
call https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:01:06] PROBLEM - Host ps1-d1-eqiad is DOWN: PING CRITICAL - Packet loss = 100% [19:01:26] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dborch1002.wikimedia.org with reason: host reimage [19:02:24] (03PS1) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901697 (https://phabricator.wikimedia.org/T315353) [19:02:51] (03CR) 10Herron: [C: 03+1] mail: Remove ID [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901694 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [19:03:27] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e7b1d0b]: initial deployment of glent dag [19:03:42] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e7b1d0b]: initial deployment of glent dag (duration: 00m 14s) [19:04:41] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dborch1002.wikimedia.org with reason: host reimage [19:05:06] RECOVERY - Host ps1-d6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.59 ms [19:07:06] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:58] !log dancy@deploy2002 Installing scap version "4.47.1" for 587 hosts [19:08:44] RECOVERY - Host ps1-d1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms [19:09:08] !log dancy@deploy2002 Installation of scap version "4.47.1" completed for 587 hosts [19:10:06] (03PS1) 10Cwhite: logstash: add mmkubernetes ECS early-stage filter [puppet] - 10https://gerrit.wikimedia.org/r/901630 (https://phabricator.wikimedia.org/T234565) [19:14:45] (03PS1) 10Cwhite: logstash: add k8s statsd-exporter ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901631 
(https://phabricator.wikimedia.org/T234565) [19:17:10] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [19:17:17] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [19:18:55] (03PS7) 10Jbond: Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [19:19:00] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:19:15] 10SRE, 10ops-eqiad: ps1-d1-eqiad and ps1-d6-eqiad down - https://phabricator.wikimedia.org/T332641 (10Jclark-ctr) 05Open→03Resolved a:05Cmjohnson→03Jclark-ctr Rebooted msw in rack d1 ,d6 looks to recovered [19:20:00] (03PS8) 10Jbond: Allow hive on bullseye to install and use the correct packages [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [19:22:29] (03CR) 10Jbond: "lgtm, minor nit" [puppet] - 10https://gerrit.wikimedia.org/r/901595 (https://phabricator.wikimedia.org/T329363) (owner: 10Muehlenhoff) [19:23:29] (03CR) 10Jbond: [C: 03+1] "lgtm i also rebased on moritz change to enforce the dependency" [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [19:25:18] (03CR) 10Jbond: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 27): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40269/console" [puppet] - 10https://gerrit.wikimedia.org/r/901559 (https://phabricator.wikimedia.org/T329363) (owner: 10Btullis) [19:32:08] RECOVERY - ps1-d1-eqiad-infeed-load-tower-A-phase-Y on 
ps1-d1-eqiad is OK: SNMP OK - ps1-d1-eqiad-infeed-load-tower-A-phase-Y 414 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:39:08] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:40:35] (03PS2) 10Bartosz Dziewoński: Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900331 (owner: 10Esanders) [19:40:44] (03PS2) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901697 (https://phabricator.wikimedia.org/T315353) [19:41:26] !log jhathaway@cumin1001 END (ERROR) - Cookbook sre.ganeti.reimage (exit_code=97) for host dborch1002.wikimedia.org with OS bullseye [19:43:50] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye [19:44:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host thanos-fe1004.eqiad.wmnet with OS bullseye [19:44:10] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host thanos-fe1004.eqiad.wmnet with OS bullseye [19:48:36] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:50:16] (03CR) 10BCornwall: [V: 03+2 C: 03+2] mail: Remove ID [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901694 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [19:50:36] (03PS2) 10BCornwall: mail: Remove ID [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901694 
(https://phabricator.wikimedia.org/T332445) [19:50:39] (03CR) 10BCornwall: [V: 03+2] mail: Remove ID [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901694 (https://phabricator.wikimedia.org/T332445) (owner: 10BCornwall) [19:52:26] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-client1002.eqiad.wmnet with OS bullseye [19:59:49] (03PS1) 10Cwhite: logstash: add config-check target [puppet] - 10https://gerrit.wikimedia.org/r/901633 [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: How many deployers does it take to do UTC late backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230321T2000). [20:00:05] MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] hiii [20:00:23] o/ I can deploy [20:01:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900331 (owner: 10Esanders) [20:01:42] thanks taavi. 
sorry for the delay, i promised you this yesterday ;) [20:01:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901697 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [20:02:30] (03Merged) 10jenkins-bot: Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/900331 (owner: 10Esanders) [20:02:32] (03Merged) 10jenkins-bot: Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901697 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [20:02:48] (03Abandoned) 10Hashar: contint: allow CORS header for Zuul change status [puppet] - 10https://gerrit.wikimedia.org/r/900663 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [20:02:55] !log taavi@deploy2002 Started scap: Backport for [[gerrit:900331|Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing]], [[gerrit:901697|Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis (T315353)]] [20:03:01] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:03:08] (03PS2) 10Cwhite: logstash: add config-check target [puppet] - 10https://gerrit.wikimedia.org/r/901633 [20:04:27] !log taavi@deploy2002 esanders and taavi and matmarex: Backport for [[gerrit:900331|Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing]], [[gerrit:901697|Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis (T315353)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:04:50] MatmaRex: can you test the group2 patch? 
the other one looks beta-only which can't be tested here [20:05:12] yeah [20:07:28] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:38] (03CR) 10Cwhite: [C: 03+2] logstash: add config-check target [puppet] - 10https://gerrit.wikimedia.org/r/901633 (owner: 10Cwhite) [20:09:02] taavi: i can't seem to get it to work. i'm not sure if i forgot something [20:09:11] want me to revert? [20:09:19] or can I help somehow troubleshooting? [20:09:22] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye [20:10:05] it should be working though. give me a minute please [20:10:14] what i'm doing is: [20:10:30] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-client1002.eqiad.wmnet with OS bullseye [20:10:31] i'm looking at this page: https://en.wikipedia.org/wiki/Special:FindComment?idorname=c-Matma_Rex-20230321200800-Test and expecting it to find this comment i just made: https://en.wikipedia.org/wiki/User_talk:Matma_Rex#c-Matma_Rex-20230321200800-Test [20:11:08] i guess it's possible that the backend code that would index it is not running on the mwdebug servers, even though i am using them? [20:11:37] does the indexing happen in a job? [20:12:45] it happens in a RevisionDataUpdates hook, so… who knows. but it's likely [20:13:20] the same thing works on testwiki, where this has been live for a while: https://test.wikipedia.org/wiki/Special:FindComment?idorname=c-Matma_Rex-20230321201100-Test [20:13:51] we can sync it out and see what happens, at least I'm not seeing any errors [20:14:33] sounds good to me. 
i'm expecting that it's a job queue thing and that it will work just fine [20:15:11] sure, syncing [20:18:50] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:35] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:900331|Enable DiscussionTools_visualenhancements_newsectionlink_enable on labs for testing]], [[gerrit:901697|Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis (T315353)]] (duration: 17m 40s) [20:20:41] T315353: Create database tables for permalinks in production wikis, and enable the feature - https://phabricator.wikimedia.org/T315353 [20:20:47] MatmaRex: synced, try now? [20:21:51] looking [20:24:10] it's not working :( [20:24:35] oh no [20:24:37] can you leave it deployed for a couple minutes before reverting? i'll try to test some other wikis [20:24:42] sure [20:24:46] or maybe there's a typo somewhere [20:25:42] somehow the diffConfig https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/901697/ for shows no difference [20:25:54] maybe you just need to have it as default => true now? [20:26:19] (03CR) 10Herron: [C: 03+1] Import application servers RED dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901685 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [20:26:40] i guess that would be the same [20:26:53] but why would group2 not be valid there? [20:26:59] especially when group0 and group1 worked [20:27:20] no clue [20:27:31] i'll write the patch [20:28:34] (03PS1) 10Bartosz Dziewoński: Simplify/Fix wgDiscussionToolsEnablePermalinksBackend config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901703 [20:28:40] taavi: ^ [20:29:08] i didn't make a typo in group2 or anything else, right? 
[20:29:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901703 (owner: 10Bartosz Dziewoński) [20:29:58] (03CR) 10Herron: [C: 03+1] Import Kafka dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901687 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [20:30:01] (03Merged) 10jenkins-bot: Simplify/Fix wgDiscussionToolsEnablePermalinksBackend config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901703 (owner: 10Bartosz Dziewoński) [20:30:24] !log taavi@deploy2002 Started scap: Backport for [[gerrit:901703|Simplify/Fix wgDiscussionToolsEnablePermalinksBackend config]] [20:30:43] taavi: i think i see why. in MWMultiVersion.php, under DB_LISTS, group2 is not listed [20:30:59] interesting [20:31:15] but i am not the only one to try using it in config [20:31:17] there's one in InitialiseSettings-labs.php too [20:31:22] although it looks like a no-op placeholder [20:31:59] !log taavi@deploy2002 matmarex and taavi: Backport for [[gerrit:901703|Simplify/Fix wgDiscussionToolsEnablePermalinksBackend config]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [20:32:07] maybe worth filing a task to clarify/document that [20:32:25] anyway, the update is on mwdebug*, try now and see what happens? 
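Note on the group2 exchange above: the reported symptom (a config key that is accepted but changes nothing) is what one would expect if overrides are only applied for tags the resolver knows about. The sketch below is a minimal illustration of that behaviour, not the real mediawiki-config code. `KNOWN_TAGS`, `resolve`, `unknown_keys`, the wiki names, and the shapes of the two settings are all assumptions made for the example.

```python
# Illustrative sketch only: names and wiki lists are hypothetical,
# not the real mediawiki-config code or data.

# Tags the resolver knows about. "group2" is deliberately absent,
# mirroring what was reported about DB_LISTS in the chat.
KNOWN_TAGS = {
    "group0": {"testwiki"},
    "group1": {"examplewiki"},
}


def resolve(setting, wiki, known_tags=KNOWN_TAGS):
    """Return the value of a tag-keyed setting for one wiki."""
    value = setting.get("default")
    for key, override in setting.items():
        if key == "default":
            continue
        # An override applies if the key is the wiki itself or a known tag
        # containing it. Unknown keys fall through without any error.
        if key == wiki or wiki in known_tags.get(key, set()):
            value = override
    return value


def unknown_keys(setting, all_wikis, known_tags=KNOWN_TAGS):
    """List keys that are neither 'default', a known tag, nor a wiki name."""
    return sorted(
        key for key in setting
        if key != "default" and key not in known_tags and key not in all_wikis
    )


ALL_WIKIS = {"testwiki", "examplewiki", "bigwiki"}

# Hypothetical shapes of the setting before and after the follow-up patch.
before = {"default": False, "group0": True, "group1": True, "group2": True}
after = {"default": True}
```

With these assumptions, `resolve(before, "bigwiki")` stays at the default because "group2" is never matched, while `unknown_keys(before, ALL_WIKIS)` surfaces the ignored key. A check of that kind is one way the silent no-op could be caught before deployment.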
[20:33:20] looking [20:33:59] taavi: yep, works now [20:34:04] (after null-editing a page) [20:34:06] cool, syncing [20:35:08] i'll file a task about the group2 thing once we're done [20:39:25] !log taavi@deploy2002 Finished scap: Backport for [[gerrit:901703|Simplify/Fix wgDiscussionToolsEnablePermalinksBackend config]] (duration: 09m 01s) [20:39:36] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:39:43] it's live [20:40:17] thanks :o [20:40:26] for the maintenance scripts, I'm not sure what exactly amir means with "make sure it doesn't conflict" [20:41:09] oh hmm. i haven't seen his comment [20:41:47] but i'd guess that he means that it's not writing to the same tables? (it's not) or could he mean the same db sections? [20:41:55] Amir1: hi, are you here? [20:42:07] MatmaRex: somewhat [20:42:34] read it now [20:42:38] Amir1: what do you mean by "doesn't conflict" in your comment at https://phabricator.wikimedia.org/T315510#8716277 20 minutes ago? :) [20:42:59] I mean there are write heavy maint scripts running on on s8 right now [20:43:17] adding more scripts on s8 could lead to replication not being able to catch up [20:43:23] and wikis going read only [20:43:40] and also it would make both scripts get much slower [20:43:43] luckily i have nothing for s8 [20:43:54] yeah, but s8 is not the only one [20:43:55] are other db sections fine to do this on? [20:43:59] so you're saying we should only start this script on wikis where the tasks you just mentioned are finished / not yet running? 
[20:44:12] taavi: yes [20:45:45] hm [20:46:08] (i'm completely fine with that, this is going to take a long time anyway) [20:46:27] at least s7 seems ok from that perspective right now, not sure about others [20:47:04] yeah, s7 is good but we need to have a central place to keep track of this [20:47:26] (ofc don't run it on all wikis of s7 in parallel, just all wikis of s7) [20:47:51] yeah ofc :D [20:47:52] https://phabricator.wikimedia.org/T315510#8716317 [20:48:09] awesome. thanks [20:48:10] thank you both :) [20:48:57] !log start T315510 migration script on group2 s7 wikis [20:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:03] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [20:49:04] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:09] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host thanos-fe1004.eqiad.wmnet with OS bullseye [20:59:04] i assume everything is fine with the deployment (and the script). i'll be around for a while longer just in case. thanks for deploying! 
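Note on the scheduling rule agreed above: skip database sections that already have write-heavy maintenance scripts running, and within an allowed section run the wikis one after another rather than in parallel. The sketch below shows one way to express that plan. `plan_runs`, the wiki names, and the wiki-to-section mapping are hypothetical; the real mapping and the tracking of busy sections live elsewhere.

```python
# Illustrative sketch only: function name, wiki names, and the
# wiki-to-section mapping are hypothetical.

def plan_runs(wiki_sections, busy_sections):
    """Group wikis by database section, skipping busy sections.

    Each returned list is meant to be processed serially, so that a
    section never has more than one of these scripts writing to it.
    """
    plan = {}
    for wiki, section in sorted(wiki_sections.items()):
        if section in busy_sections:
            # Another write-heavy script is already running here; adding
            # more writes risks replication lag and read-only wikis.
            continue
        plan.setdefault(section, []).append(wiki)
    return plan


wiki_sections = {"wiki_a": "s7", "wiki_b": "s7", "wiki_c": "s8"}
plan = plan_runs(wiki_sections, busy_sections={"s8"})
```

Here only the s7 wikis are planned, in a fixed order. The busy set would need to come from whatever central record is used to track running scripts, which the chat notes does not yet exist.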
[21:00:32] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Import Kafka dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901687 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [21:00:39] (03CR) 10BCornwall: [V: 03+2 C: 03+2] Import application servers RED dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901685 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [21:00:43] (03PS2) 10BCornwall: Import application servers RED dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901685 (https://phabricator.wikimedia.org/T331656) [21:00:56] (03PS2) 10BCornwall: Import Kafka dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901687 (https://phabricator.wikimedia.org/T331656) [21:05:24] PROBLEM - orchestrator.wikimedia.org tls expiry on dborch1002 is CRITICAL: connect to address 208.80.154.77 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:05:34] PROBLEM - Check systemd state on dborch1002 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,orchestrator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:38] PROBLEM - orchestrator.wikimedia.org requires authentication on dborch1002 is CRITICAL: connect to address 208.80.154.77 and port 443: Connection refused https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:05:50] (03CR) 10BCornwall: [V: 03+2] Import application servers RED dashboard [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901685 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [21:06:06] PROBLEM - orchestrator process on dborch1002 is CRITICAL: PROCS CRITICAL: 0 processes with regex args orchestrator http https://wikitech.wikimedia.org/wiki/Orchestrator [21:06:13] jhathaway: maybe not quite working ^ [21:06:20] PROBLEM - orchestrator TCP port on dborch1002 is CRITICAL: connect to address 127.0.0.1 and port 3000: Connection refused 
https://wikitech.wikimedia.org/wiki/Orchestrator [21:07:56] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:10:34] jhathaway: those alerts are most likely expected. you can probably ack all of them and I can do some further tests tomorrow [21:16:53] 10SRE-Sprint-Week-Sustainability-March2023, 10Observability-Logging, 10Wikimedia-Logstash, 10observability, 10Sustainability (Incident Followup): Emergency response to logstash being backlogged - https://phabricator.wikimedia.org/T233735 (10colewhite) [21:17:32] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10phaultfinder) [21:19:16] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:20:21] (03CR) 10Bartosz Dziewoński: Enable wgDiscussionToolsEnablePermalinksBackend on group2 wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901697 (https://phabricator.wikimedia.org/T315353) (owner: 10Bartosz Dziewoński) [21:20:45] (03CR) 10Bartosz Dziewoński: Simplify/Fix wgDiscussionToolsEnablePermalinksBackend config (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901703 (owner: 10Bartosz Dziewoński) [21:21:35] !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye [21:26:10] marostegui: will do, thanks [21:30:06] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host an-test-client1002.eqiad.wmnet with OS bullseye [21:32:39] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) @jsn.sherman I guess that could 
explain why the... [21:33:51] (03PS1) 10JHathaway: dborch: allow dborch1002 to issue an ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/901709 (https://phabricator.wikimedia.org/T298959) [21:34:28] (03CR) 10Cwhite: "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901322 (https://phabricator.wikimedia.org/T330165) (owner: 10Tim Starling) [21:35:51] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40270/console" [puppet] - 10https://gerrit.wikimedia.org/r/901709 (https://phabricator.wikimedia.org/T298959) (owner: 10JHathaway) [21:36:50] (03CR) 10JHathaway: [V: 03+1 C: 03+2] dborch: allow dborch1002 to issue an ssl cert [puppet] - 10https://gerrit.wikimedia.org/r/901709 (https://phabricator.wikimedia.org/T298959) (owner: 10JHathaway) [21:38:14] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:39:24] 10SRE, 10Traffic-Icebox, 10WMF-General-or-Unknown: Varnish: Mobile site redirect interferes with OAuth authorization process - https://phabricator.wikimedia.org/T74186 (10Tgr) As pointed out in T332650#8715452, if the user is not logged in, they get redirected to the login form, and then back via `returnto`/... [21:43:12] RECOVERY - orchestrator.wikimedia.org tls expiry on dborch1002 is OK: OK - Certificate orchestrator.wikimedia.org will expire on Mon 19 Jun 2023 08:41:15 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:43:26] RECOVERY - orchestrator.wikimedia.org requires authentication on dborch1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 596 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [21:49:30] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:50:54] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Gerrit, 10serviceops-collab, and 3 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10Dzahn) @hashar @Jelto Do you have any comments on the question by Volans above? [22:01:31] (03PS1) 10Urbanecm: [Growth] eswiki: Enable mentorship for 35% newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901712 (https://phabricator.wikimedia.org/T332737) [22:01:36] jouncebot: nowandnext [22:01:36] No deployments scheduled for the next 7 hour(s) and 58 minute(s) [22:01:36] In 7 hour(s) and 58 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T0600) [22:02:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901712 (https://phabricator.wikimedia.org/T332737) (owner: 10Urbanecm) [22:02:52] (03Merged) 10jenkins-bot: [Growth] eswiki: Enable mentorship for 35% newcomers [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901712 (https://phabricator.wikimedia.org/T332737) (owner: 10Urbanecm) [22:03:03] (03CR) 10Dzahn: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/900465/40271/" [puppet] - 10https://gerrit.wikimedia.org/r/900465 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [22:03:14] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:901712|[Growth] eswiki: Enable mentorship for 35% newcomers (T332737 
T285235)]] [22:03:21] T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235 [22:03:22] T332737: Increase percentage of newcomers who receive Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T332737 [22:03:46] (03PS2) 10Dzahn: miscweb: add miscweb1003/2003 to rsync_dst_hosts [puppet] - 10https://gerrit.wikimedia.org/r/900465 (https://phabricator.wikimedia.org/T331896) [22:04:50] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:901712|[Growth] eswiki: Enable mentorship for 35% newcomers (T332737 T285235)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:07:48] (03CR) 10Dzahn: [C: 03+2] miscweb/iegreview: set custom log, don't log into "other_vhosts" file [puppet] - 10https://gerrit.wikimedia.org/r/901677 (owner: 10Dzahn) [22:07:53] (03PS2) 10Dzahn: miscweb/iegreview: set custom log, don't log into "other_vhosts" file [puppet] - 10https://gerrit.wikimedia.org/r/901677 [22:08:26] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:10:29] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:901712|[Growth] eswiki: Enable mentorship for 35% newcomers (T332737 T285235)]] (duration: 07m 15s) [22:10:36] T285235: Activate Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T285235 [22:10:37] T332737: Increase percentage of newcomers who receive Growth mentorship at Spanish Wikipedia - https://phabricator.wikimedia.org/T332737 [22:10:47] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) @Tgr I should have checked through the... 
[22:11:00] (03PS3) 10Dzahn: miscweb/iegreview: set custom log, don't log into "other_vhosts" file [puppet] - 10https://gerrit.wikimedia.org/r/901677 (https://phabricator.wikimedia.org/T331896) [22:12:16] (03PS2) 10Dzahn: miscweb/annualreport: use non-generic custom log file name [puppet] - 10https://gerrit.wikimedia.org/r/901678 [22:13:00] (03PS3) 10Dzahn: miscweb/annualreport: use non-generic custom log file name [puppet] - 10https://gerrit.wikimedia.org/r/901678 (https://phabricator.wikimedia.org/T331896) [22:17:23] (03CR) 10Dzahn: [C: 03+2] miscweb/annualreport: use non-generic custom log file name [puppet] - 10https://gerrit.wikimedia.org/r/901678 (https://phabricator.wikimedia.org/T331896) (owner: 10Dzahn) [22:17:52] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:21:42] (03CR) 10Dzahn: "SRE has "sprint week" this week where we are trying to focus on specific "non daily work" things. 
that conflicts a bit with regular review" [puppet] - 10https://gerrit.wikimedia.org/r/901676 (https://phabricator.wikimedia.org/T329857) (owner: 10Ahmon Dancy) [22:31:09] (03PS2) 10Cwhite: logstash: add k8s statsd-exporter ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901631 (https://phabricator.wikimedia.org/T234565) [22:38:34] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:34] RECOVERY - Check systemd state on dborch1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:47:54] (03PS1) 10BCornwall: application_servers/kafka: Remove IDs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901719 (https://phabricator.wikimedia.org/T331656) [22:48:53] (03CR) 10BCornwall: "brett@grafana1002:~/grafana-grizzly$ grr diff static_dashboards.jsonnet" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/901719 (https://phabricator.wikimedia.org/T331656) (owner: 10BCornwall) [22:49:52] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:06] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10jsn.sherman) Considering that all of the impacted cl... 
[22:51:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:52:09] (03PS1) 10Zabe: Revert "dewiki: Allow 'crats to remove sysopship and manage importers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901722 [22:54:51] (03PS2) 10Zabe: Revert "dewiki: Allow 'crats to remove sysopship and manage importers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901722 [22:56:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [22:56:12] (03PS1) 10Zabe: Add messages for Central Kurdish Wiktionary (ckbwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901649 (https://phabricator.wikimedia.org/T331831) [22:56:25] jouncebot: nowandnext [22:56:25] No deployments scheduled for the next 7 hour(s) and 3 minute(s) [22:56:25] In 7 hour(s) and 3 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230322T0600) [22:56:39] (03PS1) 10Bartosz Dziewoński: Document running persistRevisionThreadItems.php for wgExtraSignatureNamespaces changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) [22:56:49] (03CR) 10Zabe: [C: 03+2] Add messages for Central Kurdish Wiktionary (ckbwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901649 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) 
[22:57:18] (03PS1) 10Zabe: Add messages for Angika Wikipedia (anpwiki) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901650 (https://phabricator.wikimedia.org/T332115) [22:57:27] (03PS2) 10Zabe: Add messages for Angika Wikipedia (anpwiki) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901650 (https://phabricator.wikimedia.org/T332115) [22:57:34] (03CR) 10Zabe: [C: 03+2] Add messages for Angika Wikipedia (anpwiki) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901650 (https://phabricator.wikimedia.org/T332115) (owner: 10Zabe) [22:58:04] (03CR) 10Zabe: [C: 03+2] Revert "dewiki: Allow 'crats to remove sysopship and manage importers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901722 (owner: 10Zabe) [22:58:51] (03Merged) 10jenkins-bot: Revert "dewiki: Allow 'crats to remove sysopship and manage importers" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901722 (owner: 10Zabe) [23:00:00] !log zabe@deploy2002 Started scap: [[gerrit:901722|Revert "dewiki: Allow 'crats to remove sysopship and manage importers"]] [23:03:30] (03CR) 10Esanders: [C: 03+1] "Documentation-only change." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/901723 (https://phabricator.wikimedia.org/T332745) (owner: 10Bartosz Dziewoński) [23:06:21] (03PS1) 10Bartosz Dziewoński: Clean up DiscussionTools labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901724 [23:07:10] !log zabe@deploy2002 Finished scap: [[gerrit:901722|Revert "dewiki: Allow 'crats to remove sysopship and manage importers"]] (duration: 07m 10s) [23:08:44] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:09:48] PROBLEM - Check systemd state on mw1372 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:14:13] (03Merged) 10jenkins-bot: Add messages for Central Kurdish Wiktionary (ckbwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901649 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe) [23:14:15] (03Merged) 10jenkins-bot: Add messages for Angika Wikipedia (anpwiki) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901650 (https://phabricator.wikimedia.org/T332115) (owner: 10Zabe) [23:15:55] !log zabe@deploy2002 Started scap: Backport for [[gerrit:901650|Add messages for Angika Wikipedia (anpwiki) (T332115)]], [[gerrit:901649|Add messages for Central Kurdish Wiktionary (ckbwiktionary) (T331831)]] [23:16:02] T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115 [23:16:02] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [23:18:10] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:21:07] (RedisMemoryFull) firing: Redis memory full on rdb1011:16380 - 
https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:26:07] (RedisMemoryFull) resolved: Redis memory full on rdb1011:16380 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://grafana.wikimedia.org/d/000000174/redis?orgId=1&var-datasource=eqiad%20prometheus/ops&var-job=redis_misc&var-instance=rdb1011:16380&viewPanel=16 - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull [23:30:01] 10SRE, 10MediaWiki-extensions-OAuth, 10Performance-Team, 10Datacenter-Switchover, 10Patch-For-Review: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (10Tgr) >>! In T332650#8716622, @jsn.sherman wrote: > C... [23:30:46] (03CR) 10Esanders: [C: 03+1] Clean up DiscussionTools labs config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901724 (owner: 10Bartosz Dziewoński) [23:35:21] !log zabe@deploy2002 zabe: Backport for [[gerrit:901650|Add messages for Angika Wikipedia (anpwiki) (T332115)]], [[gerrit:901649|Add messages for Central Kurdish Wiktionary (ckbwiktionary) (T331831)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet [23:35:28] T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115 [23:35:28] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [23:38:54] PROBLEM - Check systemd state on db1121 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:41:58] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:46:04] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:901650|Add messages for Angika Wikipedia (anpwiki) (T332115)]], [[gerrit:901649|Add messages for Central Kurdish Wiktionary (ckbwiktionary) (T331831)]] (duration: 30m 08s) [23:46:11] T332115: Create Wikipedia Angika - https://phabricator.wikimedia.org/T332115 [23:46:11] T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831 [23:46:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:50:08] (03PS1) 10Zabe: Initial configuration for anpwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/901727 (https://phabricator.wikimedia.org/T332115) [23:50:14] RECOVERY - Check systemd state on db1121 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:53:21] (03PS1) 10Zabe: Add namespace translations for Angika [extensions/Gadgets] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901652 (https://phabricator.wikimedia.org/T332118) [23:53:28] (03CR) 10Zabe: [C: 03+2] Add namespace translations for Angika [extensions/Gadgets] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901652 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe) [23:53:39] (03PS1) 10Zabe: Add namespace translations for Angika [extensions/Scribunto] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901653 (https://phabricator.wikimedia.org/T332118) [23:53:43] (03CR) 10Zabe: [C: 03+2] Add namespace translations for Angika [extensions/Scribunto] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901653 (https://phabricator.wikimedia.org/T332118) 
(owner: 10Zabe) [23:54:12] (03CR) 10Zabe: [C: 03+2] "This change is ready for review." [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901651 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe) [23:56:08] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [extensions/Gadgets] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901652 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe) [23:56:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [extensions/Scribunto] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901653 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe) [23:56:12] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [core] (wmf/1.40.0-wmf.27) - 10https://gerrit.wikimedia.org/r/901651 (https://phabricator.wikimedia.org/T332118) (owner: 10Zabe)