[00:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[00:50:29] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[01:12:06] Data-Engineering, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye executed with errors: - an-worke...
[01:18:21] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:50:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[04:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[05:18:21] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[08:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[08:22:16] (EventgateValidationErrors) firing: ...
[08:22:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:30:57] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey)
[08:32:16] (EventgateValidationErrors) resolved: ...
[08:32:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[08:34:25] Data-Engineering, serviceops, Event-Platform, Patch-For-Review: Upgrade change propagation to nodejs18 - https://phabricator.wikimedia.org/T348950 (elukey) All changeprop instances (except beta) run nodejs 18 and the last version of node-rdkafka. As explained above, there was an increase in cpu u...
[08:45:51] Data-Engineering, MediaWiki-General, Event-Platform, MediaWiki-Platform-Team (Radar): Undelete of page with same title leads to unexpected results - https://phabricator.wikimedia.org/T351411 (pfischer) Thank you @Krinkle for the detailed answer. 🙇 I was not aware of those procedures. TIL: page r...
[09:06:16] (EventgateValidationErrors) firing: ...
[09:06:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[09:14:15] (PS1) Brouberol: Replace an-druid1001 by an-druid1001 in druid connection strings [analytics/refinery] - https://gerrit.wikimedia.org/r/975206
[09:15:06] (PS2) Brouberol: Replace an-druid1001 by an-druid1001 in druid connection strings [analytics/refinery] - https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604)
[09:16:15] Data-Platform-SRE, Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (CodeReviewBot) brouberol updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/543 Replace an-druid1001 by an-druid1001 in druid connectio...
[09:18:21] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:18:52] (PS3) Brouberol: Replace an-druid1001 by an-druid1002 in druid connection strings [analytics/refinery] - https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604)
[09:28:05] Data-Engineering, CX-cxserver, Citoid, Content-Transform-Team-WIP, and 9 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (elukey)
[09:34:12] (CR) Btullis: [C: +1] "I /think/ that these oozie configuration files are no longer used and can therefore be removed altogether, but that would be for another d" [analytics/refinery] - https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol)
[09:34:52] Data-Platform-SRE, Patch-For-Review: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (CodeReviewBot) brouberol merged https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/543 Replace an-druid1001 by an-druid1002 in druid connection...
[09:41:05] Data-Engineering, serviceops, Discovery-Search (Current work), Event-Platform, Patch-For-Review: Improve the flink-app chart to provide more useful defaults - https://phabricator.wikimedia.org/T346315 (Gehel) Open→Resolved
[09:44:00] stevemunene: can I help with reimaging the buster nodes of the public druid cluster?
[09:44:24] druid[1004-1008].eqiad.wmnet are still on buster
[09:50:30] druid100[4-6] are in the process of being decommissioned. we are also reimaging the newer nodes, probably before
[09:58:18] Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (jbond) @bking i took a look at cloudelastic1010 as i had thought this was in some broken state from the reimage cookbook. however from the puppet certs i can see its been around since `Nov 9 07:...
[09:58:47] oh ok, so we'd only need to reimage druid100[7-8]?
[10:00:22] if so, I can take one
[10:01:49] (tell me if you need/want me to)
[10:11:17] thanks brouberol, I was about to get started on druid1009. I don't think we can do 2 at a time
[10:12:14] yep, we shouldn't, as the segments are only replicated twice
[10:50:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[11:28:02] Data-Platform-SRE, Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (CodeReviewBot) stevemunene opened https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/544 switch druid host to index to the druid-public cluster and datahub injestion.
[11:55:18] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer) > having all the 'domain' in the header in general might be useful if someone wants/needs to do some custom filtering Yes, absolutely.
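The point made at [10:12:14], that two nodes should not be reimaged at once because segments are only replicated twice, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the host names, segment names and placement map are hypothetical, and the function is a plain set check, not Druid's API.

```python
from typing import Dict, List, Set


def unavailable_segments(placement: Dict[str, Set[str]], down: Set[str]) -> List[str]:
    """Return the segments whose every replica sits on a host in `down`."""
    return sorted(
        segment
        for segment, hosts in placement.items()
        if hosts and hosts <= down  # all replicas are on hosts that are down
    )


# Hypothetical placement: each segment has exactly two replicas.
placement = {
    "seg-1": {"host-a", "host-b"},
    "seg-2": {"host-b", "host-c"},
    "seg-3": {"host-a", "host-c"},
}

# One host down: every segment still has a live replica.
one_down = unavailable_segments(placement, {"host-a"})

# Two hosts down: any segment stored on exactly that pair is lost.
two_down = unavailable_segments(placement, {"host-a", "host-b"})

print(one_down, two_down)
```

With two replicas per segment, taking one host offline never removes the last copy, while taking two offline can, which is why the reimages are done one at a time.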
[12:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[12:22:33] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer) > having all the 'domain' in the header in general might be useful if someone wants/needs to do some custom filtering Yes, absolutely. I'll adapt the ACs.
[12:26:16] (EventgateValidationErrors) resolved: ...
[12:26:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[12:28:07] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer)
[12:29:20] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer)
[12:29:42] Data-Engineering: Utilize Kafka Headers - https://phabricator.wikimedia.org/T351089 (pfischer)
[12:53:58] Data-Platform-SRE, Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (mpopov) Not changing global config and instead just documenting the solution / how to use that setting temporarily for backfills sounds good to me. I think this is i...
[12:55:15] !log Rerun Airflow metadata_ingest_daily datahub job
[12:55:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:13:16] (EventgateValidationErrors) firing: ...
[13:13:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[13:18:21] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:18:57] (PS2) Bearloga: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[13:20:53] Data-Engineering, Observability-Logging, Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (BTullis) >>! In T351117#9335499, @Fabfur wrote: > Sure! The main task was https://phabricator.wikimedia.org/T323557 Hi @Fabfur, thanks for this. I can follow the...
[13:31:39] (CR) Bearloga: Create mediawiki/wiki_highlights_experiment (6 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[13:32:45] Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (BTullis) >>! In T351388#9340565, @mpopov wrote: > Not changing global config and instead just documenting the solution / how to use that setting temporarily for backfills sounds good to m...
[13:38:16] (EventgateValidationErrors) resolved: ...
[13:38:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[13:40:39] Data-Engineering, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Consider archiving Gerrit repository "analytics/pageview-api" (20120124) - https://phabricator.wikimedia.org/T351518 (Aklapper)
[13:40:49] Data-Engineering, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Consider archiving Gerrit repository "analytics/reportcard" and "analytics/reportcard/data" (20120810, 20160929) - https://phabricator.wikimedia.org/T351521 (Aklapper)
[13:40:57] Data-Engineering, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Consider archiving Gerrit repository "analytics/vagrant/build" (20130527) - https://phabricator.wikimedia.org/T351525 (Aklapper)
[13:58:06] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:58:21] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:03:21] (SystemdUnitFailed) firing: (5) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:05:10] Data-Engineering, Observability-Logging, Traffic: Move analytics log from Varnish to HAProxy - https://phabricator.wikimedia.org/T351117 (Fabfur) Hi @BTullis, thanks for your precious information, I'll try to summarize with some points below what we're trying to do now, as I can understand I was not...
[14:08:07] PROBLEM - Check systemd state on an-worker1100 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:33] Data-Platform-SRE, Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (BTullis) Open→Resolved
[14:29:47] (SystemdUnitFailed) firing: (5) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:30:07] Data-Platform-SRE: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (BTullis) Open→Resolved
[14:31:38] Data-Platform-SRE, Patch-For-Review: Bring an-mariadb100[12] into service - https://phabricator.wikimedia.org/T284150 (BTullis) Open→Resolved
[14:31:40] Data-Platform-SRE: Analytics coordinator failover improvements - https://phabricator.wikimedia.org/T280905 (BTullis)
[14:32:53] Data-Engineering, Diffusion-Repository-Administrators, Projects-Cleanup, Wikimedia-GitHub: Consider archiving Gerrit repository "analytics/vagrant/build" (20130527) - https://phabricator.wikimedia.org/T351525 (Ottomata) Please delete, this repo is not in use.
[14:35:37] Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (bking) @jbond Sorry for the confusion, I associated the reimage with the wrong ticket. The output of the last reimage is [[ https://phabricator.wikimedia.org/T350826#9338708 | here ]] . Puppet w...
[14:39:13] Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (jbond) @bking in order for me to investigate further i need either broken host to investigate or a way to replicate the issue.
[14:41:30] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:14] RECOVERY - Check systemd state on an-coord1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:42:36] PROBLEM - Check systemd state on an-worker1105 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:21] (SystemdUnitFailed) firing: (6) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:51:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:55:40] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:56:00] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:58:21] (SystemdUnitFailed) firing: (4) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:58:27] !log marked several failed tasks of datahub_ingestion DAG in Airflow, because the issues were fixed, added notes to the DAG itself
[14:58:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:00:34] RECOVERY - Check systemd state on an-worker1100 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:03:21] (SystemdUnitFailed) firing: (4) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:14] stevemunene: here's a PR exporting the kafka replication factor of each topic to prometheus. Based on this, we could graph it as well as alert on it. If we have a production topic with RF=1, any broker going down could bring partitions of this topic offline
[15:18:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/975291
[15:40:44] RECOVERY - Check systemd state on an-worker1105 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:43:21] (SystemdUnitFailed) firing: (2) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[16:05:49] (PS6) Milimetric: Introduce MostTranscludedPages.hql [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[16:06:09] (CR) Milimetric: "lol, thx" [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[16:07:54] Data-Platform-SRE, Discovery-Search: Reduce network impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking) Per today's IRC discussion in the security channel, @CDanis mentioned detuning or removing the LVS alerts for internal hosts. So I'll set this one to blocked at the moment....
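The concern raised at [15:18:14], that a production topic with RF=1 goes partly offline when any broker holding its partitions fails, suggests a simple check once replication factors are available as data. The following is a minimal sketch under stated assumptions: the topic names and values are hypothetical, the threshold of 2 is an assumed default, and the function works on a plain mapping rather than on any real Kafka or Prometheus API.

```python
from typing import Dict, List


def under_replicated_topics(replication_factors: Dict[str, int], minimum: int = 2) -> List[str]:
    """Return the names of topics whose replication factor is below `minimum`, sorted."""
    return sorted(
        topic
        for topic, rf in replication_factors.items()
        if rf < minimum
    )


# Hypothetical topics and replication factors, for illustration only.
example = {"topic-a": 3, "topic-b": 1, "topic-c": 2}

# Topics that a single broker failure could take offline.
flagged = under_replicated_topics(example)
print(flagged)
```

An alert built on the exported metric would express the same comparison; the sketch only shows the rule being checked.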
[16:09:13] Data-Platform-SRE, Discovery-Search: Reduce network impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking)
[16:09:22] Data-Platform-SRE, Discovery-Search: Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking)
[16:16:38] PROBLEM - Check systemd state on an-worker1120 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:21] (SystemdUnitFailed) firing: (2) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:30:43] Data-Platform-SRE: Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (BTullis)
[16:31:59] Data-Platform-SRE: Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (BTullis)
[16:32:01] Data-Engineering, Data-Platform-SRE: Migrate matomo to Debian bullseye (or bookworm) - https://phabricator.wikimedia.org/T349397 (BTullis)
[16:32:08] Data-Engineering (Sprint 5), Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (Ahoelzl)
[16:33:06] Data-Engineering (Sprint 5), Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (Ahoelzl)
[16:35:49] Data-Platform-SRE: Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (BTullis) p:Triage→High I believe that we can likely do this work as part of {T349397} since we will need to deploy a new VM for matomo anyway. We can probably upgrade straight...
[16:57:47] (CR) Milimetric: [V: +2 C: +2] Introduce MostTranscludedPages.hql (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: Ladsgroup)
[17:13:10] RECOVERY - Check systemd state on an-worker1120 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:18:21] (SystemdUnitFailed) firing: (2) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:42:27] Data-Engineering, Data-Platform-SRE, Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (BTullis) @sbassett - many thanks for your input regarding this Matomo plugin (and also the `TagManager` plugin) on T349910#933...
[18:40:46] Data-Platform-SRE, Discovery-Search: Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (bking)
[19:49:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:56:48] Data-Engineering, Event-Platform: Implement a new notification only revision-visibility-change stream - https://phabricator.wikimedia.org/T351565 (xcollazo)
[20:03:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.297% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[20:25:16] (EventgateValidationErrors) firing: ...
[20:25:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[20:30:55] Data-Engineering, Data-Platform-SRE, DC-Ops, SRE, ops-eqiad: Q1:rack/setup/install an-master100[3-4] - https://phabricator.wikimedia.org/T342291 (RobH)
[20:35:16] (EventgateValidationErrors) resolved: ...
[20:35:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[20:41:00] (EventgateValidationErrors) firing: ...
[20:41:01] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[20:45:46] (EventgateValidationErrors) resolved: ...
[20:45:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[21:18:21] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:50:14] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange