[00:08:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [01:18:22] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:50:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [04:08:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [05:19:48] (SystemdUnitFailed) firing: refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:57:52] PROBLEM - Check systemd state on an-worker1131 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:22] (SystemdUnitFailed) firing: (2) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:34:32] good morning, having a look at the refinery-druid-drop-public-snapshots.service above [07:50:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [07:55:19] RECOVERY - Check systemd state on an-worker1131 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:57:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:58:22] (SystemdUnitFailed) firing: (2) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:59:49] (SystemdUnitFailed) resolved: (2) refinery-druid-drop-public-snapshots.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:08:21] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:15:57] (PS3) Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) [08:16:34] (CR) CI reject: [V: -1] Create mediawiki/wiki_highlights_experiment 
[schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1) [08:18:15] (EventgateValidationErrors) firing: ... [08:18:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:23:15] (EventgateValidationErrors) resolved: ... [08:23:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:35:47] (PS4) Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) [08:53:15] (EventgateValidationErrors) firing: ... [08:53:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:42:29] (CR) Aqu: [V: +1 C: +1] "Looks good. Thanks." 
[analytics/refinery] - https://gerrit.wikimedia.org/r/975418 (owner: Mforns) [09:48:59] (CR) Aqu: Replace an-druid1001 by an-druid1002 in druid connection strings (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol) [10:58:58] Data-Platform-SRE, Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by stevemunene@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye [11:11:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:21:42] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:00] (CR) Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: Cyndywikime) [11:34:02] Data-Engineering (Sprint 5), Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis) I can verify that the statsd_exporter is now running on an-test-client1002 after the latest puppet run. No airflow specific metr... 
[11:37:54] Data-Engineering (Sprint 5), Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (BTullis) I can confirm that these are being scraped into prometheus as well. {F41521867} [[https://grafana-rw.wikimedia.org/explore?orgId... [11:48:40] Data-Platform-SRE, Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by stevemunene@cumin1001 for host druid1009.eqiad.wmnet with OS bullseye completed: - druid1009 (**WARN**) - Downtimed on Icinga/... [11:50:16] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [12:00:03] Data-Engineering (Sprint 5), Data-Platform-SRE, Patch-For-Review: [Data Platform] Deploy Spark History Service - https://phabricator.wikimedia.org/T330176 (BTullis) a:BTullis [12:00:26] Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (BTullis) Open→Resolved [12:08:16] (EventgateValidationErrors) resolved: ... 
[12:08:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:09:49] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:11:42] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:14:47] Data-Engineering (Sprint 5), Data-Platform-SRE, Patch-For-Review: [Data Platform] Stop and remove oozie services - https://phabricator.wikimedia.org/T341893 (BTullis) Good progress so far. I'd like to proceed to: Remove the oozie integration from hue | https://gerrit.wikimedia.org/r/c/operations/pup... [12:21:42] (SystemdUnitFailed) resolved: (2) export_smart_data_dump.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:24:18] Data-Platform-SRE, Patch-For-Review: Decommission druid100[4-6] - https://phabricator.wikimedia.org/T336043 (Stevemunene) Reimaged `druid1009` and the host is up with the right partitions and rebalancing [12:36:15] (EventgateValidationErrors) firing: ... 
[12:36:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [12:37:18] Quarry: Find somewhere else (not NFS) to store Quarry's resultsets - https://phabricator.wikimedia.org/T178520 (taavi) [13:55:37] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: WDQS graph split: load data from dumps into new hosts - https://phabricator.wikimedia.org/T347504 (bking) Update: The wikidata dump finished on wdqs1022 ( Wikidata dump loaded in 25 days, 13:32:17.263762) . But all 3 hosts are stuck at the moment... [13:59:38] (PS4) Clare Ming: Add custom schemas for 2 Android article instruments [schemas/event/secondary] - https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) [14:01:50] (CR) Clare Ming: "Per convo with @Phuedx, we will place custom schemas inside a `product_metrics` directory of the feature team's folder using their naming " [schemas/event/secondary] - https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: Clare Ming) [14:22:26] Data-Platform-SRE, Patch-For-Review: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [14:36:50] Data-Engineering, Data-Platform-SRE, Privacy Engineering, SecTeam-Processed: Enable the TagManager plugin for Matomo - https://phabricator.wikimedia.org/T349910 (BTullis) Resolved→Open I'm reopening this ticket, since we have noticed that the plugin does not yet function 
correctly. When... [14:41:16] (EventgateValidationErrors) resolved: ... [14:41:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [14:50:36] (CR) Joal: [C: +1] "Thanks so much Marcel for fixing this - The original fix was broken in the same way the problem was occurring (duplicate map keys :)" [analytics/refinery] - https://gerrit.wikimedia.org/r/975418 (owner: Mforns) [14:54:57] Data-Engineering, Data-Platform-SRE, Foundational Technology Requests: Enable the Marketing Campaigns Reporting plugin for matomo - https://phabricator.wikimedia.org/T319013 (sbassett) >>! In T319013#9341604, @BTullis wrote: > As it's relevant, I should also mention that I have created T351552, which... [15:04:54] Data-Engineering (Sprint 5): [Data Quality] Implement Simple Monitoring Dashboard for Airflow Jobs - https://phabricator.wikimedia.org/T349532 (CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add statsd as a dependency to our setup [15:05:01] Data-Engineering (Sprint 5), Data-Platform-SRE, Observability-Metrics: Configure Airflow to send metrics to Prometheus - https://phabricator.wikimedia.org/T343232 (CodeReviewBot) aqu updated https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/537 Add statsd as a depende... 
[15:07:56] (CR) Joal: "Indeed - Dead code - this patch can be abandoned (sorry for the extra work @brouberol)" [analytics/refinery] - https://gerrit.wikimedia.org/r/975206 (https://phabricator.wikimedia.org/T332604) (owner: Brouberol) [15:12:03] Data-Platform-SRE: Reduce impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (Gehel) [15:12:28] PROBLEM - Check systemd state on an-druid1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:42] (SystemdUnitFailed) firing: prometheus_puppet_agent_stats.timer Failed on an-worker1097:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:14:10] PROBLEM - Check systemd state on an-worker1097 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:46] PROBLEM - Check systemd state on analytics1071 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:14] PROBLEM - Check systemd state on an-worker1128 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:14] Data-Platform-SRE, Wikidata, Wikidata-Query-Service: Allow federated queries with the MiMoTextBase SPARQL endpoint - https://phabricator.wikimedia.org/T351488 (Gehel) [15:15:38] PROBLEM - Check systemd state on an-worker1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:16:28] PROBLEM - Check systemd 
state on an-worker1089 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:17:42] (SystemdUnitFailed) firing: (8) prometheus_puppet_agent_stats.timer Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:19:16] (EventgateValidationErrors) firing: ... [15:19:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:22:43] (SystemdUnitFailed) firing: (13) prometheus_puppet_agent_stats.timer Failed on an-master1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:27:43] (SystemdUnitFailed) firing: (17) prometheus_puppet_agent_stats.timer Failed on an-druid1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:28:31] query mysql from turnilo [15:28:37] wooops- wrong window :) [15:32:43] (SystemdUnitFailed) resolved: (17) prometheus_puppet_agent_stats.timer Failed on an-druid1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:12] (SystemdUnitFailed) 
firing: (3) prometheus_puppet_agent_stats.timer Failed on an-worker1109:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:16] (EventgateValidationErrors) resolved: ... [15:34:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:38:12] (SystemdUnitFailed) resolved: (17) prometheus_puppet_agent_stats.timer Failed on an-druid1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:39:12] (SystemdUnitFailed) firing: (6) prometheus_puppet_agent_stats.timer Failed on an-druid1004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:44:13] (SystemdUnitFailed) resolved: (16) prometheus_puppet_agent_stats.timer Failed on an-master1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:45:01] (PuppetConstantChange) resolved: Puppet performing a change on every puppet run on aqs1012:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [15:52:16] PROBLEM - Check systemd state 
on an-worker1113 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:55:12] (SystemdUnitFailed) firing: (12) prometheus_puppet_agent_stats.timer Failed on an-worker1084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:56:15] (EventgateValidationErrors) firing: ... [15:56:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:01:15] (EventgateValidationErrors) resolved: ... 
[16:01:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:05:12] (SystemdUnitFailed) firing: (2) export_smart_data_dump.service Failed on an-worker1113:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:06:03] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Launch two new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel) [16:09:49] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:10:28] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (bking) Resolved→In progress a:Gehel→bking [16:10:33] Data-Platform-SRE, Discovery-Search (Current work), Patch-For-Review: Ensure WDQS stack works on Bullseye - https://phabricator.wikimedia.org/T331300 (bking) [16:12:03] Data-Platform-SRE, Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (bking) Reopening per today's IRC conversation. We really need this process to be faster, so we'll try enabling the performance governor and... 
[16:26:07] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (Gehel) [16:28:02] Data-Platform-SRE, serviceops, Discovery-Search (Current work): Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 (Gehel) [16:35:37] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Launch 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel) [16:35:41] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Launch 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel) [16:35:54] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Launch 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (RKemper) [16:42:24] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) After merging, I am deploying the new docker image to beta by editing Puppet Configuration for [[ https://horizon.wiki... 
[16:43:20] !log reran Airflow's refine_webrequest_hourly_text::refine_webrequest with excluded_row_ids for 2023-11-19T21 [16:43:22] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:50:12] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on an-worker1113:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:11:04] Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [17:41:09] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, [DEPRECATED] wdwb-tech, and 2 others: Migrate the wdqs streaming updater flink jobs to flink-k8s-operator deployment model - https://phabricator.wikimedia.org/T326409 (bking) [17:44:43] Analytics, AQS2.0, Tech-Docs-Team, API Platform (AQS 2.0 Roadmap), and 4 others: AQS 2.0 documentation - https://phabricator.wikimedia.org/T288664 (WDoranWMF) [17:45:46] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) Ah, this failed. I believe I need to run a more up to date docker host node? Trying with bookworm. - Created a new... 
[17:48:31] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2087-2091 - https://phabricator.wikimedia.org/T349778 (Jhancock.wm) a:Jhancock.wm [17:50:03] Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye completed: - cloudelastic1010 (**PASS**) - Removed fr... [17:51:06] Data-Platform-SRE, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install elastic2092-2109 - https://phabricator.wikimedia.org/T349780 (Jhancock.wm) a:Jhancock.wm [17:52:38] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) I was successful in moving changeprop to a new cloud vps node. I updated the docs at https://wikitech.wikimedia.org/w... [18:03:37] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) I tried making a new deployment-cpjobqueue-1 node to keep things separate. Getting the same error again: `lang=json... [18:09:06] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) Looks like a bug in https://github.com/gwicke/tassembly/blob/master/tassembly.js#L526 with modern NodeJS? Dunno why w... [18:09:59] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) Giving up for this task. reverting cpjobqueue to old image in beta. 
[18:12:49] Data-Engineering (Sprint 5), Event-Platform, Patch-For-Review: [Event Platform] change propagation should discard canary events - https://phabricator.wikimedia.org/T351247 (Ottomata) Anyway, in summary, I believe we can deploy this in December after I'm back and before we enable canary events for all... [18:34:17] Data-Engineering, Data Products (Sprint 01): [Spike] Identify and mitigate risks associated with MediaWiki History pipeline - https://phabricator.wikimedia.org/T345208 (WDoranWMF) Open→Resolved [18:34:46] Data-Engineering, Data Pipelines (Sprint 14), Data Products (Sprint 01), Google-Chrome-User-Agent-Deprecation, Product-Analytics (Kanban): [SPIKE] Model impact of User-Agent deprecation on top line metrics - https://phabricator.wikimedia.org/T336084 (WDoranWMF) Open→Resolved [19:07:37] Data-Platform-SRE: Service implementation for wdqs1017-1021 - https://phabricator.wikimedia.org/T351671 (bking) [19:13:17] Data-Engineering (Sprint 5): [Maintenance] Understand and inventory change-propagation use cases, deployments, and custom business logic - https://phabricator.wikimedia.org/T350156 (Ottomata) Alright, first draft of this is on wikitech at https://wikitech.wikimedia.org/wiki/Changeprop/Memorandum-2023-11#Otto... 
[19:34:33] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans) [19:36:59] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1013.eqiad.wmnet with OS bullseye [19:40:11] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel) [19:41:55] Data-Platform-SRE, Wikidata, Wikidata-Query-Service, Discovery-Search (Current work): Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650 (Gehel) [19:43:33] Data-Platform-SRE, Discovery-Search (Current work): Load Wikidata split graphs into test servers - https://phabricator.wikimedia.org/T350465 (Gehel) [20:08:44] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1013.eqiad.wmnet with OS bullseye completed: - aqs1013 (**PASS**) - Downtimed on Icinga/Alertmanage... 
[20:09:37] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans) [20:09:49] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.295% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:10:33] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1014.eqiad.wmnet with OS bullseye [20:16:15] (EventgateValidationErrors) firing: ... [20:16:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:21:21] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1014.eqiad.wmnet with OS bullseye executed with errors: - aqs1014 (**FAIL**) - Downtimed on Icinga/... [20:21:40] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1014.eqiad.wmnet with OS bullseye [20:26:16] (EventgateValidationErrors) resolved: ... 
[20:26:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors
[20:32:12] Data-Engineering (Sprint 5): [Data Quality] Develop Airflow post processing instrumentation to collect and log configurable data metrics - https://phabricator.wikimedia.org/T349763 (gmodena) Summary discussions we had over call / slack / doc. I did some PoC to generate dataset metrics for `webrequests`. As...
[20:38:46] (CR) Conniecc1: Create mediawiki/wiki_highlights_experiment (5 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[20:47:07] Data-Platform-SRE: Service implementation for wdqs1017-1021 - https://phabricator.wikimedia.org/T351671 (bking) Let's leave wdqs1021 out for now, as we need it for performance testing in T351662
[20:47:40] Data-Platform-SRE: Service implementation for wdqs1017-1020 - https://phabricator.wikimedia.org/T351671 (bking)
[20:47:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on an-worker1156:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:51:47] PROBLEM - Check systemd state on an-worker1156 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:02:03] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1014.eqiad.wmnet with OS bullseye completed: - aqs1014 (**WARN**) - Removed from Puppet and PuppetD...
[21:02:37] Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans)
[21:11:21] Data-Platform-SRE: Upgrade matomo (piwik.wikimedia.org) to latest stable version - https://phabricator.wikimedia.org/T351552 (Aklapper)
[21:18:10] (PS5) Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613)
[21:18:51] (CR) CI reject: [V: -1] Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[21:19:20] (PS6) Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613)
[21:20:33] (CR) Conniecc1: Create mediawiki/wiki_highlights_experiment (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) (owner: Conniecc1)
[21:47:03] RECOVERY - Check systemd state on an-worker1156 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:47:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on an-worker1156:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed