[00:14:29] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 1.145e-06% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [00:27:28] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1008.wikimedia.org with OS bullseye executed with errors: - cloudelastic1008 (**FAIL**... [00:34:43] (SystemdUnitFailed) firing: (2) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:41:59] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [01:34:43] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:58:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.298% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:14:30] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 1.145e-06% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [04:25:01] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye [04:42:14] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [05:34:58] (SystemdUnitFailed) firing: (2) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:48:14] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by ryankemper@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye executed with errors: - cloudelastic1007 (**FAIL**... 
[06:34:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:37:11] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:44:05] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:48:11] PROBLEM - Check systemd state on kafka-jumbo1007 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:54:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:55:05] RECOVERY - Check systemd state on kafka-jumbo1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:39:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:49:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[07:58:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.298% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:10:16] (EventgateValidationErrors) firing: ... [08:10:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:14:30] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 1.145e-06% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [08:25:16] (EventgateValidationErrors) resolved: ... [08:25:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:30:45] (EventgateValidationErrors) firing: ... 
[08:30:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:40:46] (EventgateValidationErrors) resolved: ... [08:40:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [08:42:14] (PuppetFailure) firing: Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [08:43:16] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:44:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:53:46] (EventgateValidationErrors) firing: ... 
[08:53:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [09:40:48] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:44:43] (SystemdUnitFailed) firing: (3) monitor_refine_event.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:01:59] (PuppetFailure) firing: (2) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:08:33] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye [10:40:34] We're creating a report based on some expensive event db queries, and we haven't gotten the Superset dataset caching to work at all. Anyone here have hints for us? [10:41:59] (PuppetFailure) firing: (2) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [10:43:29] Ah, sorry. 
Superset results caching is off on purpose for now, because it didn't respect the access controls of the back-end database. We hope to be able to enable it soon, when we can upgrade to version 2 or 3 of superset. https://phabricator.wikimedia.org/T273850#9176858 [10:43:36] awight: ^^ [10:43:46] But that doesn't help you in the meantime. [10:43:54] btullis: Very helpful to know, though! [10:44:05] Ah btullis you're too fast :) I was about to give the same answer :) [10:44:24] :-) [10:45:31] I love when the answer suggests that I'm not the crazy one, anyway ;-) [10:48:46] BTW, we're open to using any platform--is Superset still the recommended place to build internal dashboards like mine, which would be backed by expensive event queries? [10:49:17] We have some reportupdater-queries which haven't run in a year but we don't understand whether it was due to deprecation or if there's just a config issue. [10:50:33] Here's my last breadcrumb about looking into the reportupdater logs: https://phabricator.wikimedia.org/T347758#9227639 [10:50:43] awight: What about using the staging mysql database as an intermediary? Could you somehow pre-retrieve them and store them to the staging database? [10:50:50] https://www.irccloud.com/pastebin/zy6G0OkI/ [10:52:26] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jbond@cumin1001 for host cloudelastic1007.wikimedia.org with OS bullseye completed: - cloudelastic1007 (**WARN**) - Removed fr... [10:54:13] The mysql_staging database is no longer exposed in SQL lab since this change T337056 but it is still accessible for charts. [10:55:13] btullis: Interesting! 
You've given me the idea to run the existing hive queries manually and send results to my personal schema, which seems to already be exposed through presto :-) [10:55:22] folks the new druid nodes are showing up with zero space left on the /srv partition, and https://phabricator.wikimedia.org/T336042 is closed.. What is the status? IIRC they should have been reimaged.. [10:55:36] (4 disks are not used at the moment) [10:56:44] stevemunene: ^^ did you do the reimage on these recently? [11:02:22] not yet btullis the decommissioningNodes mode was/is still in progress. waiting for a more stable druid status before the reimage. open to more opinions on the same topic https://phabricator.wikimedia.org/T336043#9334531 [11:05:41] OK, so the status of druid100[4-6] is that they're still up and running, but are depooled from LVS and have been set manually to be in decommissioning mode from the Druid web interface. Is that right? [11:05:42] stevemunene: I don't recall how the druid decom works, but if the segments are re-assigned to new nodes and there is no space it may not finish [11:07:00] (PuppetFailure) firing: (3) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:07:39] I agree with elukey - perhaps we should 'unset' decommissioning to allow the whole dataset to load across the cluster, then reimage druid10[09-11] with the 8 disk array before continuing. 
[11:08:20] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+2] Remove deprecated tech wish scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: 10WMDE-Fisch) [11:08:48] in theory IIRC the druid nodes don't really have any data on them, it is just a cache from hdfs, so their state should be revertable any time [11:08:54] (03Merged) 10jenkins-bot: Remove deprecated tech wish scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/973318 (https://phabricator.wikimedia.org/T350411) (owner: 10WMDE-Fisch) [11:09:33] yes that is right btullis , the druid decom should also be able to accommodate the change to the new cluster since none of them were close to the 1.3T max size. I also see the angle to undecom the druid100[4-6] nodes then do the reimage [11:13:22] PROBLEM - Check systemd state on an-coord1002 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:14:43] (SystemdUnitFailed) firing: (3) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:24] Right, but we've got 3 nodes with 0% free capacity on `/srv` at the moment, so I think we should probably address this as soon as possible. Would you be able to undo the decommissioning mode, then we can see if we get some free space on druid10[09-11]? [11:15:48] There are some instructions here: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Druid#Removing_hosts/_taking_hosts_out_of_service_from_cluster - I find the coordinator web interface way the simplest. 
[11:15:57] sure lemme get onto that [11:16:02] Thx [11:16:59] (PuppetFailure) firing: (3) Puppet has failed on an-tool1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:17:43] I have tried silencing this an-tool1005 puppet failure, but it hasn't taken for some reason. It can be ignored for now. [11:18:48] undone the druid decommissioningMode, giving it some time then we can start the reimage of the new hosts. [11:21:13] 10Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10JAllemandou) [11:21:45] 10Data-Engineering: NEW BUG REPORT Some DAG run attempts fail because File *_temporary/0 does not exist. - https://phabricator.wikimedia.org/T347076 (10JAllemandou) > Oh hey, looks like we're not alone https://community.cloudera.com/t5/Support-Questions/How-to-change-Spark-temporary-directory-when-writing-data/m... [11:36:20] 10Data-Engineering, 10Edit-Review-Improvements-Integrated-Filters, 10Growth-Team, 10Machine-Learning-Team, and 2 others: Integration of Revert Risk Scores to Recent Changes as a filter - https://phabricator.wikimedia.org/T329071 (10kostajh) There is active work on this in {T348298} [11:50:35] (03PS1) 10Lucas Werkmeister (WMDE): Remove deprecated tech wish scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/974243 (https://phabricator.wikimedia.org/T350411) [11:50:46] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] Remove deprecated tech wish scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/974243 (https://phabricator.wikimedia.org/T350411) (owner: 10Lucas Werkmeister (WMDE)) [11:51:21] (03CR) 10Phuedx: "Hrrm. 
How about placing them inside of product_metrics rather than having to introduce metrics_platform to differentiate between the new a" [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: 10Clare Ming) [11:51:35] (03Merged) 10jenkins-bot: Remove deprecated tech wish scripts [analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/974243 (https://phabricator.wikimedia.org/T350411) (owner: 10Lucas Werkmeister (WMDE)) [11:58:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:06:59] (PuppetFailure) firing: Puppet has failed on an-worker1097:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:12:43] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) I'm going to reprocess all of the available historical data, which means that we //may// get better geolocation data from the last 30 days. Our current co... [12:14:30] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 0.2858% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [12:16:22] (03CR) 10WMDE-Fisch: "TY!" 
[analytics/wmde/scripts] - 10https://gerrit.wikimedia.org/r/974243 (https://phabricator.wikimedia.org/T350411) (owner: 10Lucas Werkmeister (WMDE)) [12:16:59] (PuppetFailure) resolved: Puppet has failed on an-worker1097:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:17:57] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) Here are the options available when running the `usercountry:attribute` command. ` www-data@matomo1002:/usr/share/matomo$ ./console usercountry:attribute... [12:25:25] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) So far so good. ` www-data@matomo1002:/usr/share/matomo$ ./console usercountry:attribute 2023-10-17,2023-11-16 Re-attribution for date range: 2023-10-17 t... [12:26:59] (PuppetFailure) firing: (2) Puppet has failed on an-worker1097:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:31:50] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) Invalidated reports: ` www-data@matomo1002:/usr/share/matomo$ ./console core:invalidate-report-data --dates 2023-10-17,2023-11-16 Invalidating day periods... 
[12:36:59] (PuppetFailure) resolved: Puppet has failed on an-worker1113:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:41:15] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) That job completed successfully. Here is the summary section of the log. ` INFO [2023-11-16 12:38:54] 3502 SUMMARY INFO [2023-11-16 12:38:54] 3502 Total... [12:47:17] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) [12:54:00] (EventgateValidationErrors) firing: ... [12:54:06] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [13:22:17] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:35] !log stat1008: Add `sowiki`, `stwiki`, `tgwiki` and `ugwiki` to `/srv/published/datasets/one-off/research-mwaddlink/wikis.txt` (T340944) [13:22:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [13:22:39] T340944: The published dataset's list of wikis misses a couple of wikis with existing data - https://phabricator.wikimedia.org/T340944 [13:24:43] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:36:59] (PuppetFailure) firing: (2) Puppet has failed on an-worker1125:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [13:46:49] 10Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10BTullis) a:03BTullis [13:47:11] 10Data-Platform-SRE: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10BTullis) p:05Triage→03High [13:47:58] 10Data-Platform-SRE: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (10BTullis) a:03BTullis [13:48:29] (03PS3) 10Clare Ming: Add custom schemas for 2 Android article instruments [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) [13:49:37] (03CR) 10Clare Ming: Add custom schemas for 2 Android article instruments (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: 10Clare Ming) [13:51:55] PROBLEM - Check systemd state on an-worker1091 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:52:52] (03CR) 10Clare Ming: Add custom schemas for 2 Android article instruments (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/974674 (https://phabricator.wikimedia.org/T351292) (owner: 10Clare Ming) [13:54:17] PROBLEM - Check systemd state on an-presto1013 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:54:43] (SystemdUnitFailed) firing: (6) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:56:06] (03CR) 10Cyndywikime: Add analytics for Impressions, Success and Abandonment rate for temporary Users (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [14:01:59] (PuppetFailure) firing: (3) Puppet has failed on an-worker1085:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:03:48] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10BTullis) They look pretty good to me. I will mark this ticket as resolved, but await the outcome of further checking from @SCampos-WMF or @Ospingou to say whether... [14:11:58] 10Data-Platform-SRE, 10Patch-For-Review: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (10BTullis) @JAllemandou - I have prepared this patch https://gerrit.wikimedia.org/r/c/operations/puppet/+/974993 which I believe should implement your request. I u... 
[14:11:59] (PuppetFailure) firing: (3) Puppet has failed on an-worker1085:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:15:04] 10Data-Engineering, 10MediaWiki-General, 10Event-Platform: Undelete of page with same title leads to unexpected results - https://phabricator.wikimedia.org/T351411 (10pfischer) [14:21:59] (PuppetFailure) firing: (3) Puppet has failed on an-worker1085:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:26:59] (PuppetFailure) firing: (4) Puppet has failed on an-worker1085:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:31:42] 10Data-Platform-SRE, 10Patch-For-Review: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (10JAllemandou) >>! In T349523#9337350, @BTullis wrote: > I used the short form of the URI - e.g. `hdfs:///user/hive/warehouse` in the spark3-defaults file, which I b... [14:31:59] (PuppetFailure) firing: (8) Puppet has failed on an-worker1080:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:44:33] (03CR) 10Urbanecm: [C: 03+1] "LGTM now. Will merge once the MediaWiki counterpart is ready." 
[schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/962569 (https://phabricator.wikimedia.org/T300273) (owner: 10Cyndywikime) [14:45:13] !log rolling out 974993: Add spark.sql.warehouse.dir to spark3 defaults | https://gerrit.wikimedia.org/r/c/operations/puppet/+/974993 for T349523 [14:45:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [14:45:20] T349523: Update spark warehouse configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 [14:49:13] RECOVERY - Check systemd state on an-worker1091 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:33] RECOVERY - Check systemd state on an-presto1013 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:59] (PuppetFailure) firing: (10) Puppet has failed on an-worker1080:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:54:43] (SystemdUnitFailed) firing: (6) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:54:59] (PuppetFailure) firing: Puppet has failed on dse-k8s-worker1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:57:03] (PuppetFailure) firing: (14) Puppet has failed on an-presto1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [14:57:25] 10Data-Platform-SRE, 10Patch-For-Review: Update spark warehouse 
configuration to use the same as Hive - https://phabricator.wikimedia.org/T349523 (10BTullis) Marking as done. I will leave the ticket open for a day or two whilst we validate that it is OK. [14:57:59] (PuppetZeroResources) firing: Puppet has failed generate resources on an-worker1091:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [14:58:59] (PuppetFailure) firing: Puppet has failed on cephosd1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:01:11] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [15:01:34] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye [15:01:59] (PuppetFailure) firing: (14) Puppet has failed on an-presto1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:02:59] (PuppetZeroResources) firing: (2) Puppet has failed generate resources on an-presto1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:04:01] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye [15:04:59] (PuppetFailure) resolved: Puppet has failed on dse-k8s-worker1005:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:06:59] (PuppetFailure) firing: (14) Puppet has failed on an-presto1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:08:59] (PuppetFailure) resolved: Puppet has failed on cephosd1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:11:59] (PuppetFailure) resolved: (14) Puppet has failed on an-presto1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:12:37] 10Data-Platform-SRE: Update the GeoIP databases for matomo to use the same as the production pipelines - https://phabricator.wikimedia.org/T351242 (10SCampos-WMF) Thanks for updating this @BTullis. I have rechecked, and it appears that everything is functioning as intended. I believe the ongoing changes are stil... 
[15:12:59] (PuppetZeroResources) resolved: (2) Puppet has failed generate resources on an-presto1013:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:13:36] 10Data-Platform-SRE, 10Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10BTullis) I have now implemented this via https://gerrit.wikimedia.org/r/975006 I used the parameter: `spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version`... [15:21:22] headsup, I'm going to reimage an-druid1002 [15:22:52] 10Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brouberol@cumin1001 for host an-druid1002.eqiad.wmnet with OS bullseye [15:36:26] 10Data-Platform-SRE: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1008.wikimedia.org with OS bullseye completed: - cloudelastic1008 (**PASS**) - Removed fr... [15:55:55] 10Data-Platform-SRE: Upgrade the druid-analytics cluster to bullseye - https://phabricator.wikimedia.org/T332604 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brouberol@cumin1001 for host an-druid1002.eqiad.wmnet with OS bullseye completed: - an-druid1002 (**PASS**) - Downtimed on Icin... [15:56:09] an-druid1002 reimaging process is done. The zk ensemble is back to 3/3 [15:58:46] (EventgateValidationErrors) resolved: ... 
[15:58:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [15:58:57] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:00:15] (EventgateValidationErrors) firing: ... [16:00:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:10:16] (EventgateValidationErrors) resolved: ... 
[16:10:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [16:14:30] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 2.925% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [16:21:29] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:... [16:21:52] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye executed with errors:... 
[16:39:58] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [16:58:22] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye [17:06:34] 10Data-Engineering, 10MediaWiki-General, 10Event-Platform: Undelete of page with same title leads to unexpected results - https://phabricator.wikimedia.org/T351411 (10Krinkle) The "Delete page" and "Undelete" functionality is first and foremost a system to archive/restore revisions that additionally ensures... [17:16:53] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [17:18:21] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:... 
[17:19:41] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [17:20:08] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:... [17:24:02] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye [17:26:16] (EventgateValidationErrors) firing: ... [17:26:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:28:27] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1009.wikimedia.org with OS bullseye completed: - cloudela... 
[17:28:31] 10Data-Platform-SRE, 10Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10JAllemandou) Thanks a lot @BTullis - The problem you linked is indeed a known issue. We rely on hive-metastore and _SUCCESS files which should prevent the issue on p... [17:29:37] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... [17:36:16] (EventgateValidationErrors) resolved: ... [17:36:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [17:44:42] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [17:55:43] 10Data-Platform-SRE, 10Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10JAllemandou) After giving it some more thought, it seems that NOT changing the parameter globally to enforce data-correctness in folders is the best idea. We would... 
[17:57:03] 10Data-Platform-SRE, 10Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10JAllemandou) ping @mpopov , @xcollazo , @Ottomata and @Milimetric :) [18:14:59] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... [18:15:11] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [18:20:48] 10Data-Platform-SRE, 10Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10xcollazo) (IIRC, the change in the commit strategy proposed in `v2` was made to better support object stores, given that a `mv` in HDFS is cheap and just metadata,... [18:30:16] (EventgateValidationErrors) firing: ... [18:30:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:40:16] (EventgateValidationErrors) resolved: ... 
[18:40:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:44:10] 10Data-Platform-SRE, 10Discovery-Search (Current work): Test backfilling for cirrus-streaming-updater - https://phabricator.wikimedia.org/T350826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin2002 for host cloudelastic1010.wikimedia.org with OS bullseye executed with errors:... [18:46:00] (EventgateValidationErrors) firing: ... [18:46:01] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [18:50:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [18:53:22] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) [18:54:46] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:00:46] (EventgateValidationErrors) resolved: ... [19:00:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:04:11] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye executed with errors: - aqs1012 (**FAIL**) - Removed from Puppet... [19:10:29] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye [19:11:45] (EventgateValidationErrors) firing: ... [19:11:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:21:45] (EventgateValidationErrors) resolved: ... 
[19:21:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:24:15] (EventgateValidationErrors) firing: ... [19:24:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:29:15] (EventgateValidationErrors) resolved: ... [19:29:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:29:45] (EventgateValidationErrors) firing: ... 
[19:29:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:34:46] (EventgateValidationErrors) resolved: ... [19:34:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:47:45] (EventgateValidationErrors) firing: ... [19:47:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:52:45] (EventgateValidationErrors) resolved: ... 
[19:52:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [19:54:29] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1012.eqiad.wmnet with OS bullseye completed: - aqs1012 (**WARN**) - Removed from Puppet and PuppetD... [19:59:47] (DiskSpace) firing: Disk space an-web1001:9100:/srv 5.299% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=an-web1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [19:59:47] (DiskSpace) firing: (3) Disk space druid1009:9100:/srv 5.908% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:04:47] (DiskSpace) resolved: (3) Disk space druid1009:9100:/srv 5.981% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace [20:07:46] (EventgateValidationErrors) firing: ... 
[20:07:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:17:46] (EventgateValidationErrors) resolved: ... [20:17:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:19:15] (EventgateValidationErrors) firing: ... [20:19:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:22:10] (03Abandoned) 10Conniecc1: Create mediawiki/wiki_highlights_experiment [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966658 (https://phabricator.wikimedia.org/T348613) (owner: 10Conniecc1) [20:34:15] (EventgateValidationErrors) resolved: ... 
[20:34:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [20:36:14] (03PS1) 10Conniecc1: Create mediawiki/wiki_highlights_experiment This schema is required for instrumenting the Wiki Highlights experiment that the Inuka team is working on. Bug: T348613 [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/975079 (https://phabricator.wikimedia.org/T348613) [20:58:57] RECOVERY - Disk space on druid1010 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1010&var-datasource=eqiad+prometheus/ops [21:01:23] RECOVERY - Disk space on druid1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1011&var-datasource=eqiad+prometheus/ops [21:01:39] RECOVERY - Disk space on druid1009 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=druid1009&var-datasource=eqiad+prometheus/ops [21:43:15] (EventgateValidationErrors) firing: ... 
[21:43:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:48:15] (EventgateValidationErrors) resolved: ... [21:48:16] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:50:24] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10Jclark-ctr) [21:54:45] (EventgateValidationErrors) firing: ... [21:54:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [21:58:07] 10Data-Platform-SRE, 10Patch-For-Review: Add a spark global config for better file commit strategy - https://phabricator.wikimedia.org/T351388 (10Milimetric) +1 for leaving writing to Hive tables alone (and erring towards correctness and jobs failing and hopefully comments that we can find) +1 to instead focus... [22:04:45] (EventgateValidationErrors) resolved: ... 
[22:04:46] eventgate-analytics-external stream eventlogging_WMDEBannerSizeIssue validation errors detected in past 15 min - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos - https://alerts.wikimedia.org/?q=alertname%3DEventgateValidationErrors [22:09:28] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [22:17:10] 10Data-Engineering, 10Anti-Harassment, 10Growth-Team, 10MediaWiki-extensions-EventLogging, and 6 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (10phuedx) [22:17:56] 10Data-Platform-SRE, 10Discovery-Search: Reduce network impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (10bking) [22:18:13] 10Data-Platform-SRE, 10Discovery-Search: Reduce network impact of Elastic snapshots - https://phabricator.wikimedia.org/T351475 (10bking) [22:20:49] (03PS5) 10Milimetric: Introduce MostTranscludedPages.hql [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: 10Ladsgroup) [22:22:05] 10Data-Engineering, 10Anti-Harassment, 10Growth-Team, 10MediaWiki-extensions-EventLogging, and 6 others: [EPIC] Deprecate EventLogging::logEvent() - https://phabricator.wikimedia.org/T318263 (10phuedx) [22:23:59] (03CR) 10Milimetric: "new changes include outputting to a directory instead of a cassandra table and writing json. 
Now what's left to do is to change the Airfl" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: 10Ladsgroup) [22:50:59] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on matomo1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange [22:54:47] (SystemdUnitFailed) firing: (4) wmf_auto_restart_prometheus-mysqld-exporter.service Failed on an-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:55:32] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10bking) 05Open→03Resolved Confirmed working, closing... [23:20:52] 10Data-Engineering, 10MediaWiki-General, 10Event-Platform, 10MediaWiki-Platform-Team (Radar): Undelete of page with same title leads to unexpected results - https://phabricator.wikimedia.org/T351411 (10Krinkle) [23:29:39] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye executed with errors: - an-worke... 
[23:30:14] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1157.eqiad.wmnet with OS bullseye [23:33:55] (03CR) 10Ladsgroup: "Dan: You haven't uploaded the latest version :P" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/957899 (https://phabricator.wikimedia.org/T309738) (owner: 10Ladsgroup) [23:51:55] 10Data-Engineering, 10DC-Ops, 10SRE, 10ops-eqiad: Q2:rack/setup/install an-worker11[57-75] - https://phabricator.wikimedia.org/T349936 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host an-worker1158.eqiad.wmnet with OS bullseye