[00:32:43] (SystemdUnitFailed) firing: monitor_refine_eventlogging_legacy_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:17:42] (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:13:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [04:15:08] (03PS3) 10Conniecc1: T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 [04:15:38] (03CR) 10CI reject: [V: 04-1] T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1) [04:38:26] (03PS1) 10Conniecc1: add latest.yaml [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966658 [04:39:53] (03PS2) 10Conniecc1: T348613 add latest.yaml [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966658 [04:40:20] (03Abandoned) 10Conniecc1: T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1) [05:17:42] (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:42:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [05:43:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [05:47:42] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:10:51] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [07:01:36] !log Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train [07:01:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [08:24:44] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Gehel) a:05Gehel→03None [08:33:57] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 (10Gehel) p:05Triage→03High [08:34:24] 10Data-Platform-SRE, 10Patch-For-Review: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10Gehel) p:05Triage→03Medium [08:34:39] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10Gehel) p:05Triage→03High [08:37:23] (03CR) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [08:39:13] 10Data-Platform-SRE: Address illegal reflective access on apifeatureusage* - https://phabricator.wikimedia.org/T348696 (10Gehel) p:05Triage→03Low [08:40:08] 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10Gehel) p:05Triage→03Low [08:40:33] 10Data-Platform-SRE, 10Discovery-Search: Track and clean up object storage used by 
rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (10Gehel) p:05Triage→03Medium [08:46:20] 10Data-Engineering, 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Some wikibase tables not available in commonswiki_p - https://phabricator.wikimedia.org/T298452 (10Gehel) p:05Triage→03High [08:46:43] 10Data-Engineering, 10Data-Platform-SRE: Enforce authentication and authorization for webrequest_* topics in Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T294264 (10Gehel) p:05Triage→03Medium [08:47:56] 10Data-Platform-SRE: Enforce authentication for Druid datasources - https://phabricator.wikimedia.org/T255545 (10Gehel) p:05Triage→03Medium [08:49:26] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search, and 2 others: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10Gehel) p:05Triage→03Medium [08:49:36] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search, and 2 others: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10Gehel) [08:49:47] 10Data-Platform-SRE: Configure load-balancing approriate for ceph radosgw services on the data-engineering cluster - https://phabricator.wikimedia.org/T330153 (10Gehel) p:05Triage→03Low [08:50:01] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10Gehel) p:05Triage→03Low [08:50:21] (03PS1) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966815 (https://phabricator.wikimedia.org/T344833) [08:51:42] (03PS20) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) [08:51:50] 
10Data-Engineering, 10Data-Platform-SRE: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (10Gehel) p:05Triage→03Low [08:52:15] (03Abandoned) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966815 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx) [08:54:50] 10Data-Engineering-Icebox, 10Data-Platform-SRE, 10Observability-Logging, 10Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (10Gehel) 05Open→03Declined This seems like a major investment given the dat... [08:55:34] 10Data-Engineering, 10Data-Platform-SRE: Send a critical alert to data-engineering if produce_canary_events isn't running correctly - https://phabricator.wikimedia.org/T337055 (10Gehel) [08:55:36] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10Gehel) [08:55:59] 10Data-Engineering, 10Data-Platform-SRE: Send a critical alert to data-engineering if produce_canary_events isn't running correctly - https://phabricator.wikimedia.org/T337055 (10Gehel) p:05Triage→03High [08:56:09] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10Gehel) [08:56:11] 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10Gehel) [08:56:49] 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10Gehel) p:05Triage→03Medium [08:58:17] A quick reminder that I will be 
rebooting stat100[4,6,7,9] in the next few minutes. [09:07:40] !log rebooting stat1004 [09:07:41] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:14:17] !log rebooting stat100[6-7] [09:14:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [09:32:11] 10Data-Platform-SRE, 10Patch-For-Review: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10dcausse) This is not a general problem, we want this particular order only when we want to be aligned with what we have in hdfs which is required in the following scen... [09:35:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [09:47:43] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. 
- https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [10:40:24] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/5 Use minor versions for sy... [10:40:43] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/5 Use minor versions for sy... [11:05:10] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/6 Fix the postinst script w... [11:05:23] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/6 Fix the postinst script w... [11:30:09] If anyone is around to give a quick review to this, I'd be grateful: https://gerrit.wikimedia.org/r/c/operations/puppet/+/966853 [11:30:46] It's a noop to the production hadoop cluster, but will hopefully fix a problem with running multiple yarn shufflers in hadoop_test [11:30:48] Thanks. 
[11:50:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [11:51:08] 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10BTullis) a:05BTullis→03None [12:11:05] 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) >>! In T343109#9260834, @jcrespo wrote: > You will need to handle users and data- the s2 backup is for a production host, so you will need to r... [12:28:12] dsaez: Hi Diego - another ping about files on HDFS - could you please reach out to me? [12:37:02] btullis: Heya - I assume the problem of the spark shuffle on the test cluster is still ongoing? [12:39:46] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) >>! In T340648#9241654, @Manuel wrote: > Hi @Stevemunene and @BTullis, thank you again for making us aware of this important issue! It turns out that these timers are i... [12:45:34] joal: Yes. I'm hoping that this will resolve it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/966853 [12:46:04] ack btullis - would you mind sending an email to the alert mailing list telling folks that alerts from the test cluster are under control? [12:46:16] We've received many failures since yesterday :) [12:46:47] I will. I wasn't sure that it was the spark shufflers until a little earlier today.
[12:47:50] To be honest, I'm still not 100% sure that it was related to the spark shufflers and not yesterday's deploys but I think that the spark shufflers are more likely. [12:48:25] the problem is impacting every job running on the test cluster - the deploy was not that impactful :) [12:57:45] 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) Ok! No worries! Just waiting for some sample data to test some edge cases before pushing a fix for all this.... [13:38:25] 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10dcausse) [13:47:43] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:49:52] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Wikimedia-production-error: Error: Call to a member function exists() on null (via EventBus PageChangeEventSerializer) - https://phabricator.wikimedia.org/T346355 (10lbowmaker) [13:50:58] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (10lbowmaker) [13:58:18] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10phuedx) >>! In T326002#9260111, @Ottomata wrote: > Perhaps there is a race cond...
[14:04:48] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) The nodemanager isn't correctly loading the shuffler services. ` 2023-10-18 13:26:15,615 INFO org.apache.hadoop.util.Applica... [14:11:06] (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [14:12:33] 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10bking) Create a savepoint by incrementing the nonce value in the helmfile.d/dse-k8s-services/values.yaml and deploy Destroy the deployment... [14:27:38] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) Thanks Sam! Also wondering how setInter... [14:27:44] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) [14:27:54] 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10dcausse) A quick comment to mention that consumers of fully idle streams (streams without canary events) a... 
[14:28:06] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) [14:29:00] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata) I'm not entirely sure this is the right solution. Perhaps [[ https://phabricator.wikimedia.or... [14:29:37] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata) [14:29:51] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) [14:39:00] 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I have got it to load the second two shufflers correctly, with the custom port numbers, but the first one seems to be ignori... [14:43:10] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/eventu... 
[14:44:33] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye [14:58:12] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) >>! In T347282#9258113, @Ottomata wrote: > Q: at the moment, as is, we don't act... [14:59:03] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) [14:59:35] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye executed with errors: - aqs1010 (**FAIL**) - Downtimed on Icinga/... [15:02:21] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye [15:30:17] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. 
- https://phabricator.wikimedia.org/T345806 (10gmodena) a:03gmodena [15:39:54] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10gmodena) [15:44:13] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-eve... [15:46:04] 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) @jclark-ctr these need a single NIC connected to the `cloud-hosts` as the primary VLAN, and `cloud-instances` and `cloud-private` VLANs trunked (we... [15:47:30] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye executed with errors: - aqs1010 (**FAIL**) - Removed from Puppet... [15:49:41] gmodena btullis I'm in process of removing the rdf-streaming-updater experiment from dse-k8s . Should we keep the flink-operator running in dse-k8s? 
I don't see anything currently using it besides the updater [15:50:22] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye [15:53:31] (03PS3) 10Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) [15:56:21] 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10gmodena) a:03gmodena [15:58:21] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10gmodena) >>! In T345806#9215806, @Ottomata wrote: > Would it be worth partitioning mediawiki.page_change.v1... [15:59:25] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10gmodena) Picking this up. I'll do T338231 first, since is a requriement. Moving T338231 into this sprint. [16:00:38] 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10gmodena) [16:24:32] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10phuedx) >>! In T326002#9262081, @Ottomata wrote: >... 
[16:33:17] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye completed: - aqs1010 (**WARN**) - Removed from Puppet and PuppetD... [16:53:58] !log Add analytics-wmde service user to the Yarn production queue T340648 [16:54:01] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [16:54:01] T340648: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 [16:59:46] (03CR) 10Bartosz Dziewoński: [C: 03+2] Add a new init_mechanism to editattemptstep [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966289 (https://phabricator.wikimedia.org/T243641) (owner: 10DLynch) [17:00:27] (03Merged) 10jenkins-bot: Add a new init_mechanism to editattemptstep [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966289 (https://phabricator.wikimedia.org/T243641) (owner: 10DLynch) [17:05:28] PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@wmde.service,wmf_auto_restart_airflow-scheduler@wmde.service,wmf_auto_restart_airflow-webserver@wmde.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:05:52] PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:05:58] PROBLEM - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check did not succeed 
https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow [17:06:32] --^ We are currently working on an-airflow1007 with ryankemper [17:06:47] (SystemdUnitCrashLoop) firing: (3) crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop [17:07:34] 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9120ee3d-6b27-4a2d-bc4d-e4f592882343) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their service... [17:14:57] o/ mforns we're working on the WMDE airflow instance with ryankemper and have run into some errors on the deploy host. Are you available for a quick look? [17:15:22] https://www.irccloud.com/pastebin/YjA466Py/ [17:15:37] hi stevemunene! yes, I can try to help. [17:16:01] cool, we're on https://meet.google.com/tgm-vnmy-tth [17:17:43] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:19:40] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:23:28] o/ xcollazo [17:30:17] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:43] (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:43:06] !log deploying mw-page-content-change-enrich [17:43:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [17:47:29] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [17:55:10] 10Data-Platform-SRE, 10Cassandra: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) [17:55:38] 10Data-Platform-SRE, 10Cassandra: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) p:05Triage→03High [17:55:51] (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap [17:57:34] I have reverted the change to the multiple spark shufflers on the test cluster. [18:03:24] !log revert Add analytics-wmde service user to the Yarn production queue T340648 [18:03:26] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) p:05Medium→03High [18:03:27] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:03:27] T340648: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 [18:05:10] 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans) [18:08:15] (03CR) 10Joal: [C: 03+1] "LGTM! 
Thanks Dan for having taken the time to read through my over-engineered code :D" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [19:08:44] (03PS1) 10Milimetric: Update schema of mediawiki_wikitext_* [analytics/refinery] - 10https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767) [19:10:35] (03PS4) 10Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) [19:12:10] (03CR) 10Milimetric: "we will decide on optimization skipping JSON serialization later, for now this job is being tested on enwiki/rowiki/simplewiki and looking" [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: 10Milimetric) [20:00:05] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) > I like the simplicity of this proposal... [20:38:49] 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) > I might be able to add caching of matc... 
[21:27:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [21:32:58] (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:52:28] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:45:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage [23:49:57] 10Data-Engineering, 10MediaWiki-Page-deletion, 10MediaWiki-extensions-UserMerge, 10Event-Platform, and 4 others: ArticleDeleteComplete and PageDeleteComplete hooks receive a WikiPage with inconsistent redirect data - https://phabricator.wikimedia.org/T348881 (10Jdforrester-WMF) 05Open→03Resolved [23:58:23] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10CodeReviewBot) jforrester opened 
https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/me... [23:59:33] 10Data-Engineering, 10Abstract Wikipedia team, 10CX-cxserver, 10Citoid, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (10CodeReviewBot) jforrester opened https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/-/merge...