[00:32:43] <jinxer-wm>	 (SystemdUnitFailed) firing: monitor_refine_eventlogging_legacy_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:13:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[04:15:08] <wikibugs>	 (03PS3) 10Conniecc1: T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304
[04:15:38] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1)
[04:38:26] <wikibugs>	 (03PS1) 10Conniecc1: add latest.yaml [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966658
[04:39:53] <wikibugs>	 (03PS2) 10Conniecc1: T348613 add latest.yaml [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966658
[04:40:20] <wikibugs>	 (03Abandoned) 10Conniecc1: T348613 Add new wiki_highlights_experiments schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966304 (owner: 10Conniecc1)
[05:17:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) monitor_refine_event_test.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:42:51] <jinxer-wm>	 (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:43:51] <jinxer-wm>	 (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[05:47:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:10:51] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[07:01:36] <aqu>	 !log Started deploy [airflow-dags/analytics@5dcce3b]: Add missing MR in yesterday weekly train
[07:01:38] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:24:44] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): qlever dblp endpoint for wikidata federated query nomination - https://phabricator.wikimedia.org/T339347 (10Gehel) a:05Gehel→03None
[08:33:57] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 (10Gehel) p:05Triage→03High
[08:34:24] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10Gehel) p:05Triage→03Medium
[08:34:39] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Follow up on rdf-streaming-updater failure 2023-10-17 - https://phabricator.wikimedia.org/T349147 (10Gehel) p:05Triage→03High
[08:37:23] <wikibugs>	 (03CR) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx)
[08:39:13] <wikibugs>	 10Data-Platform-SRE: Address illegal reflective access on apifeatureusage* - https://phabricator.wikimedia.org/T348696 (10Gehel) p:05Triage→03Low
[08:40:08] <wikibugs>	 10Data-Platform-SRE: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 (10Gehel) p:05Triage→03Low
[08:40:33] <wikibugs>	 10Data-Platform-SRE, 10Discovery-Search: Track and clean up object storage used by rdf-streaming-updater - https://phabricator.wikimedia.org/T348685 (10Gehel) p:05Triage→03Medium
[08:46:20] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data-Services, 10cloud-services-team: Some wikibase tables not available in commonswiki_p - https://phabricator.wikimedia.org/T298452 (10Gehel) p:05Triage→03High
[08:46:43] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Enforce authentication and authorization for webrequest_* topics in Kafka jumbo-eqiad cluster - https://phabricator.wikimedia.org/T294264 (10Gehel) p:05Triage→03Medium
[08:47:56] <wikibugs>	 10Data-Platform-SRE: Enforce authentication for Druid datasources - https://phabricator.wikimedia.org/T255545 (10Gehel) p:05Triage→03Medium
[08:49:26] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search, and 2 others: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10Gehel) p:05Triage→03Medium
[08:49:36] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10Discovery-Search, and 2 others: [Epic] Set up multi DC Kafka stretch cluster - https://phabricator.wikimedia.org/T340492 (10Gehel)
[08:49:47] <wikibugs>	 10Data-Platform-SRE: Configure load-balancing approriate for ceph radosgw services on the data-engineering cluster - https://phabricator.wikimedia.org/T330153 (10Gehel) p:05Triage→03Low
[08:50:01] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10Gehel) p:05Triage→03Low
[08:50:21] <wikibugs>	 (03PS1) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966815 (https://phabricator.wikimedia.org/T344833)
[08:51:42] <wikibugs>	 (03PS20) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/952252 (https://phabricator.wikimedia.org/T344833)
[08:51:50] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: spark3 in yarn master mode exhibits warnings when the HDFS namenodes are in the failed over state - https://phabricator.wikimedia.org/T338137 (10Gehel) p:05Triage→03Low
[08:52:15] <wikibugs>	 (03Abandoned) 10Phuedx: Add analytics/metrics_platform/{app,web}/base schemas [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966815 (https://phabricator.wikimedia.org/T344833) (owner: 10Phuedx)
[08:54:50] <wikibugs>	 10Data-Engineering-Icebox, 10Data-Platform-SRE, 10Observability-Logging, 10Wikimedia-Logstash: Evaluate storing logs from applications in yarn with the typical logging infrastructure - https://phabricator.wikimedia.org/T300937 (10Gehel) 05Open→03Declined This seems like a major investment given the dat...
[08:55:34] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Send a critical alert to data-engineering if produce_canary_events isn't running correctly - https://phabricator.wikimedia.org/T337055 (10Gehel)
[08:55:36] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10Gehel)
[08:55:59] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE: Send a critical alert to data-engineering if produce_canary_events isn't running correctly - https://phabricator.wikimedia.org/T337055 (10Gehel) p:05Triage→03High
[08:56:09] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10Gehel)
[08:56:11] <wikibugs>	 10Data-Platform-SRE, 10observability, 10Epic: [Epic] Review alerting strategy for Data Platform SRE - https://phabricator.wikimedia.org/T346438 (10Gehel)
[08:56:49] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Observability-Alerting: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 (10Gehel) p:05Triage→03Medium
[08:58:17] <btullis>	 A quick reminder that I will be rebooting stat100[4,6,7,9] in the next few minutes.
[09:07:40] <btullis>	 !log rebooting stat1004
[09:07:41] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:14:17] <btullis>	 !log rebooting stat100[6-7]
[09:14:18] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[09:32:11] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Improve data-reload cookbook based on graph split needs - https://phabricator.wikimedia.org/T349011 (10dcausse) This is not a general problem, we want this particular order only when we want to be aligned with what we have in hdfs which is required in the following scen...
[09:35:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[09:47:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:11:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[10:40:24] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/5  Use minor versions for sy...
[10:40:43] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/5  Use minor versions for sy...
[11:05:10] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis opened https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/6  Fix the postinst script w...
[11:05:23] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10CodeReviewBot) btullis merged https://gitlab.wikimedia.org/repos/data-engineering/spark/-/merge_requests/6  Fix the postinst script w...
[11:30:09] <btullis>	 If anyone is around to give a quick review to this, I'd be grateful: https://gerrit.wikimedia.org/r/c/operations/puppet/+/966853
[11:30:46] <btullis>	 It's a noop to the production hadoop cluster, but will hopefully fix a problem with running multiple yarn shufflers in hadoop_test
[11:30:48] <btullis>	 Thanks.
[11:50:28] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[11:51:08] <wikibugs>	 10Data-Platform-SRE, 10Data Engineering and Event Platform Team: Airflow scheduler and webserver logs should be readable by airflow instance admins - https://phabricator.wikimedia.org/T304615 (10BTullis) a:05BTullis→03None
[12:11:05] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10DBA: Recover dbstore1007:s2 from the database provisioning service - https://phabricator.wikimedia.org/T343109 (10BTullis) >>! In T343109#9260834, @jcrespo wrote: > You will need to handle users and data- the s2 backup is for a production host, so you will need to r...
[12:28:12] <joal>	 dsaez: Hi Diego - another ping about files on HDFS - could you please reach out to me?
[12:37:02] <joal>	 btullis: Heya - I assume the problem of the spark shuffle on the test cluster is still ongoing?
[12:39:46] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10Stevemunene) >>! In T340648#9241654, @Manuel wrote: > Hi @Stevemunene and @BTullis, thank you again for making us aware of this important issue! It turns out that these timers are i...
[12:45:34] <btullis>	 joal: Yes.  I'm hoping that this will resolve it: https://gerrit.wikimedia.org/r/c/operations/puppet/+/966853
[12:46:04] <joal>	 ack btullis - would you mind sending an email to the alert mailing list telling folks that alerts from the test cluster are under control?
[12:46:16] <joal>	 We've received many failures since yesterday :)
[12:46:47] <btullis>	 I will. I wasn't sure that it was the spark shufflers until a little earlier today.
[12:47:50] <btullis>	 To be honest, I'm still not 100% sure that it was related to the spark shufflers and not yesterday's deploys but I think that the spark shufflers are more likely.
[12:48:25] <joal>	 the problem is impacting every job running on the test cluster - the deploy was not that impactful :)
[12:57:45] <wikibugs>	 10Data-Engineering, 10Tool-Pageviews, 10Data Products (Sprint 02): Mediarequests returning "file not found" for filenames with specific characters - https://phabricator.wikimedia.org/T347899 (10Sfaci) Ok! No worries! Just waiting for some sample data to test some edge cases before pushing a fix for all this....
[13:38:25] <wikibugs>	 10Data-Engineering, 10Data-Platform-SRE, 10Data Engineering and Event Platform Team, 10SRE, and 4 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10dcausse)
[13:47:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:49:52] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Wikimedia-production-error: Error: Call to a member function exists() on null (via EventBus PageChangeEventSerializer) - https://phabricator.wikimedia.org/T346355 (10lbowmaker)
[13:50:58] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10EventStreams, 10Event-Platform: Event streams don't respect milliseconds UTC unix epoch timestamp in since parameter - https://phabricator.wikimedia.org/T345606 (10lbowmaker)
[13:58:18] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: EventGate occasionally fails to ingest specific schemas - https://phabricator.wikimedia.org/T326002 (10phuedx) >>! In T326002#9260111, @Ottomata wrote: > Perhaps there is a race cond...
[14:04:48] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) The nodemanager isn't correctly loading the shuffler services. ` 2023-10-18 13:26:15,615 INFO org.apache.hadoop.util.Applica...
[14:11:06] <jinxer-wm>	 (HdfsTotalFilesHeap) firing: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[14:12:33] <wikibugs>	 10Data-Platform-SRE, 10Wikidata, 10Wikidata-Query-Service: Migrate staging rdf-streaming-updater to flink operator - https://phabricator.wikimedia.org/T349095 (10bking)      Create a savepoint by incrementing the nonce value in the helmfile.d/dse-k8s-services/values.yaml and deploy     Destroy the deployment...
[14:27:38] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata) Thanks Sam!  Also wondering how setInter...
[14:27:44] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata)
[14:27:54] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform, 10Patch-For-Review: Enable canary events for all streams - https://phabricator.wikimedia.org/T266798 (10dcausse) A quick comment to mention that consumers of fully idle streams (streams without canary events) a...
[14:28:06] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata)
[14:29:00] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata) I'm not entirely sure this is the right solution. Perhaps [[ https://phabricator.wikimedia.or...
[14:29:37] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventgate: cache refreshes should fetch stream configs in batches - https://phabricator.wikimedia.org/T346899 (10Ottomata)
[14:29:51] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10Ottomata)
[14:39:00] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: Deploy additional yarn shuffler services to support several versions of spark in parallel - https://phabricator.wikimedia.org/T344910 (10BTullis) I have got it to load the second two shufflers correctly, with the custom port numbers, but the first one seems to be ignori...
[14:43:10] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10CodeReviewBot) gmodena updated https://gitlab.wikimedia.org/repos/data-engineering/eventu...
[14:44:33] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye
[14:58:12] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena) >>! In T347282#9258113, @Ottomata wrote: > Q: at the moment, as is, we don't act...
[14:59:03] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: eventutilites-python: improve consistency guarantees of async process functions - https://phabricator.wikimedia.org/T347282 (10gmodena)
[14:59:35] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye executed with errors: - aqs1010 (**FAIL**)   - Downtimed on Icinga/...
[15:02:21] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye
[15:30:17] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10gmodena) a:03gmodena
[15:39:54] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10gmodena)
[15:44:13] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, 10Patch-For-Review: Enable snappy compression for Flink Kafka producers - https://phabricator.wikimedia.org/T345805 (10CodeReviewBot) gmodena merged https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-eve...
[15:46:04] <wikibugs>	 10Data-Platform-SRE, 10Cloud-VPS, 10SRE, 10cloud-services-team, 10ops-eqiad: Move cloudvirt-wdqs hosts - https://phabricator.wikimedia.org/T346948 (10taavi) @jclark-ctr these need a single NIC connected to the `cloud-hosts` as the primary VLAN, and `cloud-instances` and `cloud-private` VLANs trunked (we...
[15:47:30] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye executed with errors: - aqs1010 (**FAIL**)   - Removed from Puppet...
[15:49:41] <inflatador>	 gmodena btullis I'm in the process of removing the rdf-streaming-updater experiment from dse-k8s. Should we keep the flink-operator running in dse-k8s? I don't see anything currently using it besides the updater
[15:50:22] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye
[15:53:31] <wikibugs>	 (03PS3) 10Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767)
[15:56:21] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team, 10Event-Platform: mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10gmodena) a:03gmodena
[15:58:21] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10gmodena) >>! In T345806#9215806, @Ottomata wrote: > Would it be worth partitioning mediawiki.page_change.v1...
[15:59:25] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mediawiki.page_content_change.v1 topic should be partitioned. - https://phabricator.wikimedia.org/T345806 (10gmodena) Picking this up. I'll do T338231 first, since is a requriement. Moving T338231 into this sprint.
[16:00:38] <wikibugs>	 10Data-Engineering, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform: mw-page-content-change-enrich should (re)produce kafka keys - https://phabricator.wikimedia.org/T338231 (10gmodena)
[16:24:32] <wikibugs>	 10Data-Engineering, 10Data Pipelines, 10Data Engineering and Event Platform Team (Sprint 3), 10Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (10phuedx) >>! In T326002#9262081, @Ottomata wrote: >...
[16:33:17] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by eevans@cumin1001 for host aqs1010.eqiad.wmnet with OS bullseye completed: - aqs1010 (**WARN**)   - Removed from Puppet and PuppetD...
[16:53:58] <stevemunene>	 !log Add analytics-wmde service user to the Yarn production queue T340648
[16:54:01] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:54:01] <stashbot>	 T340648: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648
[16:59:46] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+2] Add a new init_mechanism to editattemptstep [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966289 (https://phabricator.wikimedia.org/T243641) (owner: 10DLynch)
[17:00:27] <wikibugs>	 (03Merged) 10jenkins-bot: Add a new init_mechanism to editattemptstep [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/966289 (https://phabricator.wikimedia.org/T243641) (owner: 10DLynch)
[17:05:28] <icinga-wm>	 PROBLEM - Check systemd state on an-airflow1007 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_airflow-kerberos@wmde.service,wmf_auto_restart_airflow-scheduler@wmde.service,wmf_auto_restart_airflow-webserver@wmde.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:05:52] <icinga-wm>	 PROBLEM - Checks that the local airflow scheduler for airflow @wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow jobs check --job-type SchedulerJob --hostname an-airflow1007.eqiad.wmnet did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:05:58] <icinga-wm>	 PROBLEM - Checks that the airflow database for airflow wmde is working properly on an-airflow1007 is CRITICAL: CRITICAL: /usr/bin/env AIRFLOW_HOME=/srv/airflow-wmde /usr/lib/airflow/bin/airflow db check did not succeed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Airflow
[17:06:32] <stevemunene>	 --^ We are currently working on an-airflow1007 with ryankemper 
[17:06:47] <jinxer-wm>	 (SystemdUnitCrashLoop) firing: (3)  crashloop on an-airflow1007:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[17:07:34] <wikibugs>	 10Data-Platform-SRE, 10Patch-For-Review: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9120ee3d-6b27-4a2d-bc4d-e4f592882343) set by stevemunene@cumin1001 for 1 day, 0:00:00 on 1 host(s) and their service...
[17:14:57] <stevemunene>	 o/ mforns, we're working on the WMDE airflow instance with ryankemper and have run into some errors on the deploy host. Are you available for a quick look?
[17:15:22] <stevemunene>	 https://www.irccloud.com/pastebin/YjA466Py/
[17:15:37] <mforns>	 hi stevemunene! yes, I can try to help. 
[17:16:01] <stevemunene>	 cool, we're on https://meet.google.com/tgm-vnmy-tth
[17:17:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:19:40] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:23:28] <stevemunene>	 o/ xcollazo 
[17:30:17] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:43] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:43:06] <tchin>	 !log deploying mw-page-content-change-enrich
[17:43:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:47:29] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (10Eevans)
[17:55:10] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans)
[17:55:38] <wikibugs>	 10Data-Platform-SRE, 10Cassandra: AQS fails on Debian Bullseye (Node 12) - https://phabricator.wikimedia.org/T349228 (10Eevans) p:05Triage→03High
[17:55:51] <jinxer-wm>	 (HdfsTotalFilesHeap) resolved: Total files on the analytics-hadoop HDFS cluster are more than the heap can support. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts#HDFS_total_files_and_heap_size - https://grafana.wikimedia.org/d/000000585/hadoop?var-hadoop_cluster=analytics-hadoop&orgId=1&panelId=28&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DHdfsTotalFilesHeap
[17:57:34] <btullis>	 I have reverted the change to the multiple spark shufflers on the test cluster.
[18:03:24] <stevemunene>	 !log revert Add analytics-wmde service user to the Yarn production queue T340648
[18:03:26] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans) p:Medium→High
[18:03:27] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:03:27] <stashbot>	 T340648: [Airflow] Setup Airflow instance for WMDE - https://phabricator.wikimedia.org/T340648
[18:05:10] <wikibugs>	 Data-Platform-SRE, Cassandra: Upgrade AQS cluster to Bullseye - https://phabricator.wikimedia.org/T347738 (Eevans)
[18:08:15] <wikibugs>	 (CR) Joal: [C: +1] "LGTM! Thanks Dan for having taken the time to read through my over-engineered code :D" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: Milimetric)
[19:08:44] <wikibugs>	 (PS1) Milimetric: Update schema of mediawiki_wikitext_* [analytics/refinery] - https://gerrit.wikimedia.org/r/966914 (https://phabricator.wikimedia.org/T348767)
[19:10:35] <wikibugs>	 (PS4) Milimetric: Improve fidelity of dumps import [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767)
[19:12:10] <wikibugs>	 (CR) Milimetric: "we will decide on optimization skipping JSON serialization later, for now this job is being tested on enwiki/rowiki/simplewiki and looking" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/965792 (https://phabricator.wikimedia.org/T348767) (owner: Milimetric)
[20:00:05] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (Ottomata) > I like the simplicity of this proposal...
[20:38:49] <wikibugs>	 Data-Engineering, Data Pipelines, Data Engineering and Event Platform Team (Sprint 3), Event-Platform, and 2 others: eventgate-wikimedia occasionally fails to produce events due to stream config errors - https://phabricator.wikimedia.org/T326002 (Ottomata) > I might be able to add caching of matc...
[21:27:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[21:32:58] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) monitor_refine_event_sanitized_main_test_immediate.service Failed on an-test-coord1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:52:28] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:45:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[23:49:57] <wikibugs>	 Data-Engineering, MediaWiki-Page-deletion, MediaWiki-extensions-UserMerge, Event-Platform, and 4 others: ArticleDeleteComplete and PageDeleteComplete hooks receive a WikiPage with inconsistent redirect data - https://phabricator.wikimedia.org/T348881 (Jdforrester-WMF) Open→Resolved
[23:58:23] <wikibugs>	 Data-Engineering, Abstract Wikipedia team, CX-cxserver, Citoid, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (CodeReviewBot) jforrester opened https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-orchestrator/-/me...
[23:59:33] <wikibugs>	 Data-Engineering, Abstract Wikipedia team, CX-cxserver, Citoid, and 8 others: Migrate node-based services in production to node18 - https://phabricator.wikimedia.org/T349118 (CodeReviewBot) jforrester opened https://gitlab.wikimedia.org/repos/abstract-wiki/wikifunctions/function-evaluator/-/merge...