[00:34:50] (SystemdUnitFailed) firing: (8) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:54] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_legacy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:38:31] (SystemdUnitFailed) firing: (8) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:44:28] hi folks!
[07:44:39] going to keep restarting jumbo brokers to pick up new TLS certs
[07:53:31] (SystemdUnitFailed) firing: (8) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:01:15] !log fix old envoyproxy monitor for an-test-ui1001
[08:01:17] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:31:41] Ack elukey
[09:21:14] kafka jumbo runs on PKI now!
[09:21:22] it took a while but we are finally there :)
[09:30:47] steve_munene: if you have doubts/questions/etc.. about Kafka and PKI please lemme know
[09:30:59] we can discuss everything in here anytime
[09:31:01] :)
[10:16:12] steve_munene: another thing - there are several alerts in https://alerts.wikimedia.org related to Data Engineering
[10:16:24] do you have time to follow up on them?
[10:16:33] I can help of course :)
[10:17:04] (nothing major afaics but it causes some confusion in the summary page)
[10:36:40] Yay Kafka on PKI \o/ Thanks elukey
[10:36:52] Having a look at the alerts
[11:53:31] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:28:29] elukey: thank you so much!
[12:59:55] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:03:36] (SystemdUnitFailed) resolved: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:06:31] (SystemdUnitFailed) firing: (2) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:11:36] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:16:31] (SystemdUnitFailed) resolved: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:17:31] (SystemdUnitFailed) firing: (6) hadoop-yarn-nodemanager.service Failed on an-test-worker1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:21:47] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:31:16] ottomata: \o/
[13:53:14] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (Vgutierrez)
[14:12:25] ottomata: o/ me and steve_munene were trying to see why the yarn-nodemanager doesn't work on an-test-worker1001, and it seems due to the absence of the spark2 package, that contains the spark shuffler jar that yarn uses
[14:16:05] is there a plan to add it from another source?
[14:20:32] o/ on the dse-k8s cluster some prometheus metrics like container_memory_usage_bytes do not seem to be collected for my namespace (rdf-streaming-updater), it's visible for some tho (spark-operator and few others)
[14:26:12] dcausse: weird, it should collect those metrics automatically
[14:32:17] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:47:16] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:02:40] elukey: re an-test-worker, I think btullis is working on the bullseye upgrade, and I think I saw that he was not going to include spark2 for bullseye
[16:02:41] but
[16:02:56] maybe we have to finish the spark3 upgrade first? and actually use the spark3 shuffler?
[16:04:07] ah okok, maybe we could remove the option from yarn-site temporarily, but not sure what/if/how the node manager uses the shuffler or not
[16:37:24] Data-Engineering-Planning, Data Pipelines (Sprint 11): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (JArguello-WMF)
[16:42:54] Data-Engineering-Planning, Data Pipelines (Sprint 11): Setup config to allow lineage instrumentation - https://phabricator.wikimedia.org/T333004 (JArguello-WMF)
[16:46:52] Data-Engineering-Planning, Data Pipelines: Airflow ArchiveOperator should have a number of retries of 0 - https://phabricator.wikimedia.org/T332216 (JArguello-WMF)
[16:47:07] Data-Engineering-Planning, Data Pipelines (Sprint 11): Airflow ArchiveOperator should have a number of retries of 0 - https://phabricator.wikimedia.org/T332216 (JArguello-WMF)
[16:51:25] Data-Engineering-Planning, Data Pipelines (Sprint 11): Support for moving data from HDFS to public http file server - https://phabricator.wikimedia.org/T317167 (lbowmaker)
[16:55:01] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:55:58] Data-Engineering-Planning, Data Pipelines (Sprint 11): Delete empty tables unique_devices_*_wide_* - https://phabricator.wikimedia.org/T329978 (JArguello-WMF) a:JAllemandou→None
[17:11:19] (PS6) Urbanecm: Add analytics/mediawiki/mentor_dashboard/personalized_praise [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117)
[17:11:22] (CR) Urbanecm: Add analytics/mediawiki/mentor_dashboard/personalized_praise (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/891368 (https://phabricator.wikimedia.org/T325117) (owner: Urbanecm)
[17:45:50] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper for - https://phabricator.wikimedia.org/T331283 (Ottomata)
[17:45:57] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (Ottomata)
[17:47:54] Data-Engineering, serviceops-radar, Event-Platform Value Stream (Sprint 11): Store Flink HA metadata in Zookeeper - https://phabricator.wikimedia.org/T331283 (Ottomata) @dcausse @bking we aren't quite ready to test this for mediawiki-page-content-change-enrichment (need T330693 before we can even do...
[18:16:36] Data-Engineering, Event-Platform Value Stream (Sprint 11): eventutilities-python: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena)
[18:19:15] Data-Engineering, Event-Platform Value Stream (Sprint 11): mediawiki-event-enrichment issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena)
[18:30:43] Data-Engineering, Event-Platform Value Stream (Sprint 11): mediawiki-event-enrichment issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena) > Next step will be measuring latency/throughput on YARN and possibly tune settings (batch size, thread pool size). If...
[18:31:08] Data-Engineering, Event-Platform Value Stream (Sprint 11): mediawiki-event-enrichment: issue async requests from MapFunction context - https://phabricator.wikimedia.org/T332948 (gmodena)
[20:56:16] (SystemdUnitFailed) firing: (6) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:12:18] Data-Engineering, DBA, Infrastructure-Foundations, Machine-Learning-Team, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (bking)
[23:10:01] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:11:16] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:26:17] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:30:01] (SystemdUnitFailed) firing: (7) monitor_refine_eventlogging_legacy.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed