[00:01:08] Data-Engineering-Planning, DC-Ops, Data-Platform-SRE, SRE, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bull... [00:06:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:21] (PS1) Milimetric: Release 2.10.1 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/930736 [00:09:11] (CR) Milimetric: [C: +2] Release 2.10.1 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/930736 (owner: Milimetric) [00:10:34] (Merged) jenkins-bot: Release 2.10.1 [analytics/wikistats2] - https://gerrit.wikimedia.org/r/930736 (owner: Milimetric) [00:11:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:45] (SystemdUnitFailed) firing: (6) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:09] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:21:38] Data-Engineering, Data-Engineering-Wikistats: Indicate that some country data are unavailable on
Wikistats - https://phabricator.wikimedia.org/T339318 (stjn) [00:23:23] Data-Engineering, Data-Engineering-Wikistats, Russian-Sites: Indicate that some country data are unavailable on Wikistats - https://phabricator.wikimedia.org/T339318 (stjn) [00:24:41] Data-Engineering, Data-Engineering-Wikistats, Russian-Sites: Indicate that some country data are unavailable on Wikistats - https://phabricator.wikimedia.org/T339318 (stjn) [00:25:58] Data-Engineering, Data-Engineering-Wikistats, Russian-Sites: Indicate that some country data are unavailable on Wikistats - https://phabricator.wikimedia.org/T339318 (stjn) [00:27:10] Data-Engineering, Data-Engineering-Wikistats: "Active editors by country" doesn't display numbers for Belarus, Kazakhstan, Russia - https://phabricator.wikimedia.org/T333716 (stjn) This is because Russia, Belarus and Kazakhstan are in https://wikitech.wikimedia.org/wiki/Country_protection_list This will... [00:30:51] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:36:45] (SystemdUnitFailed) firing: (6) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:57:28] Data-Engineering-Planning, DC-Ops, Data-Platform-SRE, SRE, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye...
[01:06:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:11:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:06:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:11:45] (SystemdUnitFailed) firing: (5) refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:45] (SystemdUnitFailed) firing: (6) monitor_refine_event_sanitized_main_immediate.service 
Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:26:45] (SystemdUnitFailed) firing: (7) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:06:45] (SystemdUnitFailed) firing: (8) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:16:45] (SystemdUnitFailed) firing: (9) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:06:45] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:06:45] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:23:37] Data-Engineering, Data Pipelines, Event-Platform Value Stream (Sprint 14 B): Fix wikimedia-event-utilities Guava dependencies issues
- https://phabricator.wikimedia.org/T337421 (dcausse) From my POV this is done and should be available in refinery v0.2.16 (v0.2.17 seems to be the one deploy). Unless... [08:06:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:06:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:11:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:06:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:32] !log rerun druid_load_edit_hourly to reload full snapshot [10:13:32] (GobblinLastSuccessfulRunTooLongAgo) firing: (3) Last successful gobblin run of job event_default_test was more than 2 hours
ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [10:13:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [10:13:58] hm - looks like we have an issue on the test cluster :( [10:49:35] btullis: heya [10:49:40] (DruidSegmentsUnavailable) firing: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [10:49:53] joal: Hello. How can I help? [10:50:12] btullis: I have a quick question - could we batcave real quick? [10:50:22] Yep, on my way... [10:59:40] (DruidSegmentsUnavailable) resolved: More than 10 segments have been unavailable for edits_hourly on the druid_analytics Druid cluster. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Druid/Alerts#Druid_Segments_Unavailable - https://grafana.wikimedia.org/d/000000538/druid?refresh=1m&var-cluster=druid_analytics&panelId=49&fullscreen&orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DDruidSegmentsUnavailable [11:01:35] joal: here is the patch to revert the jar version for refine_sanitize https://gerrit.wikimedia.org/r/c/operations/puppet/+/930765 [11:01:49] reading [11:06:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:05] Looking at gobblin on the test cluster, the Gobblin_event_default_test job just looks like it's hanging. 
[11:10:11] https://www.irccloud.com/pastebin/2BoOPDUP/ [11:10:28] mwarf [11:10:41] let's kill it, and hope gobblin will restart ok [11:10:46] btullis: --^ [11:10:53] I think I saw it before, here: https://phabricator.wikimedia.org/T335358 [11:11:32] It was caused by a full file system on a test hadoop worker, but I didn't work out why the files were being constantly added. [11:11:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:13:09] btullis: shall I kill the yarn jobs for gobblin? [11:13:51] Hold on, I'm going to reboot an-test-worker1002 - I'd prefer to see if gobblin comes back afterwards first. [11:13:56] ack [11:14:37] Data-Engineering, Shared-Data-Infrastructure (2022-23 Q4 Wrap up): an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 (ops-monitoring-bot) Host rebooted by btullis@cumin1001 with reason: gobblin is stuck [11:14:46] !log rebooting an-test-worker1002 for T335358 and stuck gobblin [11:14:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [11:14:48] T335358: an-test-worker1002 is constantly writing to /tmp - https://phabricator.wikimedia.org/T335358 [11:15:28] I'm tailing `journalctl -u gobblin-event_default_test.service -f` on an-test-coord1001 and we will see what happens when an-test-worker1002 comes back. [11:22:09] joal: I *think* that gobblin sorted itself out once the worker came back. https://grafana.wikimedia.org/d/pAQaJwEnk/gobblin?orgId=1&refresh=15m&var-gobblin_job_name=event_default_test&var-kafka_topic=All&from=now-6h&to=now [11:22:38] Indeed btullis - the job I was monitoring finished successfully [11:22:39] I'd appreciate a double-check on that, if and when you have time.
[11:22:40] awesome [11:23:32] (GobblinLastSuccessfulRunTooLongAgo) resolved: (3) Last successful gobblin run of job event_default_test was more than 2 hours ago. - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Gobblin - https://alerts.wikimedia.org/?q=alertname%3DGobblinLastSuccessfulRunTooLongAgo [11:23:42] \o/ [11:24:28] Right, I'll have a look at those refine_sanitized timers next, to see if they need any re-runs. [11:24:44] thanks a million [11:24:57] I'm investigating the problem with spark3 [11:26:45] (SystemdUnitFailed) firing: (12) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:27:24] hm --^ this is not great [11:27:33] this is supposed to be solved [11:29:18] Right, but this is the monitor for the job that only runs at 06:00, isn't it? So the re-run hasn't happened with the new jars. [11:29:30] ah ok - my bad sorry [11:30:44] Nono, it's fine. I think that this is the effect of the way that systemd jobs are now pinging repeatedly in IRC. The sooner we can get these refine jobs into airflow, the better, I think. [11:30:58] makes sense!
[11:31:44] joal: ref T337052 [11:31:48] T337052: Reduce IRC/alert noise associated with monitor_refine_ systemd timers from alertmanager - https://phabricator.wikimedia.org/T337052 [11:36:45] (SystemdUnitFailed) firing: (11) monitor_refine_event_sanitized_analytics_delayed.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:43:53] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook) [11:47:14] Quarry, superset.wmcloud.org, cloud-services-team (FY2022/2023-Q4): Replace Quarry with an installation of Superset - https://phabricator.wikimedia.org/T169452 (rook) >>! In T169452#8936883, @Stuartyeates wrote: > I'm aware that our superset install doesn't currently do caching of resultsets, but the...
[12:03:37] !log restarting refine_event_sanitized_analytics_delayed.service on an-launcher1002 [12:03:38] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:06:45] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:11:28] !log restarting refine_event_sanitized_main_delayed.service on an-launcher1002 [12:11:29] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:11:45] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:45] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:18:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:18:47] !log restarting the remaining monitor_refine_event_sanitized_analytics_immediate.service monitor_refine_event_sanitized_main_delayed.service monitor_refine_event_sanitized_main_immediate.service services on an-launcher1002 [12:18:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [12:21:45] (SystemdUnitFailed) firing: (10) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:26:45] (SystemdUnitFailed) firing: (7) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:27:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event_sanitized_analytics_immediate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:51:45] (SystemdUnitFailed) firing: (4) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:56:45] (SystemdUnitFailed) firing: (4) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:55:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:45] (SystemdUnitFailed) firing: (4) monitor_refine_event_sanitized_analytics_immediate.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:05:02] problem found btullis (about refine-sanitized)! 
[14:05:17] We'll fix that on Monday :) [14:05:27] Ah great! You mean about the spark3 compatibility? [14:05:34] Yes [14:06:22] this line: https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine_sanitize.pp#L36 [14:06:36] should mention "refinery-job-${refinery_version}-shaded.jar" [14:06:43] The job uses the non-shaded jar [14:06:58] We can try now if you wish, or Monday is fine :) [14:07:16] I need to drop for kids, so I guess Monday is better (plus end of day Friday, no change :) [14:08:55] joal: Yep, Monday is fine. I knew I'd seen it somewhere before. We saw almost exactly this error here: https://phabricator.wikimedia.org/T311807#8064689 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/811990 but this was on the test cluster. [14:47:31] Data-Engineering, Event-Platform Value Stream, Discovery-Search (Current work), MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), Patch-For-Review: Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (CodeReviewBot) pfischer opened https://gitlab.wikimedia.org/repos... [14:47:41] Data-Engineering, Event-Platform Value Stream, Discovery-Search (Current work), MW-1.41-notes (1.41.0-wmf.15; 2023-06-27), Patch-For-Review: Add support for redirects in CirrusSearch - https://phabricator.wikimedia.org/T325315 (CodeReviewBot) [14:51:36] Data-Engineering-Planning, Event-Platform Value Stream (Sprint 14 B), Patch-For-Review: [Event Platform] Understand, document, and implement error handling and retry logic when fetching data from the MW api - https://phabricator.wikimedia.org/T309699 (dcausse) >>! In T309699#8936632, @Ottomata wrote:... [16:35:36] Analytics, Data-Engineering-Icebox, CX-analytics, Language-analytics, Technical-Debt: Special:ContentTranslationStats is slow and getting crowded - https://phabricator.wikimedia.org/T325790 (MNeisler) Thanks @santhosh for creating this task.
I agree that reusing existing analytics infrastruct... [16:36:14] (PS3) TChin: Skip deterministic types tests for legacy schemas [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) [17:56:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:14:43] (CR) Ottomata: [C: +1] "Nice! Feel free to self merge!" [schemas/event/primary] - https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: TChin) [18:18:42] Data-Engineering-Planning, Data-Platform-SRE, Patch-For-Review: Rebuild all hadoop packages for bullseye with different distribution suffix mechanism - https://phabricator.wikimedia.org/T337465 (BTullis) Some progress, but it's still not quite right. If we look at one of the packages, we can see that... [18:51:31] Data-Engineering-Planning, DC-Ops, Data-Platform-SRE, SRE, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bull... [19:05:45] Data-Engineering: Superset permissions for nshahquinn-wmf - https://phabricator.wikimedia.org/T339385 (nshahquinn-wmf) [19:08:27] Data-Engineering, Data-Engineering-Jupyter, Product-Analytics: Replace anaconda-wmf with smaller, non-stacked Conda environments - https://phabricator.wikimedia.org/T302819 (nshahquinn-wmf) Open→Resolved a:nshahquinn-wmf The main piece has been accomplished: we have non-stacked conda envi...
[19:08:51] Data-Engineering, Data-Engineering-Jupyter, Product-Analytics: Replace anaconda-wmf with smaller, non-stacked Conda environments - https://phabricator.wikimedia.org/T302819 (nshahquinn-wmf) a:nshahquinn-wmf→None [19:47:51] Data-Engineering-Planning, DC-Ops, Data-Platform-SRE, SRE, and 2 others: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1156.eqiad.wmnet with OS bullseye... [21:36:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:41:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:06:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:11:45] (SystemdUnitFailed) firing: (3) hadoop-yarn-nodemanager.service Failed on analytics1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed