[01:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:33:18] RECOVERY - Hadoop NodeManager on analytics1061 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [07:10:04] (03CR) 10TChin: [C: 03+2] Skip deterministic types tests for legacy schemas [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: 10TChin) [07:10:36] (03Merged) 10jenkins-bot: Skip deterministic types tests for legacy schemas [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/929782 (https://phabricator.wikimedia.org/T338228) (owner: 10TChin) [08:07:15] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (10tchin) [08:53:06] stevemunene, btullis o/ [08:53:12] there are 3 nodes in the hdfs default rack [08:53:35] the IPs don't seem to belong to any nodes [08:53:51] 10.64.5.2[1-3] [08:53:57] probably old decom nodes? [08:54:54] o/ elukey decommed 3 nodes yesterday [08:55:22] ack makes sense then, the alert started 5 days ago [08:55:33] we shouldn't have nodes in the default rack though... [08:55:55] hm, could it be an artifact of nodes being decommed?
[08:56:05] were the namenodes restarted after the topology change? [08:56:20] bonjour joal [08:56:25] Have we removed them from the topology too early in the process? We've already copied the data from HDFS to other nodes. [08:56:49] nono I think the procedure was fine, maybe the namenodes have stale knowledge [08:57:09] they were restarted 4 days ago, but IIRC the topology change happened more recently [08:57:19] Bonjour elukey :) Hi stevemunene and btullis :) [08:57:23] so I suspect that the running config of the namenodes is not right [08:58:17] topology patch was merged on the 15th, the restart was soon after [08:58:26] hi joal [08:58:35] btullis: o/ [08:58:39] OK, I tweaked the instructions here to try to add clarity last week: https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Cluster/Hadoop/Administration#Decommissioning but maybe I should have written something about expecting and silencing an alarm for the default rack, or had steps in a different order. [08:59:11] Bonjour to you too joal. :-) [08:59:51] ah sorry it was https://gerrit.wikimedia.org/r/c/operations/puppet/+/930580, so analytics106[1-3] right?
[09:00:02] decommed but the topology still hasn't changed [09:00:03] I'll try to use "Bore da" for you btullis :) [09:00:31] okok now it makes sense, it is fine to keep the alert, I was just puzzled [09:00:38] Da iawn joal :+1 [09:00:42] in the past we got bitten by the default rack and I thought to follow up [09:01:03] all good sorry for the noise o/ [09:01:48] Ah, it's always a pleasure elukey, never noise :-) [09:02:42] Thanks elukey :) [09:04:21] <3 [09:36:18] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) [09:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:41:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:45:24] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10MediaWiki-libs-HTTP, 10Beta-Cluster-reproducible, 10Wikimedia-production-error: PHP Warning: curl_multi_remove_handle(): supplied resource is not a valid cURL Multi Handle resource - https://phabricator.wikimedia.org/T288624 (10hoo) @Ottomata... [10:14:05] Hey team - The let-go of alerts has reached a problematic level - I'll rerun the failed webrequest run 2023-06-19T17:00, and we need to discuss this this evening [10:16:17] joal: Is this an airflow re-run? If so, can you show me how you're doing it please? [10:16:25] sure btullis - batcave?
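A minimal sketch of the default-rack behaviour discussed between 08:53 and 09:01, assuming the usual Hadoop convention that a host with no entry in the topology mapping (the script configured via `net.topology.script.file.name`) is placed in `/default-rack`. The `TOPOLOGY` table, its rack name, and the `192.0.2.10` address are hypothetical; only the three `10.64.5.2[1-3]` addresses come from the conversation. It does not establish why those hosts ended up there in this case, it only shows how unmapped addresses surface.

```python
# Hadoop's conventional fallback rack for hosts the topology mapping does not know.
DEFAULT_RACK = "/default-rack"

# Hypothetical host-to-rack table; the address and rack name are invented.
TOPOLOGY = {"192.0.2.10": "/row-a/rack-1"}


def resolve_racks(hosts, topology=TOPOLOGY):
    """Return one rack per host, falling back to DEFAULT_RACK when unmapped."""
    return [topology.get(host, DEFAULT_RACK) for host in hosts]


def in_default_rack(hosts, topology=TOPOLOGY):
    """List the hosts that would be reported in the default rack."""
    racks = resolve_racks(hosts, topology)
    return [host for host, rack in zip(hosts, racks) if rack == DEFAULT_RACK]


# The three addresses mentioned at 08:53:51, plus one mapped (invented) host.
hosts = ["192.0.2.10", "10.64.5.21", "10.64.5.22", "10.64.5.23"]
print(in_default_rack(hosts))
```

A check like `in_default_rack` is roughly what an alert on "nodes in the default rack" amounts to: any address the namenode knows about but the mapping does not.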
[10:16:35] on my way [10:17:56] RECOVERY - Check systemd state on analytics1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:21:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:26:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:41:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:16:45] 10Data-Engineering, 10Traffic: Webrequest x_analtics `wprov` value is incorrectly formatted - https://phabricator.wikimedia.org/T339910 (10JAllemandou) [12:10:14] 10Data-Engineering, 10SRE, 10Traffic: Webrequest x_analtics `wprov` value is incorrectly formatted - https://phabricator.wikimedia.org/T339910 (10JAllemandou) [12:19:32] (03PS1) 10Joal: Add row exclusion to webrequest-refine [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 [12:19:37] mforns: --^ [12:19:39] please :) [12:36:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:41:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:45:35] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10klausman) Change 930610 has been pushed to prod, so now we get the full feed from changeprop. CPU usage of the outlink pods ha... [12:51:45] (SystemdUnitFailed) firing: (4) hadoop-yarn-nodemanager.service Failed on analytics1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:53:23] 10Data-Platform-SRE: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929 (10Jclark-ctr) [13:18:08] 10Data-Platform-SRE: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929 (10BTullis) I tried adding the disk back in using the existing LD number. ` btullis@an-worker1110:~$ sudo megacli -CfgLdAdd -r0 '[32:4]' -AfterLd2 -a0 Adapter 0: Configure Adapter... [13:34:51] 10Data-Platform-SRE: Disk failure on an-worker1110 - https://phabricator.wikimedia.org/T336929 (10BTullis) The new device is detected by the operating system as `/dev/sdf` ` [Tue Jun 20 13:13:17 2023] sd 0:2:3:0: [sdf] 7812939776 512-byte logical blocks: (4.00 TB/3.64 TiB) [Tue Jun 20 13:13:17 2023] sd 0:2:3:0:... 
[13:58:11] mforns: I need to drop for kids - I'll be back at standup time - Let's make our merges happen at that time :) [13:58:28] joal: ok [14:13:54] 10Data-Engineering: Canonical-data ownership, definition and update - https://phabricator.wikimedia.org/T339928 (10Antoine_Quhen) [14:47:05] 10Data-Platform-SRE, 10Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (10xcollazo) @BTullis confirming it is working for me now: ` presto> SELECT count(1) FROM analytics_test_hive.wmf.webrequest WHERE year=2023 and month=6 and da... [14:47:18] 10Data-Platform-SRE, 10Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (10xcollazo) [14:47:52] 10Data-Platform-SRE, 10Data Pipelines (Sprint 14): Upgrade Presto to release that aligns with Iceberg 1.2.1 - https://phabricator.wikimedia.org/T337335 (10xcollazo) a:05xcollazo→03BTullis [14:48:24] (03CR) 10Ottomata: page_change: add a flag for missing revision data (032 comments) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699) (owner: 10Gmodena) [14:53:17] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) I've updated the branch to build Airflow 2.6.1 instead and I'm now creating a package from it. I'll test this new package on an-test-client1001 when it's complete. 
[14:59:42] (03CR) 10Milimetric: Add row exclusion to webrequest-refine (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [14:59:46] (03CR) 10Milimetric: [C: 03+2] Add row exclusion to webrequest-refine [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [15:05:59] (03CR) 10Snwachukwu: Add row exclusion to webrequest-refine (031 comment) [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [15:06:00] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): jsonschema-tools deterministic schema test should fail if a object field does not have schema - https://phabricator.wikimedia.org/T338228 (10JArguello-WMF) 05Open→03Resolved [15:06:04] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B): Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10JArguello-WMF) 05Open→03Resolved [15:06:06] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B): jsonschema-tools tests - ensure that array items type is set - https://phabricator.wikimedia.org/T329515 (10JArguello-WMF) 05Open→03Resolved [15:06:08] 10Data-Engineering-Planning, 10Event-Platform Value Stream (Sprint 14 B): Improve mediawiki-event-enrichment test suite - https://phabricator.wikimedia.org/T328013 (10JArguello-WMF) [15:06:10] 10Data-Engineering, 10Release-Engineering-Team, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: eventutillities-python should publish python doc to doc.wikimedia.org - https://phabricator.wikimedia.org/T337475 (10JArguello-WMF) 05Open→03Resolved [15:06:14] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): Get coverage artifacts from Kokkuri - https://phabricator.wikimedia.org/T337400 (10JArguello-WMF) 05Open→03Resolved [15:06:18] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): Enable HA failover for flink-kubernetes-operator - 
https://phabricator.wikimedia.org/T336185 (10JArguello-WMF) 05Open→03Resolved [15:06:28] (03CR) 10Mforns: [C: 04-1] "Just -1inig to prevent merge, I think there's a typo in the query, see inline comment." [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [15:20:26] 10Data-Platform-SRE, 10Data Pipelines (Sprint 12), 10Patch-For-Review: anlytics/refinery deployment broken at refinery-deploy-to-hdfs - https://phabricator.wikimedia.org/T334493 (10Gehel) [15:46:52] 10Data-Engineering: Canonical-data ownership, definition and update - https://phabricator.wikimedia.org/T339928 (10Antoine_Quhen) [15:55:42] 10Data-Platform-SRE, 10Patch-For-Review: Upgrade Airflow to version 2.6.1 - https://phabricator.wikimedia.org/T336286 (10BTullis) I have successfully built the airflow 2.6.1 package and added it to reprepro. ` btullis@apt1001:~$ wget https://gitlab.wikimedia.org/api/v4/projects/93/packages/generic/airflow/2.6.... [16:00:26] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), 10Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (10Ottomata) [16:02:50] Is the etherpad up to date in respect of the analytics-deployment train? https://etherpad.wikimedia.org/p/analytics-weekly-train [16:27:15] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), 10Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (10Ottomata) Let's get privacy review on this one. https://gerrit.wikimedia.org/r/c/operatio... 
[16:34:13] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), 10Patch-For-Review: Release mediawiki.page_change.v1 stream - https://phabricator.wikimedia.org/T336817 (10Ottomata) [16:34:46] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:28] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:50:02] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:00:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:36:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:44:17] (03PS2) 10Joal: Add row exclusion to webrequest-refine [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 [17:44:47] (03CR) 10Joal: "Thanks for comments :)" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [17:46:31] (03PS3) 10Joal: Add row exclusion to webrequest-refine [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 [17:46:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:46:53] (03CR) 10Joal: "Sorry I missed one comment, and messed up my previous review" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [17:50:14] (03CR) 10Mforns: [C: 03+2] "LGTM!" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [17:50:51] milimetric: if you're nearby, would you mind removing your -1? 
[17:51:10] mwarf nevermind milimetric - it was mforns - I think I'm tired [17:51:44] (03CR) 10Joal: [V: 03+2 C: 03+2] "Merging for deploy" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/931591 (owner: 10Joal) [17:51:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:52:35] !log deploy Refinery to unbreak webrequest [17:52:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [18:00:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:01:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:17] joal: let me know if I can take over and rerun those jobs with the new params, since it's late for you [18:04:42] heya milimetric - we're still in the old meeting :) [18:04:49] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:04:59] omw [18:06:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:13:12] !log deployed airflow analytics to fix webrequest job [18:13:13] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:15:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:20:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:28:11] (03PS8) 10Gmodena: page_change: add a flag for missing revision data [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/929786 (https://phabricator.wikimedia.org/T309699) [18:31:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:43:19] joal, milimetric, xcollazo, the webrequest job worked! [18:44:47] \o/ Hooray! [18:45:35] \o/ [18:46:13] I still don't see how you added the `excluded_row_ids` from the Airflow web UI. [18:56:36] btullis: You can go to the Admin->Variables page [18:56:44] there you will find the Variables we use [18:57:15] one of them corresponds to the refine_webrequest_hourly_text DAG [18:57:34] if you open it, you'll see that the excluded row ids are there. [18:57:56] the DAG has a mechanism that will override properties with what you put in the Airflow Variable [18:58:41] the default for excluded_row_ids is '' (empty string), but if you set it in the Variable, it will override the default [18:59:13] Yes, I see. Thanks. 
I had looked at that page, but failed to see the `refine_webrequest_hourly_text_config` key. [18:59:35] joal, milimetric, xcollazo, btullis: BTW, the second job succeeded, so I'm going to remove the excluded_row_ids from the Variable now. [19:00:14] is it worth taking a screenshot before you do and adding something to wikitech about how to use this feature in case we need to do it again? [19:00:27] +1 mforns - removing those is important :) [19:01:18] I'm super happy with this fix :) Thanks a lot btullis, mforns, milimetric and xcollazo for the team effort <3 [19:01:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:02:35] I added a note to the slack thread [19:02:51] And with that I'll call it a day :) [19:03:04] Talk to you tomorrow folks [19:03:33] Thanks everyone for the support. [19:05:02] Ah I had almost forgotten: milimetric, I'm eager to have feedback from Mat - Could you please keep me posted on that matter?
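A minimal sketch of the override mechanism described between 18:56 and 18:58, assuming the Variable holds a JSON object whose keys replace the DAG's defaults. This is not the actual DAG code: `DEFAULT_PROPS`, `resolve_props`, and the example row ids are hypothetical, and only the `excluded_row_ids` property, its empty-string default, and the `refine_webrequest_hourly_text_config` key come from the conversation. In a real DAG the raw value would presumably be read with Airflow's `Variable.get`; here it is passed in as a plain string so the example runs without Airflow.

```python
import json

# Defaults a DAG might declare. Only `excluded_row_ids` and its '' default
# come from the discussion; the dict itself is a hypothetical stand-in.
DEFAULT_PROPS = {"excluded_row_ids": ""}


def resolve_props(defaults, variable_value):
    """Merge DAG defaults with overrides stored as JSON in an Airflow Variable.

    `variable_value` stands in for the raw string kept under the
    `refine_webrequest_hourly_text_config` key; empty or None means no override.
    """
    overrides = json.loads(variable_value) if variable_value else {}
    # Keys present in the Variable win; everything else keeps its default.
    return {**defaults, **overrides}


# With an override set, the listed rows would be excluded (ids are made up).
print(resolve_props(DEFAULT_PROPS, '{"excluded_row_ids": "row-a,row-b"}'))
# Once the override is removed again, the default comes back.
print(resolve_props(DEFAULT_PROPS, ""))
```

Because the Variable always wins over the default, leaving a one-off override in place would silently apply it to every later run, which is why it is removed once the reruns succeed.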
[19:06:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:07:54] will do Jo, I've pinged him on the slack thread that follows yours in working-with-data if you want to subscribe, but I'll report back anything I hear [19:08:11] :] [19:13:14] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster [19:15:10] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:12] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:20:38] 10Data-Engineering-Planning, 10Epic, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: Deploy mediawiki-page-content-change-enrichment to wikikube k8s - https://phabricator.wikimedia.org/T325303 (10Ottomata) [19:21:40] We have just received an alert for `wikidata_dump_to_hive_weekly`from a DAG run that started on 2023-02-13. In this instance, do I simply mark the task as successful? 
[19:21:44] https://usercontent.irccloud-cdn.com/file/J6L2LbSk/image.png [19:21:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:50] http://localhost:8600/dags/wikidata_dump_to_hive_weekly/grid?root=&dag_run_id=scheduled__2023-02-13T00%3A00%3A00%2B00%3A00 [19:22:38] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): Update eventgate and eventstreams helm chart to use automatic kafka egress networkpolicies and envoy service mesh - https://phabricator.wikimedia.org/T335024 (10Ottomata) [19:23:07] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B): Update eventgate and eventstreams helm chart to use automatic kafka egress networkpolicies and envoy service mesh - https://phabricator.wikimedia.org/T335024 (10Ottomata) [19:31:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:37:37] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (10bking) Notes from today's pairing session: - New hosts (R-450 chassis) have more CPUs/threads (48 as opposed to prior hosts' 32). But th... 
[19:51:18] !log merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/931683 to fix the aqs_hourly datahub lineage failure [19:51:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [19:55:43] !log clearing the first failed emit_lineage_to_datahub_for_hive_wmf_aqs_hourly task https://usercontent.irccloud-cdn.com/file/vW6YdEof/image.png [19:55:45] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [20:01:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:06:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:09:30] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster executed w... 
[20:16:45] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:16:59] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:17:52] Thanks btullis for the fix: https://gerrit.wikimedia.org/r/c/operations/puppet/+/931683 [20:30:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:45:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:50:46] 10Data-Engineering, 10Event-Platform Value Stream (Sprint 14 B), 10Patch-For-Review: eventutilities-python: review and clean up in preparation for a GA release. - https://phabricator.wikimedia.org/T336488 (10CodeReviewBot) otto merged https://gitlab.wikimedia.org/repos/data-engineering/eventutilities-python/... [20:51:52] btullis, about `wikidata_dump_to_hive_weekly` I've marked it as `success`. The wikidata json dump source is gone by now. 
Someone may have triggered the execution inadvertently... Could not find the webserver log to confirm. [20:59:19] !log Manually marked as success `wikidata_dump_to_hive_weekly` iteration `2023-02-13` in Airflow analytics [20:59:20] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:00:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:05] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:17:07] RECOVERY - Hadoop NodeManager on analytics1063 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process [21:23:07] btullis: oh, wow! Yes, I think we can mark this as successful, as we discussed in our meeting today. This way the Airflow HOME screen is clear of red spots that can be distracting when monitoring the status of jobs, no? [21:23:14] 10Data-Engineering-Planning, 10Event-Platform Value Stream, 10Machine-Learning-Team: Add a new outlink topic stream for EventGate main - https://phabricator.wikimedia.org/T328899 (10Isaac) this is VERRRRRY exciting! thank you all! I took a look at the event table on Hive and did some basic quality checks and... 
[21:28:52] !log Manual edit of `/srv/airflow-analytics/connections.yml` following changes in https://gerrit.wikimedia.org/r/c/operations/puppet/+/931690 to avoid alerts Airflow analytics aqs_hourly [21:28:55] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log [21:46:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:59] (SystemdUnitFailed) firing: (2) produce_canary_events.service Failed on an-launcher1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:59:30] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster [22:47:54] 10Data-Platform-SRE, 10Discovery-Search (Current work): Investigate performance differences between wdqs2022 and older hosts - https://phabricator.wikimedia.org/T336443 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by bking@cumin1001 for host wdqs2021.codfw.wmnet with OS buster completed:...