[00:38:12] RECOVERY - Check unit status of monitor_refine_eventlogging_legacy on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_eventlogging_legacy https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[02:35:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[04:02:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[04:07:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[06:35:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[10:19:52] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Compactions resulting from the import of this table have now finished. {F34882193,width=60%}
[10:35:01] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[10:40:40] Analytics, Data-Engineering, Event-Platform, Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Jdrewniak) Hey @Jdlrobson , after discussions with the data-engineering team, I agreed to implement this migration (since I originally wrote this co...
[10:41:00] Analytics, Data-Engineering, Event-Platform, Patch-For-Review, Readers-Web-Backlog (Needs Prioritization (Tech)): WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Jdrewniak)
[11:10:55] PROBLEM - Hadoop NodeManager on an-worker1125 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:11:42] elaragon: Heya - your job is killing our cluster --^
[11:13:55] poor cluster :(
[11:26:26] RECOVERY - Hadoop NodeManager on an-worker1125 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:36:19] PROBLEM - Hadoop NodeManager on an-worker1121 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[11:55:33] RECOVERY - Hadoop NodeManager on an-worker1121 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[12:25:00] Data-Engineering, Infrastructure-Foundations, SRE, Traffic-Icebox, netops: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (ayounsi)
[12:31:54] hey teammm!
[12:32:06] Hi mforns :)
[12:32:13] :]
[12:32:31] Hiya.
[12:34:34] :)
[12:35:57] joal, btullis: can any of you guys help me? I'm having issues when connecting to the hive metastore from airflow, and I suspect it might be a Kerberos issue (or else a PEBCAK issue)
[12:36:34] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (BTullis) Thanks @ottomata - Having given this some thought, I believe that my preferred course of action would be to send the events to Logstash, rather t...
[12:36:34] Heya mforns - could be Kerberos, yes
[12:36:49] mforns: Happy to help. Do you want to bc?
[12:38:51] joal: the error seems related. btullis, yes, let's bc!
[13:46:38] Analytics, Data-Engineering, Event-Platform, Wikidata, and 4 others: Add MCR slot information to revision-create events - https://phabricator.wikimedia.org/T293195 (Gehel) Open→Resolved
[13:46:57] Analytics, Data-Engineering, Data-Engineering-Kanban, Wikidata, and 2 others: Events missing from event.rdf_streaming_updater_fetch_failure but present in /wmf/data/raw/event/eqiad.rdf-streaming-updater.fetch-failure - https://phabricator.wikimedia.org/T294361 (Gehel) Open→Resolved
[14:04:50] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (BTullis) Super-useful meeting with @JAllemandou about this. It seems that there are several threads of what we need to do here, all of which are under wa...
[14:17:49] Analytics, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (Ottomata) Okay, sounds good! Just FYI: {T291645}
[14:35:53] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[16:03:21] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata)
[16:27:59] Analytics-Radar, Machine-Learning-Team, Patch-For-Review: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 (odimitrijevic)
[16:28:27] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (mforns) > PYSPARK_PYTHON=conda_env/bin/python spark2-submit --master yarn --deploy-mode cluster --archives=hdfs:///user/otto/conda_env.tgz#conda_env hdfs:///user/otto/call.py 'test_proj...
[16:29:49] Data-Engineering, observability, serviceops, Patch-For-Review: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (odimitrijevic) p:Triage→Medium
[16:34:49] (PS4) Joal: Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427)
[16:43:28] (CR) jerkins-bot: [V: -1] Add SparkSQLNoCLIDriver job [analytics/refinery/source] - https://gerrit.wikimedia.org/r/745587 (https://phabricator.wikimedia.org/T297427) (owner: Joal)
[17:27:20] Data-Engineering-Kanban, Airflow: Tooling for Deploying Conda Environments - https://phabricator.wikimedia.org/T296543 (Ottomata) I've been able to build a packed and stacked (stack 'N' pack?) conda env on top of anaconda-wmf. This allows us to build the conda env without including dependencies already av...
[17:52:10] Analytics, Data-Engineering, Event-Platform, Readers-Web-Backlog, Patch-For-Review: WikipediaPortal Event Platform Migration - https://phabricator.wikimedia.org/T282012 (Jdlrobson) Thanks for the background. Assigning to Olga to plan.
[18:01:35] ottomata: would you have a minute for me?
[18:01:52] joal: ya
[18:02:06] ottomata: I'm doing some thinking about the SparkSQL thing and would like some brainstorm
[18:02:11] ottomata: bc?
[18:02:13] in bc
[18:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[20:24:20] (PS1) Kosta Harlan: mediawiki/welcomesurvey/interaction: Add new action [schemas/event/secondary] - https://gerrit.wikimedia.org/r/746951 (https://phabricator.wikimedia.org/T267273)
[22:36:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[22:57:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[23:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[23:35:22] (CR) Gergő Tisza: [C: +2] mediawiki/welcomesurvey/interaction: Add new action [schemas/event/secondary] - https://gerrit.wikimedia.org/r/746951 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)
[23:36:04] (Merged) jenkins-bot: mediawiki/welcomesurvey/interaction: Add new action [schemas/event/secondary] - https://gerrit.wikimedia.org/r/746951 (https://phabricator.wikimedia.org/T267273) (owner: Kosta Harlan)