[00:03:30] (CR) Gergő Tisza: [C: +2] homepagemodule: Add mentorship-optout/mentorship-optin actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/786268 (https://phabricator.wikimedia.org/T287915) (owner: Urbanecm)
[00:04:00] (Merged) jenkins-bot: homepagemodule: Add mentorship-optout/mentorship-optin actions [schemas/event/secondary] - https://gerrit.wikimedia.org/r/786268 (https://phabricator.wikimedia.org/T287915) (owner: Urbanecm)
[01:11:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[01:16:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[02:36:03] PROBLEM - Hadoop NodeManager on an-worker1116 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:40:39] RECOVERY - Hadoop NodeManager on an-worker1116 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:39:39] PROBLEM - Hadoop NodeManager on an-worker1103 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[03:52:53] RECOVERY - Hadoop NodeManager on an-worker1103 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:03:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:08:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:12:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[04:17:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[04:41:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[04:46:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[06:53:34] (CR) Aqu: [C: +1] "Note for deployer: After deployment, please rerun the service mediawiki-history-drop-snapshot on an-launcher1002. (See analytics-alerts@wi" [analytics/refinery] - https://gerrit.wikimedia.org/r/786448 (https://phabricator.wikimedia.org/T303988) (owner: Joal)
[07:07:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:12:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:12:57] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:17:57] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:23:12] (CR) Joal: Add new wikis to sqoop list (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/787513 (https://phabricator.wikimedia.org/T304632) (owner: Snwachukwu)
[07:32:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:37:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[07:39:12] (CR) Joal: [V: +1] Update refinery-drop-mediawiki-snapshots (1 comment) [analytics/refinery] - https://gerrit.wikimedia.org/r/786448 (https://phabricator.wikimedia.org/T303988) (owner: Joal)
[07:41:21] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Research: Update HDFS links tables as Mediawiki changes - https://phabricator.wikimedia.org/T304979 (JAllemandou) Thank you @Ladsgroup for the update. We need to prioritize the change in DE soon :)
[07:45:59] Data-Engineering, Airflow: Use airflow to load cassandra - https://phabricator.wikimedia.org/T306962 (JAllemandou) Hey @mforns - Thank you for your comment :) For most jobs it's a single action - my question was more about the need for an executor versus reusing the existing spark one, as it'll only be a...
[09:21:48] Data-Engineering: LVS in Analytics VLANs - https://phabricator.wikimedia.org/T288750 (ayounsi) > Now that the analytics vlan outbound firewall restrictions have been removed in {T298087} - what is the impact on this ticket? Direct impact is that it makes all 3 options easier to implement. > Is it still the...
[13:03:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[13:08:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org/?q=alertname%3DEventgateLoggingExternalLatency
[13:09:18] Data-Engineering: LVS in Analytics VLANs - https://phabricator.wikimedia.org/T288750 (Ottomata) Hm, I don't think there are many hosts/services for which we need the LVS. Perhaps we can do Option 1 or 3 for just the hosts we need LVS for?
[14:26:33] Data-Engineering, Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (Ottomata) Just discussed this and other options for installing Spark 3 with @JAllemandou and @Antoine_Quhen. Decisions: - We will create a new debian packaged 'conda base env' with the intention of using this to repl...
[17:00:14] Data-Engineering-Radar, API Platform, Platform Engineering Roadmap: Retroactively fix logging to use a RequestScopedLogger where applicable - https://phabricator.wikimedia.org/T305504 (FGoodwin) Open→In progress
[17:00:16] Data-Engineering, API Platform, Platform Engineering Roadmap, User-Eevans: AQS 2.0: Implement pageviews endpoints - https://phabricator.wikimedia.org/T288296 (FGoodwin)
[17:13:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[17:38:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[17:45:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:15:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org/?q=alertname%3DHiveServerHeapUsage
[18:39:14] Data-Engineering, Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (Htriedman) @Ottomata Thanks for this update! The differential privacy project is currently using a jerry-rigged version of Spark 3 to run our software packages, so please let me know (either in this thread on phab or v...
[18:40:52] Data-Engineering, Airflow: Install spark3 - https://phabricator.wikimedia.org/T295072 (Ottomata) Nice! Will do.
[19:11:29] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Product-Analytics, and 2 others: Add superset-next.wikimedia.org domain for superset staging - https://phabricator.wikimedia.org/T275575 (razzi)
[19:11:52] ottomata: yt? Could use another pair of eyes on my superset-next.wikimedia.org plan
[19:18:13] ooo razzi bout to do a workout i got 3 mins!
[19:18:28] ottomata: ok 3 minute sync in the batcate!?
[19:18:31] ya
[20:29:17] razzi: how goes?
[21:12:05] ottomata: good, was in a meeting, now I have 2 patches for superset-next.wikimedia.org to work:
[21:12:05] https://gerrit.wikimedia.org/r/c/operations/puppet/+/666481 and https://gerrit.wikimedia.org/r/c/operations/dns/+/774537
[21:12:44] nice!~
[21:14:54] Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (JtsMN) just here to echo continued persistence of this problem, and advocating for continued pressure to address this!
[21:40:20] Data-Engineering, DC-Ops, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (RobH)
[21:43:39] Data-Engineering, DC-Ops, ops-eqiad: Q4: rack/setup/install stat1010 - https://phabricator.wikimedia.org/T307399 (RobH)
[22:00:54] Quarry: Pressing the Stop button in Quarry results in a 500 error - https://phabricator.wikimedia.org/T290146 (Certes) I may have a workaround: * tick the "Don't allow site to prompt you again" box on the 500 box (text varies by browser and language), or ad-block that box and its modal background * replace t...
[22:43:59] (PS1) Razzi: Use vars.qrun_id when stopping query [analytics/quarry/web] - https://gerrit.wikimedia.org/r/788438
[22:45:10] (PS2) Razzi: Use vars.qrun_id when stopping query [analytics/quarry/web] - https://gerrit.wikimedia.org/r/788438 (https://phabricator.wikimedia.org/T307297)
[22:46:38] (PS3) Razzi: Use vars.qrun_id when stopping query [analytics/quarry/web] - https://gerrit.wikimedia.org/r/788438 (https://phabricator.wikimedia.org/T307297)
[23:13:10] Quarry, cloud-services-team (Kanban): Request to add razzi to Quarry Nova resource - https://phabricator.wikimedia.org/T307403 (bd808)
[23:13:50] Quarry, cloud-services-team (Kanban): Request to add razzi to Quarry Cloud VPS project - https://phabricator.wikimedia.org/T307403 (bd808)