[00:00:17] Data-Engineering, Data-Engineering-Kanban, User-razzi: Triage Superset Dashboard Timeouts - https://phabricator.wikimedia.org/T294768 (razzi) I'm going to go ahead and call this done; new timeouts can create new tickets.
[00:00:24] Data-Engineering, Data-Engineering-Kanban: Triage Superset Dashboard Timeouts - https://phabricator.wikimedia.org/T294768 (razzi)
[00:05:21] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (razzi) Currently what we're seeing is `upstream request timeout` with dashboards that time out such as IP Masking Dashboard (open it in multiple tabs if...
[00:25:58] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (razzi) A useful trick for testing timeouts is to use a sql datastore (like mysql_staging) and run `select sleep(1000)`, which can be done easily in sqlla...
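The `select sleep(1000)` trick from T294771 is useful because a query with a known, fixed duration makes it clear which layer gives up first. A minimal Python sketch of that reasoning follows. The layer names and timeout values are made up for illustration; none of them come from the actual Superset, proxy, or MySQL configuration discussed here.

```python
from typing import Dict, Optional

# Hypothetical timeout budget (seconds) for each layer between the browser
# and the database. Names and values are illustrative assumptions only.
HYPOTHETICAL_TIMEOUTS: Dict[str, int] = {
    "proxy": 60,
    "web_server": 185,
    "database_driver": 300,
}


def sleep_query(seconds: int) -> str:
    """Build the MySQL test query from the ticket comment for a given duration."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return f"select sleep({seconds})"


def first_layer_to_time_out(query_seconds: int, timeouts: Dict[str, int]) -> Optional[str]:
    """Return the exceeded layer with the smallest timeout, or None if none is exceeded."""
    exceeded = {name: limit for name, limit in timeouts.items() if query_seconds > limit}
    if not exceeded:
        return None
    # The layer with the smallest exceeded limit aborts the request first.
    return min(exceeded, key=exceeded.get)
```

Raising one layer's value and rerunning the same sleep query would then show which layer fails next.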
[00:30:18] PROBLEM - Check unit status of monitor_refine_eventlogging_analytics on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_eventlogging_analytics https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[00:36:22] PROBLEM - Hadoop NodeManager on an-worker1089 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:46:12] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.428 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[00:47:22] RECOVERY - Hadoop NodeManager on an-worker1089 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[00:58:18] PROBLEM - Hadoop NodeManager on an-worker1129 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:06:58] RECOVERY - Hadoop NodeManager on an-worker1129 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[01:07:52] Data-Engineering, Data-Engineering-Kanban, User-razzi: Run Atlas on test cluster - https://phabricator.wikimedia.org/T296670 (razzi) Here's what the atlas ui looks like after loading their sample dataset by running `/opt/atlas/bin/quick_start.py` (username and password are `admin`, loading takes abou...
[01:32:02] PROBLEM - Hadoop NodeManager on an-worker1126 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:01:56] RECOVERY - Hadoop NodeManager on an-worker1126 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[02:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[05:43:31] PROBLEM - eventgate-analytics-external validation error rate too high on alert1001 is CRITICAL: 2.064 gt 2 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[06:34:42] good morning :)
[06:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[06:35:13] from an-worker1089 I see java OOM reported in the logs, the app mentioned at that time seems to be https://yarn.wikimedia.org/proxy/application_1637058075222_72643
[06:37:46] same thing on an-worker1126
[06:38:21] I am seeing a ton of shuffle operations being done
[06:44:20] ---
[06:44:40] for eventgate, there is a schema causing issues
[06:44:41] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=75&orgId=1&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&from=now-24h&to=now
[06:44:53] it seems mediawiki.web_ui_scroll
[06:46:42] https://phabricator.wikimedia.org/rOMWCab7c3a1b5c01e85cb93b625334500eb4a51e64d3
[06:46:47] was added yesterday
[06:48:05] https://logstash.wikimedia.org/app/dashboards#/view/AXN5OoJu3_NNwgAUlbUT?_g=h@c823129&_a=h@ff2a9c2
[06:48:13] '' should have required property 'access_method' lol
[06:52:37] commented in https://phabricator.wikimedia.org/T292586, not sure if there is a better procedure
[10:29:16] elukey: Thanks for looking into these. For the Yarn job I can see that it has finished now. Do you think that we need to work with the user to make that job more cluster-friendly? It might have been a one-off query, I suppose?
[10:32:59] btullis: I think that Joseph was looking into logs, I have not a lot of experience with Spark to judge but yeah we may need to follow up :(
[10:33:11] it may be a misuse of spark shufflers
[10:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[10:36:52] Hi btullis, elukey: I looked at the job a bit, it's reading/shuffling a lot of data (mediawiki text), and therefore puts pressure on shufflers
[10:38:31] ah!
[11:09:03] Data-Engineering, Data-Engineering-Kanban: Move spark.local.dir to /srv on stat100x - https://phabricator.wikimedia.org/T295346 (BTullis) Successfully shut down all Jupyter notebooks on all stat100x servers. Now when they are restarted they pick up the new ReadWritePaths setting. ` btullis@stat1004:~$ sy...
[12:33:27] Analytics, CheckUser, Patch-For-Review, Platform Team Workboards (Clinic Duty Team), Schema-change: Schema changes for `cu_changes` and `cu_log` table - https://phabricator.wikimedia.org/T233004 (JAllemandou) Hi @Ladsgroup :) Thanks for caring ! You'll find some minimal doc here: https://wiki...
[12:44:53] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Created the snapshot with the following command: ` btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7]....
[12:46:44] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (BTullis) a:razzi→BTullis
[12:48:38] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review, User-razzi: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771 (BTullis) I'm assigning this ticket to myself to see if I can make any more sense of it. Hope that's OK @razzi.
[12:49:26] Data-Engineering, Data-Engineering-Kanban: Re-enable Superset **metadata** caching - https://phabricator.wikimedia.org/T295295 (BTullis) a:BTullis
[12:53:27] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) Creation confirmed: ` btullis@cumin1001:~$ sudo cumin --mode async 'aqs100[4,7].eqiad.wmnet' 'nodetool-a list...
[12:54:00] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis)
[13:00:00] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Repair and reload cassandra2 mediarequest_per_file data table - https://phabricator.wikimedia.org/T291470 (BTullis) The transfer of the 4 snapshots is under way to the aqs1010 and aqs1011 nodes using the following script from...
[14:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[14:50:13] Analytics, Analytics-Kanban, Data-Engineering: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (Ottomata) Resolved→Open Perhaps hadoop 3 can do this with kerberos? https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3....
[14:50:15] Data-Engineering, Airflow, Epic, Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (Ottomata)
[14:52:10] Data-Engineering, Airflow, Spike: Explore Containerization Solutions for DE Applications - https://phabricator.wikimedia.org/T288254 (Ottomata) Just reopened https://phabricator.wikimedia.org/T296543#7538145, perhaps we should merge these together.
[14:52:14] Analytics, Analytics-Kanban, Data-Engineering, Airflow: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (Ottomata)
[15:52:41] Data-Engineering, Data-Engineering-Kanban, SRE-swift-storage: Deploy research_poc Swift credidentials to Hadoop - https://phabricator.wikimedia.org/T296945 (Ottomata)
[16:00:03] Analytics, Data-Engineering, Data-Engineering-Kanban, Desktop Improvements, and 4 others: Sticky header: Add agent_type and access_method to sticky header instrumentation - https://phabricator.wikimedia.org/T294246 (Jdlrobson) According to T292586#7542866 it looks like there's a problem in the im...
[16:12:23] hey, I'm trying to make https://superset.wikimedia.org/r/914 available to an user who's in the `wmf` group, but not in analytics-privatedata-users. Gave up after few tries (as I can't test it from my own account) and sent them an export instead, but I'm still interested to know what I'm doing wrong.
[16:13:22] Or maybe _no_ hdfs reads for users w/o shell are allowed? I thought chmod'ing to o=rx will work, but maybe not :))
[16:37:55] urbanecm: to read from hadoop the user needs to be deployed to the hdfs master nodes.. we offer two ways - with shell access, or without it
[16:37:58] see https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Dashboards_in_Superset_/_Hive_interfaces_(like_Hue)_that_do_access_private_data
[16:38:16] so yeah the use case that you are trying to work on is not viable currently
[16:38:32] (IIUC wmf users that don't have their username in puppet basically, only ldap accounts)
[16:39:18] elukey: ack, thanks. "Dashboards in web tools like Turnilo and/or Superset that do not access private data" made me think this is possible, but apparently there's a different kind of dashboards that bullet point refers to?
[16:39:19] the users that you are targeting can see data coming from Druid though
[16:39:40] exactly yes, we have two source of data
[16:39:48] presto (that is basically fetching from hdfs)
[16:40:02] druid, that is completely different and no authentication is enforced
[16:40:29] datasets are explicitly indexed on druid, so not everything on hadoop/hdfs is on it
[16:42:26] i see. can users who aren't declared in puppet access the staging DB?
[16:43:15] urbanecm: mysql, you mean? i think so.
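The access rules laid out between 16:37 and 16:40 can be summarized as a small decision function. This is only a sketch of what is said in the channel: the `User` class, its `deployed_on_hdfs_masters` flag, and the source labels are hypothetical names, not how Superset, Presto, or Druid actually implement authorization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    # Hypothetical flag: True when the username is declared in puppet and
    # therefore deployed to the HDFS master nodes (with or without shell access).
    deployed_on_hdfs_masters: bool


def can_read(user: User, source: str) -> bool:
    """Apply the rules from the discussion to one data source."""
    if source == "druid":
        # Per the chat: no authentication is enforced on Druid, but only
        # explicitly indexed datasets exist there.
        return True
    if source == "presto":
        # Per the chat: Presto fetches from HDFS, so the user must be
        # deployed to the HDFS master nodes.
        return user.deployed_on_hdfs_masters
    raise ValueError(f"unknown source: {source}")
```

Under these assumptions an LDAP-only account, such as the one asked about at 16:12, would get `True` for `"druid"` and `False` for `"presto"`.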
[16:44:01] yes ottomata
[16:45:50] at the moment the datasources are not configured in superset
[16:45:58] or better, the connection doesn't work
[16:46:45] but we could in theory think about sharing the staging db
[16:47:09] I'd vote against sharing all the other dbs on dbstore though, until we have 2FA at least
[16:52:16] elukey: i can access the staging db in superset
[17:24:06] PROBLEM - Hadoop NodeManager on an-worker1092 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:26:53] !log Kill paragon job to prevent more nodemangers to OOM
[17:26:56] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:43:32] RECOVERY - Hadoop NodeManager on an-worker1092 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[17:51:12] ottomata: weird, "Test connection" fails in the datasources. Do you mean that you can pull data from the staging db on dbstores?
[17:51:19] like in sqllab etc..
[17:55:26] ah right I see tables in sqllab
[17:56:33] so the datasource panel is not really useful with that testconnection :(
[17:56:53] ottomata: should we have that "wikishared" datasource though?
[17:57:33] !log drop "EventLogging MySQL" datasource from Superset (not valid anymore)
[17:57:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[17:58:25] joal: can I drop https://superset.wikimedia.org/tablemodelview/list/?filters=(table_name:Eventlogging)&pageIndex=0&sortColumn=changed_on_delta_humanized&sortOrder=desc ?
[17:59:09] yes elukey - thanks!
[18:02:55] cleaned up :)
[18:04:21] ottomata: (if you have time later on, can you quickly review https://gerrit.wikimedia.org/r/c/operations/puppet/+/743150 and let me know if you are ok? If so I'll move the test cluster to the new uid/gid tomorrow :)
[18:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[19:37:25] (CR) Razzi: [C: +2] Link to AQS documentation instead of Research page [analytics/wikistats2] - https://gerrit.wikimedia.org/r/742764 (https://phabricator.wikimedia.org/T295298) (owner: Milimetric)
[19:49:34] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.9531 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[20:05:51] !log restarting pageview-druid-daily-coord (killing 0062888-210701181527401-oozie-oozi-C) - I can't seem to rerun a particular hour, so just starting again from that hour.
[20:05:53] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[20:17:55] ottomata: that's weird - I'm usually able to rerun instances :(
[20:23:50] i know me too
[20:24:04] joal: some my user became the owner of the workflow i was trying to reurn
[20:24:08] somehow*
[20:28:33] razzi: shall I deploy wikistats?
[22:35:00] (EventgateLoggingExternalLatency) firing: (2) Elevated latency for GET events on eventgate-logging-external in codfw. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org