[00:00:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[00:10:27] (HiveServerHeapUsage) firing: (2) Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://alerts.wikimedia.org
[00:15:27] (HiveServerHeapUsage) resolved: (2) Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://alerts.wikimedia.org
[00:36:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[00:41:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[02:51:37] 10Data-Engineering: reset kerberos password - https://phabricator.wikimedia.org/T303146 (10Effeietsanders)
[02:52:01] 10Data-Engineering: reset kerberos password - https://phabricator.wikimedia.org/T303146 (10Effeietsanders)
[04:46:11] 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy)
[04:48:05] 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy)
[04:51:37] 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy) If someone has a better idea if `opis/json-schema` or `justinrainbow/json-schema` (preferably the l...
[04:54:15] 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy)
[05:41:36] 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Marostegui) I don't know what's the status of this anymore as I have been on holid...
[06:48:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Zache) Currently works for me in toolforge db.
[07:40:57] 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Marostegui) To be clear, this was only fixed on two hosts and on some wikis, but n...
[07:42:44] 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Ladsgroup) List of wikis that have this table: https://noc.wikimedia.org/conf/dbli...
[08:07:15] (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:12:15] (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:20:29] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Persistence (Consultation), 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Marostegui)
[09:46:16] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (10BTullis) This is the required DNS change to add the service name: `datahubsearch.svc.eqiad.wmnet` https://gerrit.wikimedia.org/r...
[09:47:04] 10Data-Engineering, 10Event-Platform, 10SRE, 10Traffic, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10Ladsgroup)
[09:51:31] 10Data-Engineering, 10Data-Catalog, 10SRE, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) p:05Triage→03Medium
[10:22:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:24:42] joal: I'd love to steal a few minutes of your time for some thoughts on how I might be able to better perform a query I'm running on pageview_hourly :)
[10:25:00] Hi addshore - meetings now - at lunch time?
[10:25:05] sounds great!
[10:25:18] want to pick a specific time or just ping?
[10:27:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:32:57] elukey: quick question re: this hive JVM flapping alert. Is it usually sufficient to restart the service on an-coord1001, or would you do a DNS-based failover to an-coord1002 before restarting, then fail back?
[10:36:17] btullis: o/ a failover would probably avoid any job failure, but maybe 80% is low nowadays.. a 90% threshold could be good as well. The other thing that I recall is that hive uses more and more memory over time, like there was a little leak (never investigated it though; I recall checking some metrics a while ago)
[10:36:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:45:21] 10Analytics, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: 502, connect failed for intake-analytics.wikimedia.beta.wmflabs.org (Mar 2022) - https://phabricator.wikimedia.org/T303160 (10AlexisJazz)
[10:51:27] elukey: Yes, I can see that we seem to have a gradual pattern of leaking: https://grafana.wikimedia.org/d/000000379/hive?orgId=1&var-instance=an-coord1001&viewPanel=7&from=now%2Fy&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=analytics
[10:51:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:51:36] I'll do a failover and restart.
[10:55:01] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (10BTullis) This is the required change to start setting up the LVS configuration: https://gerrit.wikimedia.org/r/c/operations/pupp...
[10:55:43] btullis: if possible let's do a dump of the heap space (or whatever is growing), so we can follow up later
[11:14:06] 10Analytics, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: 502, connect failed for intake-analytics.wikimedia.beta.wmflabs.org (Mar 2022) - https://phabricator.wikimedia.org/T303160 (10Majavah)
[11:30:28] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis)
[11:30:51] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis) p:05Triage→03Medium
[11:33:37] Heya addshore - is now good?
[11:35:11] Yes!
[11:35:35] addshore: meet.google.com/cek-oxpa-mge
[11:38:05] !log failing over hive to an-coord1002 T303168
[11:38:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:38:10] T303168: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168
[11:51:31] !log obtaining summary of heap objects and sizes: `hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16971 > hive-object-storage-and-sizes.T303168.txt`
[11:51:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:51:34] T303168: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168
[11:52:49] !log obtaining heap dump: `hive@an-coord1001:/srv/hive-tmp$ jmap -dump:format=b,file=hive_server2_heap_T303168.bin 16971`
[11:52:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:12] !log restarted hive-server2 process on an-coord1001
[12:10:14] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:18:30] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis) ` Heap dump file created hive@an-coord1001:/srv/hive-tmp$ ls -lh total 8.3G -rw-r--r-- 1 hive hive 19M Mar 7 11:51 hive-object-storage-and-sizes.T303...
[12:21:57] ottomata: just fyi I accepted our -internal message (it was held 'cause "Message has more than 10 recipients")
[12:22:04] *your (not our)
[12:38:44] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10LSobanski) @Jenlenfantwright @LNguyen This task changed state twice (in February and in March) despit...
[12:45:44] !log About to deploy airflow-dags/analytics - Migrates wikidata/item_page_link
[12:45:46] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:48:29] hey a-team, question: is it safe to consider that revision ids will always be correlative? i.e. does rev_id_x > rev_id_y imply that timestamp(rev_id_x) > timestamp(rev_id_y), within the same project?
[12:50:02] dsaez: I'm afraid I don't know. It seems logical, but I haven't got a categorical answer for you, sorry.
[12:50:30] btullis, np, thanks.
[12:51:04] Hi dsaez - I wouldn't trust that - I agree it's natural, but sometimes data is not "logic" :)
[12:52:15] especially when it is mediawiki data :D thanks joal.
[12:53:11] dsaez: for an approximation, it should be fine (in most cases), but there are always special cases
[12:57:53] got it
[12:58:14] 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis) Reference heap dumps created from a freshly restarted hive-server2 process. ` hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16616 > hive-object-sto...
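[Editor's note] The rev_id/timestamp ordering question above can be checked empirically rather than assumed. A minimal HiveQL sketch: it assumes a per-wiki table with the MediaWiki core column names `rev_id` and `rev_timestamp` (the actual dataset, e.g. wmf.mediawiki_history, uses different column names, so adapt accordingly); a nonzero count means rev_id order does not match timestamp order.

```sql
-- Hedged sketch: count revisions whose timestamp is earlier than that of
-- a revision with a lower rev_id on the same wiki. Table and column names
-- (`revision`, rev_id, rev_timestamp) are assumptions from MediaWiki's
-- core schema, not from the discussion above.
SELECT COUNT(*) AS out_of_order
FROM (
  SELECT
    rev_timestamp,
    LAG(rev_timestamp) OVER (ORDER BY rev_id) AS prev_ts
  FROM revision
) t
WHERE rev_timestamp < prev_ts;
```

As joal's caveat suggests, imports, history merges, and deleted/restored revisions are the likely sources of any nonzero result.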
[13:02:12] (03CR) 10Aqu: [C: 03+2] "- Add HQL file triggered from Airflow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/763219 (owner: 10Aqu)
[13:02:38] (03CR) 10Aqu: [V: 03+2 C: 03+2] Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/763219 (owner: 10Aqu)
[13:05:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[13:09:35] !log About to deploy analytics/refinery - Migrate wikidata/item_page_link/weekly from Oozie to Airflow
[13:09:37] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:10:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[13:30:50] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata) Discussions that caused changes to the task here are all in the comments. Some notes were...
[13:34:13] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata) Oh, BTW in case you weren't aware, the Decision Record we are submitting now is explicitly...
[13:45:02] joal: helloOoOOo
[13:46:16] also aqu_ looking into the wikidata/entity job failure
[13:46:30] i'm not 100% sure what needs to be done. I see the MR you mentioned was merged
[13:46:44] should I just deploy airflow-dags?
[13:46:51] and then rerun the airflow job?
[13:54:32] Hello!
[13:57:06] aqu_: hello!
[13:57:27] airflow-dags/analytics has been deployed. And I am finishing the deployment of refinery.
[13:57:32] oh okay!
[13:57:56] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (10Ottomata) 05Open→03Declined Hi, my understanding is that WikiStats 1 is deprecated and maintained as a static site with minimal...
[13:58:09] But actually, I am doing it. Don't worry. Because it's important for me to check that it's working well.
[13:58:50] So now, I am going to kill the Oozie job and schedule the AF one.
[13:59:24] 10Analytics-Wikistats, 10Data-Engineering, 10Browser-Support-Opera: Opera 15+ seems not to be recognized correctly - https://phabricator.wikimedia.org/T61816 (10Ottomata) This is not an ops week task. I think this should be groomed with wikistats maintainers and prioritized accordingly. Moving back to inc...
[13:59:39] okay, thanks aqu_
[13:59:53] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering: Confusing filtering on "Active editors by country" topic - https://phabricator.wikimedia.org/T300365 (10Ottomata) This is not an ops week task. I think this should be groomed with wikistats maintainers and prioritized accordingly. Moving back to inco...
[14:01:03] Hi ottomata :)
[14:01:52] hello!
[14:02:06] FYI am discussing some things with filippo in #wikimedia-observability
[14:03:32] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10Data-Engineering-Kanban, and 5 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata) 05Open→03Resolved a:03Ottomata
[14:03:52] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Kanban, 10Data-Engineering-Radar, and 5 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata) 05Resolved→03Open Oh, sorry should not have r...
[14:03:58] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Kanban, 10Data-Engineering-Radar, and 5 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata)
[14:14:37] 10Data-Engineering: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (10Ottomata) Hi @CMacholan, please confirm that it is okay to delete the following data belonging to rhuang-ctr: ` ====== stat1005 ====== total 16 drwxr-xr-x 7 34282 wikidev 4096 Nov 2 17:55 Editing-movement...
[14:25:34] 10Data-Engineering: Check home/HDFS leftovers of ema - https://phabricator.wikimedia.org/T302815 (10Ottomata) 05Open→03Resolved a:03Ottomata Ema does not have any data on stat boxes or in HDFS or in Hive. Removed his homedirs following https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Full_rem...
[14:27:44] 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Ottomata) @Gehel can you confirm that it is okay to delete the following files and homedirs belonging to Zbyszko? ` ====== stat1004 ====== total 240320 -rw-rw-r-- 1 22656 wikidev 683 Sep 13 11:43 clus...
[14:44:02] !log failing back hive services to an-coord1001
[14:44:04] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:20] 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Gehel) Looks good to be deleted, but I'd like @dcausse to confirm.
[14:51:47] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Discussed in IRC with @joal and @fgiunchedi. Summary: For all task related metrics, we should be able to get Kaf...
[14:51:57] 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Gehel) @Ottomata I confirmed with David, nothing to salvage here, please delete.
[14:53:04] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10Ottomata) a:05BTullis→03None Ok thanks, we will check back in during or after the week of March 21 to give you a little time for onboarding and to verify. Tha...
[14:55:41] 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Ottomata) 05Open→03Resolved a:03Ottomata Thank you! Done following https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Full_removal_of_files_and_Hive_databases_and_tables ` 14:23:07 [@an-...
[14:56:27] (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[14:58:30] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Data-Catalog: datahubsearch nodes alerting with "Rate of JVM GC Old generation-s runs" - https://phabricator.wikimedia.org/T302818 (10BTullis) > I would move this section: > > `file { '/usr/share/opensearch/plugins': > ensure => 'directory', >...
[14:59:01] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10JAllemandou) I have a wonder about option 2: the Prometheus PushGateway doc says, about using 'POST': //POST works exactly l...
[14:59:03] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Data-Catalog: Complete monitoring setup of datahubsearch nodes - https://phabricator.wikimedia.org/T302818 (10BTullis) p:05Triage→03Medium
[15:01:27] (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[15:03:29] gone for kids - back at standup
[15:03:39] 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define the Helm charts and helmfile deployments for Datahub - https://phabricator.wikimedia.org/T301454 (10BTullis) Moving this to //in review// whilst {T303049} is being handled by the Service Ops team.
[15:06:57] (03CR) 10Aklapper: "This has been merged but will never be deployed" [analytics/wikistats] - 10https://gerrit.wikimedia.org/r/316289 (https://phabricator.wikimedia.org/T64570) (owner: 10Paladox)
[15:14:14] joal: so you would recommend a join over something like `AND page_title IN ( SELECT * FROM addshore_temp_topic_pages_all )` ?
[15:18:04] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Good catch, I just verified this. Without using a distinct groupingKey, tasks will override each other's metrics...
[15:18:56] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Hm, since we have to use distinct groupingKeys per task always, I guess it doesn't matter if we use POST or PUT. Hm.
[16:00:10] addshore: I'd express that as a join rather than an IN, but yes, that is it
[16:05:33] 10Data-Engineering-Kanban: Projectviews by country Airflow job - https://phabricator.wikimedia.org/T303193 (10Milimetric)
[16:05:50] 10Data-Engineering, 10Data-Engineering-Kanban: Projectviews by country Airflow job - https://phabricator.wikimedia.org/T303193 (10Milimetric) a:03Milimetric
[16:17:11] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Radar, 10Product-Analytics, and 4 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata) a:05Ottomata→03None
[16:17:24] 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Radar, 10Product-Analytics, and 4 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata)
[16:25:35] milimetric: would you please review the events I added to the timeline?
[16:25:52] Still adding some, but most of the thing is in place I think
[16:26:05] I just did joal, thank you! I was taking forever finding that Feb 16 date and by the time I went back to the timeline you had easily won :)
[16:31:37] ottomata: looks like we have some errors for druid jobs - may I let you investigate?
[16:53:48] 10Data-Engineering, 10Airflow: [Airflow] Troubleshoot traffic anomaly detection job - https://phabricator.wikimedia.org/T303199 (10mforns)
[16:55:17] 10Data-Engineering, 10Airflow: [Airflow] Troubleshoot traffic anomaly detection job - https://phabricator.wikimedia.org/T303199 (10mforns) https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/28
[16:58:08] 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Airflow-triggered Spark-jobs produce hdfs-files belonging to the wrong hdfs-user-group - https://phabricator.wikimedia.org/T303201 (10Antoine_Quhen)
[17:39:29] joal: looking
[17:43:22] btullis: an oozie druid load job has been failing for about 3 or 4 hours, am wondering if it is related to an-coord1001 failover stuff?
[17:51:50] i dunno if related, but we are getting this from a regular job that reads web requests now: org.apache.hadoop.security.AccessControlException: Permission denied: user=analytics-search, access=READ_EXECUTE, inode="/wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7/hour=14":analytics:hdfs:drwxr-x---
[17:52:09] thank you ebernhardson interesting.
[17:56:23] ottomata: my assumption is that this is related to your change on hive config about group ownership - the hive-server probably hadn't been restarted, and the change only got picked up
[17:56:30] hmmmmmmmmmmm
[17:56:38] interesting.
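[Editor's note] The join-vs-IN exchange between addshore and joal (15:14 / 16:00) can be sketched concretely. A hedged HiveQL example: `addshore_temp_topic_pages_all` is the temp table named in the discussion, but its column name and the surrounding query shape are assumptions; `wmf.pageview_hourly` with `page_title` and `view_count` is the table addshore mentioned querying. A LEFT SEMI JOIN preserves the IN-subquery semantics (filter only, no row duplication) while letting Hive plan it as a join:

```sql
-- Hedged sketch of joal's suggestion: express the IN-subquery as a join.
-- Original shape (per the 15:14 message):
--   ... AND page_title IN ( SELECT * FROM addshore_temp_topic_pages_all )
-- Join form; column `page_title` on the temp table is an assumption:
SELECT pv.page_title,
       SUM(pv.view_count) AS views
FROM wmf.pageview_hourly pv
LEFT SEMI JOIN addshore_temp_topic_pages_all t
  ON (pv.page_title = t.page_title)
WHERE pv.year = 2022 AND pv.month = 3
GROUP BY pv.page_title;
```

If the temp table is small, Hive can turn this into a map-side join, which is typically where the performance win over a correlated IN comes from.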
[17:57:29] https://phabricator.wikimedia.org/T291664
[17:58:51] ottomata: hdfs dfs -ls /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7/
[17:59:15] ottomata: different group, and we have disabled all-readership
[17:59:23] yes
[17:59:56] but, i hadn't expected the ownership to be affected, just the perms
[18:00:00] but, investigating
[18:00:04] i think you are probably right
[18:34:33] !log restarting hive-server2 on an-coord1001 to revert hive.warehouse.subdir.inherit.perms change - T291664
[18:34:36] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:34:36] T291664: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664
[18:35:16] yes, joal i'm not quite understanding why tho; from what I can tell, by disabling the thing it should have used default hadoop perms, which inherit from the parent.
[18:35:21] i'm reverting
[18:37:04] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7 - after reverting - T291664
[18:37:05] ottomata: we'll know from your revert if this is it
[18:37:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:37:15] yeah
[18:38:52] ottomata: there are some folders to update because of this
[18:38:55] yes
[18:39:24] let me list here the ones I can think of - webrequest (text, upload), pageview_actor, pageview, projectview
[18:40:24] virtual_pageview
[18:40:41] and I think that's it
[18:40:57] ok team, gone for tonight
[18:42:20] hmm, why is the load-wf-text job failing
[18:42:24] i see that it previously wrote bad perms
[18:42:33] oh, maybe it was my hive restart
[18:45:49] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/mediacounts/year=2022/month=3/day=7
[18:45:49] - T291664
[18:45:51] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:45:51] T291664: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664
[18:46:40] ? i can't rerun via hue anymore?
[18:46:45] JA009: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User otto does not have permission to submit application_1637058075222_691339 to queue production
[18:47:00] you shouldn't run it as yourself ottomata
[18:47:09] you probably have ticked something wrong
[18:47:25] user is set to analytics
[18:47:50] trying again from coordinator level
[18:57:42] seems to work from coordinator level
[18:57:50] just not when viewing it from workflow level
[19:03:48] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Milimetric) One question about the new proposed development...
[19:13:30] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/virtualpageview/hourly/year=2022/month=3/day=7 - revert of T291664
[19:13:32] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:13:32] T291664: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664
[19:14:45] !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/*/hourly/year=2022/month=3/day=7 to make sure perms are fixed after revert of T291664
[19:14:48] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:20:21] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Ottomata) Yes, but hopefully the CI will not allow you to m...
[19:26:09] 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664 (10Ottomata) I had to revert this change, as it somehow broke group ownership for some of the files our oozie loading jobs cr...
[19:29:38] 10Data-Engineering: Analysis: incomplete webrequest records - https://phabricator.wikimedia.org/T303215 (10Milimetric)
[19:31:30] 10Data-Engineering: Analysis: incomplete webrequest records - https://phabricator.wikimedia.org/T303215 (10Milimetric)
[19:49:44] 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10mpopov) If I remember right `npm test` fails if the updated...
[20:15:15] ottomata: when I run the run_dev_instance.sh script from airflow-dags, how come it only shows me one dag in the UI and I can't see other DAGs?
[20:28:31] what are you setting as your dags folder?
[20:36:10] milimetric: ^
[20:38:24] so I'm trying to find a good dev flow. Right now, trying:
[20:38:39] * send merge request on a branch
[20:38:48] * checkout on stat box
[20:39:01] * ./run_dev_instance.sh airflow
[20:39:59] this morning talking with Marcel and others I understood that when I do this, it loads up all the jobs in that folder and pauses them by default. So I was assuming that I could tunnel and unpause the job I'm working on, while tweaking the config
[20:40:34] ./run_dev_instance.sh airflow ?
[20:40:38] makes sense
[20:40:41] i think you want maybe ./run_dev_instance.sh analytics
[20:40:41] ?
[20:40:43] or analytics_test
[20:40:44] ?
[20:40:52] ah! yes
[20:41:47] and also set your desired port with '-p ' - this avoids several devs colliding on the same port
[20:45:03] sorry, I meant ./run_dev_instance analytics, that's what I'm doing. I double checked 'cause I'm an airhead
[20:46:08] oh won't run_dev_instance quit if the port is taken?
[20:46:37] (03PS6) 10Ottomata: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/767178 (owner: 10Joal)
[20:46:52] milimetric: which stat box?
[20:46:59] stat1004
[20:47:11] milimetric: yes, I guess it won't work if the port is taken
[20:47:31] milimetric: that sounds like it should work to me
[20:47:39] milimetric: wanna cave?
[20:47:49] the dag I see is useragent_distribution
[20:47:50] sure!
[20:47:58] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/767178 (owner: 10Joal)
[20:48:16] (in the cave)
[20:51:05] ok omw
[20:54:56] milimetric: just killed my 8080 webserver
[22:10:38] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Hmph. Okay I got some stuff working, although I'm not so sure anymore about calling `delete` on job startup. I c...
[22:15:15] 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) @fgiunchedi Q for you. I think using `task_number` in the groupingKey will increase cardinality. AFAIK the assoc...