[00:00:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[00:10:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: (2) Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage  - https://alerts.wikimedia.org
[00:15:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: (2) Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage  - https://alerts.wikimedia.org
[00:36:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[00:41:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[02:51:37] <wikibugs>	 10Data-Engineering: reset kerberos password - https://phabricator.wikimedia.org/T303146 (10Effeietsanders)
[02:52:01] <wikibugs>	 10Data-Engineering: reset kerberos password - https://phabricator.wikimedia.org/T303146 (10Effeietsanders)
[04:46:11] <wikibugs>	 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy)
[04:48:05] <wikibugs>	 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy)
[04:51:37] <wikibugs>	 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy) If someone has a better idea if `opis/json-schema` or `justinrainbow/json-schema` (preferably the l...
[04:54:15] <wikibugs>	 10Data-Engineering, 10Librarization, 10MediaWiki-extensions-EventLogging, 10MediaWiki-extensions-JsonData: Librarise Libs/JsonSchemaValidation or replace - https://phabricator.wikimedia.org/T303131 (10Reedy)
[05:41:36] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Marostegui) I don't know what's the status of this anymore as I have been on holid...
[06:48:01] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Zache) Currently works for me in toolforge db.
[07:40:57] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Marostegui) To be clear, this was only fixed on two hosts and on some wikis, but n...
[07:42:44] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10DBA, 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Ladsgroup) List of wikis that have this table: https://noc.wikimedia.org/conf/dbli...
[08:07:15] <jinxer-wm>	 (EventgateLoggingExternalLatency) firing: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:12:15] <jinxer-wm>	 (EventgateLoggingExternalLatency) resolved: Elevated latency for POST events on eventgate-logging-external in eqiad. - https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate - https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?viewPanel=79&orgId=1&var-service=eventgate-logging-external - https://alerts.wikimedia.org
[08:20:29] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Persistence (Consultation), 10Data-Services, and 2 others: Toolforge db: View 'fiwiki_p.flaggedrevs' references invalid table/column/rights to use them - https://phabricator.wikimedia.org/T302233 (10Marostegui)
[09:46:16] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (10BTullis) This is the required DNS change to add the service name: `datahubsearch.svc.eqiad.wmnet` https://gerrit.wikimedia.org/r...
[09:47:04] <wikibugs>	 10Data-Engineering, 10Event-Platform, 10SRE, 10Traffic, and 2 others: Banner sampling leading to a relatively wide site outage (mostly esams) - https://phabricator.wikimedia.org/T303036 (10Ladsgroup)
[09:51:31] <wikibugs>	 10Data-Engineering, 10Data-Catalog, 10SRE, 10serviceops, 10Service-deployment-requests: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (10JMeybohm) p:05Triage→03Medium
[10:22:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:24:42] <addshore>	 joal: I'd love to steal a few minuites of your time for some thoughts on how I might be able to better perform a query I'm running on pageview_hourly :)
[10:25:00] <joal>	 Hi addshore - meetings now - at lunch time?
[10:25:05] <addshore>	 sounds great!
[10:25:18] <addshore>	 want to pick a specific time or just ping?
[10:27:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:32:57] <btullis>	 elukey: quick question re: this hive JVM flapping alert. Is it usually sufficient to restart the service on an-coord1001, or would you do a DNS based failover to an-coord1002 before restarting, then fail back?
[10:36:17] <elukey>	 btullis: o/ a failover would probably avoid any job failure, but maybe 80% is low nowadays.. 90% threshold could be good as well. The other thing that I recall is that hive uses more and more memory over time, like there was a little leak (never investigated it though, I recall to have checked some metrics a while ago)
[10:36:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:45:21] <wikibugs>	 10Analytics, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: 502, connect failed for intake-analytics.wikimedia.beta.wmflabs.org (Mar 2022) - https://phabricator.wikimedia.org/T303160 (10AlexisJazz)
[10:51:27] <btullis>	 elukey: Yes, I can see that we seem to have a gradual pattern of leaking: https://grafana.wikimedia.org/d/000000379/hive?orgId=1&var-instance=an-coord1001&viewPanel=7&from=now%2Fy&to=now&var-datasource=eqiad%20prometheus%2Fanalytics&var-cluster=analytics
[10:51:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-coord1001:10100 - https://alerts.wikimedia.org
[10:51:36] <btullis>	 I'll do a failover and restart.
[10:55:01] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define LVS load-balancing for OpenSearch cluster - https://phabricator.wikimedia.org/T301458 (10BTullis) This is the required change to start setting up the LVS configuration: https://gerrit.wikimedia.org/r/c/operations/pupp...
[10:55:43] <elukey>	 btullis: if possible let's do a dump of the heap space (or whatever is growing), so we can follow up later
[11:14:06] <wikibugs>	 10Analytics, 10Beta-Cluster-Infrastructure, 10Beta-Cluster-reproducible: 502, connect failed for intake-analytics.wikimedia.beta.wmflabs.org (Mar 2022) - https://phabricator.wikimedia.org/T303160 (10Majavah)
[11:30:28] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis)
[11:30:51] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis) p:05Triage→03Medium
[11:33:37] <joal>	 Heya addshore - is now good?
[11:35:11] <addshore>	 Yes!
[11:35:35] <joal>	 addshore: meet.google.com/cek-oxpa-mge
[11:38:05] <btullis>	 !log failing over hive to an-coord1001 T303168
[11:38:09] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:38:10] <stashbot>	 T303168: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168
[11:51:31] <btullis>	 !log obtaining summary of heap objects and sizes: `hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16971 > hive-object-storage-and-sizes.T303168.txt`
[11:51:33] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[11:51:34] <stashbot>	 T303168: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168
[11:52:49] <btullis>	 !log obtaining heap dump: `hive@an-coord1001:/srv/hive-tmp$ jmap -dump:format=b,file=hive_server2_heap_T303168.bin 16971`
[11:52:51] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:10:12] <btullis>	 !log restarted hive-server2 process on an-coord1001
[12:10:14] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:18:30] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis) ` Heap dump file created hive@an-coord1001:/srv/hive-tmp$ ls -lh total 8.3G -rw-r--r-- 1 hive hive  19M Mar  7 11:51 hive-object-storage-and-sizes.T303...
[12:21:57] <milimetric>	 ottomata: just fyi I accepted our -internal message (it was held 'cause "Message has more than 10 recipients")
[12:22:04] <milimetric>	 *your (not our)
[12:38:44] <wikibugs>	 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10LSobanski) @Jenlenfantwright @LNguyen This task changed state twice (in February and in March) despit...
[12:45:44] <aqu_>	 !log About to deploy airflow-dags/analytics - Migrates wikidata/item_page_link 
[12:45:46] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:48:29] <dsaez>	 hey a-team, question: is it safe to consider that revision ids will be always correlative? ie. if rev_id_x > rev_id_y implies that timestamp(rev_id_x) > timestamp(rev_id_y), within the same project?
[12:50:02] <btullis>	 dsaez: I'm afraid I don't know. It seems logical, but I haven't got a categorical answer for you, sorry.
[12:50:30] <dsaez>	 btullis, np, thanks. 
[12:51:04] <joal>	 Hi dsaez - I wouldn't trust that - I agree it's natural, but sometimes data is not "logic" :)
[12:52:15] <dsaez>	 specially when it is mediawiki data :D thanks joal.
[12:53:11] <joal>	 dsaez: for an approximation, it should be fine (on most cases), but there always is special cases
[12:57:53] <dsaez>	 got it
[12:58:14] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Investigate trend of gradual hive server heap exhaustion - https://phabricator.wikimedia.org/T303168 (10BTullis) Reference heap dumps created from a freshly restarted hive-server2 process. ` hive@an-coord1001:/srv/hive-tmp$ jmap -histo:live 16616 > hive-object-sto...
[13:02:12] <wikibugs>	 (03CR) 10Aqu: [C: 03+2] "- Add HQL file triggered from Airflow" [analytics/refinery] - 10https://gerrit.wikimedia.org/r/763219 (owner: 10Aqu)
[13:02:38] <wikibugs>	 (03CR) 10Aqu: [V: 03+2 C: 03+2] Migrate wikidata/item_page_link/weekly from Oozie to Airflow [analytics/refinery] - 10https://gerrit.wikimedia.org/r/763219 (owner: 10Aqu)
[13:05:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[13:09:35] <aqu_>	 !log About to deploy analytics/refinery - Migrate wikidata/item_page_link/weekly from Oozie to Airflow
[13:09:37] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:10:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[13:30:50] <wikibugs>	 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata) Discussions that caused changes to the task here are all in the comments.  Some notes were...
[13:34:13] <wikibugs>	 10Analytics, 10Data-Engineering, 10Event-Platform, 10Platform Engineering, 10tech-decision-forum: MediaWiki Event Carried State Transfer - Problem Statement - https://phabricator.wikimedia.org/T291120 (10Ottomata) Oh, BTW in case you weren't aware, the Decision Record we are submitting now is explicitly...
[13:45:02] <ottomata>	 joal: helloOoOOo
[13:46:16] <ottomata>	 also aqu_  looking into the wikidata/entity job failkure
[13:46:30] <ottomata>	 i'm not 100% sure what needs done. I see the MR you mentioned was merged
[13:46:44] <ottomata>	 should I just deploy airflow-dags?
[13:46:51] <ottomata>	 and then rerun the airflow job?
[13:54:32] <aqu_>	 Hello!
[13:57:06] <ottomata>	 aqu_:  hello!
[13:57:27] <aqu_>	 airflow-dags/analytics has been deployed. And I am finishing the deployment of refinery.
[13:57:32] <ottomata>	 oh okay!
[13:57:56] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10I18n: WikiReportsLocalizations.pm still fetches language names from SVN - https://phabricator.wikimedia.org/T64570 (10Ottomata) 05Open→03Declined Hi, my understanding is that WikiStats 1 is deprecated and maintained as a static site with minimal...
[13:58:09] <aqu_>	 But actually, I am doing it. Don't worry. Because it's important for me to check it's working well.
[13:58:50] <aqu_>	 So now, I am going to kill the Oozie job and schedule the AF one.
[13:59:24] <wikibugs>	 10Analytics-Wikistats, 10Data-Engineering, 10Browser-Support-Opera: Opera 15+ seems not to be recognized correctly - https://phabricator.wikimedia.org/T61816 (10Ottomata) This is not an ops week task.  I think this should be groommed with wikistats maintainers and prioritized accordingly.  Moving back to inc...
[13:59:39] <ottomata>	 okay, thanks aqu_ 
[13:59:53] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering: Confusing filtering on "Active editors by country" topic - https://phabricator.wikimedia.org/T300365 (10Ottomata) This is not an ops week task.  I think this should be groommed with wikistats maintainers and prioritized accordingly.  Moving back to inco...
[14:01:03] <joal>	 Hi ottomata :)
[14:01:52] <ottomata>	 hello!
[14:02:06] <ottomata>	 FYI amm discussing some things with filippo in #wikimedia-observability
[14:03:32] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering, 10Data-Engineering-Kanban, and 5 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata) 05Open→03Resolved a:03Ottomata
[14:03:52] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Kanban, 10Data-Engineering-Radar, and 5 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata) 05Resolved→03Open Oh, sorry should not have r...
[14:03:58] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Kanban, 10Data-Engineering-Radar, and 5 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata)
[14:14:37] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of rhuang-ctr - https://phabricator.wikimedia.org/T302194 (10Ottomata) Hi @CMacholan, please confirm that it is okay to delete the following data belonging to rhuang-ctr:  ` ====== stat1005 ====== total 16 drwxr-xr-x 7 34282 wikidev 4096 Nov  2 17:55 Editing-movement...
[14:25:34] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of ema - https://phabricator.wikimedia.org/T302815 (10Ottomata) 05Open→03Resolved a:03Ottomata Ema does not have any data on stat boxes or in HDFS or in Hive.  Removed his homedirs following https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Full_rem...
[14:27:44] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Ottomata) @Gehel  can you confirm that it is okay to delete the following files and homedirs belong to Zbyszko?  ` ====== stat1004 ====== total 240320 -rw-rw-r--  1 22656 wikidev       683 Sep 13 11:43 clus...
[14:44:02] <btullis>	 !log failing back hive services to an-coord1001
[14:44:04] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:49:20] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Gehel) Looks good to be deleted, but I'd like @dcausse to confirm.
[14:51:47] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Discussed in IRC with @joal and @fgiunchedi.  Summary:  For all task related metrics, we should be able to get Kaf...
[14:51:57] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Gehel) @Ottomata I confirmed with David, nothing to salvage here, please delete.
[14:53:04] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban: Check home/HDFS leftovers of bumeh-ctr - https://phabricator.wikimedia.org/T300607 (10Ottomata) a:05BTullis→03None Ok thanks, we will check back in during or after the week of March 21 to give you a little time for onboarding and to verify.  Tha...
[14:55:41] <wikibugs>	 10Data-Engineering: Check home/HDFS leftovers of zpapierski - https://phabricator.wikimedia.org/T302779 (10Ottomata) 05Open→03Resolved a:03Ottomata Thank you!  Done following https://wikitech.wikimedia.org/wiki/Data_Engineering/Ops_week#Full_removal_of_files_and_Hive_databases_and_tables   ` 14:23:07 [@an-...
[14:56:27] <jinxer-wm>	 (HiveServerHeapUsage) firing: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[14:58:30] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Data-Catalog: datahubsearch nodes alerting with "Rate of JVM GC Old generation-s runs" - https://phabricator.wikimedia.org/T302818 (10BTullis) > I would move this section: >  > `file { '/usr/share/opensearch/plugins': >    ensure => 'directory', >...
[14:59:01] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10JAllemandou) I have a wonder about option 2: the Prometheus PushGateway doc says, about using 'POST': //POST works exactly l...
[14:59:03] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow, 10Data-Catalog: Complete monitoring setup of datahubsearch nodes - https://phabricator.wikimedia.org/T302818 (10BTullis) p:05Triage→03Medium
[15:01:27] <jinxer-wm>	 (HiveServerHeapUsage) resolved: Hive Server JVM Heap usage is above 80% on an-test-coord1001:10100 - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hive/Alerts#Hive_Server_Heap_Usage - https://grafana.wikimedia.org/d/000000379/hive?panelId=7&fullscreen&orgId=1&var-instance=an-test-coord1001:10100 - https://alerts.wikimedia.org
[15:03:29] <joal>	 gone for kids - back at standup
[15:03:39] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Data-Catalog, 10Patch-For-Review: Define the Helm charts and helmfile deployments for Datahub - https://phabricator.wikimedia.org/T301454 (10BTullis) Moving this to //in review// whilst {T303049} is being handled by the Service Ops team.
[15:06:57] <wikibugs>	 (03CR) 10Aklapper: "This has been merged but will never be deployed" [analytics/wikistats] - 10https://gerrit.wikimedia.org/r/316289 (https://phabricator.wikimedia.org/T64570) (owner: 10Paladox)
[15:14:14] <addshore>	 joal: so you would recommend a join over something like `AND page_title IN ( SELECT * FROM addshore_temp_topic_pages_all )` ?
[15:18:04] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Good catch, I just verified this.  Without using a distinct groupingKey, tasks will override each other's metrics...
[15:18:56] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Hm, since we have to use distinct groupingKeys per task always, I guess it doesn't matter if we use POST or PUT.  Hm.
[16:00:10] <joal>	 addshore: I'd express that as a join rather than an IN, but yes, that is it
[16:05:33] <wikibugs>	 10Data-Engineering-Kanban: Projectviews by country Airflow job - https://phabricator.wikimedia.org/T303193 (10Milimetric)
[16:05:50] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban: Projectviews by country Airflow job - https://phabricator.wikimedia.org/T303193 (10Milimetric) a:03Milimetric
[16:17:11] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Radar, 10Product-Analytics, and 4 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata) a:05Ottomata→03None
[16:17:24] <wikibugs>	 10Analytics, 10Analytics-Wikistats, 10Data-Engineering-Radar, 10Product-Analytics, and 4 others: Wikistats pageview data missing counts for Mobile App pageviews on Commons, going back to 2020-11 - https://phabricator.wikimedia.org/T299439 (10Ottomata)
[16:25:35] <joal>	 milimetric: would you please review the events I added to the timeline?
[16:25:52] <joal>	 Still adding some, but most of the thing is in place I think
[16:26:05] <milimetric>	 I just did joal, thank you!  I was taking forever finding that Feb 16 date and by the time I went back to the timeline you had easily won :)
[16:31:37] <joal>	 ottomata: looks like we have some errors for druid jobs - may I leet you investigate?
[16:53:48] <wikibugs>	 10Data-Engineering, 10Airflow: [Airflow] Troubleshoot traffic anomaly detection job - https://phabricator.wikimedia.org/T303199 (10mforns)
[16:55:17] <wikibugs>	 10Data-Engineering, 10Airflow: [Airflow] Troubleshoot traffic anomaly detection job - https://phabricator.wikimedia.org/T303199 (10mforns) https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/28
[16:58:08] <wikibugs>	 10Data-Engineering, 10Data-Engineering-Kanban, 10Airflow: Airflow-triggered Spark-jobs produce hdfs-files belonging to the wrong hdfs-user-group - https://phabricator.wikimedia.org/T303201 (10Antoine_Quhen)
[17:39:29] <ottomata>	 joal:  looking
[17:43:22] <ottomata>	 btullis: an oozie ruid load job has been failling for aboutt 3 or 4 hours, am wondering if it is related to an-coord1001 failover stuff?
[17:51:50] <ebernhardson>	 i dunno if related, but we are getting this from a regular job that reads web requests now: org.apache.hadoop.security.AccessControlException: Permission denied: user=analytics-search, access=READ_EXECUTE, inode="/wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7/hour=14":analytics:hdfs:drwxr-x---
[17:52:09] <ottomata>	 thank you ebernhardson  interesting.
[17:56:23] <joal>	 ottomata: my assumption is that this is related to your change on hive config about group ownership - the hive-server probably hadn't been restarted, and the change only got picked
[17:56:30] <ottomata>	 hmmmmmmmmmmm
[17:56:38] <ottomata>	 interesting.
[17:57:29] <ottomata>	 https://phabricator.wikimedia.org/T291664
[17:58:51] <joal>	 ottomata: hdfs dfs -ls /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7/
[17:59:15] <joal>	 ottomata: different group, and we have disabled all-readership
[17:59:23] <ottomata>	 yes
[17:59:56] <ottomata>	 but, i hadn't expected the ownership to be affected, just the  perms
[18:00:00] <ottomata>	 but, investigating
[18:00:04] <ottomata>	 i think you are probably right
[18:34:33] <ottomata>	 !log restarting hive-server2 on an-coord1001 to revert hive.warehouse.subdir.inherit.perms change - T291664
[18:34:36] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:34:36] <stashbot>	 T291664: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664
[18:35:16] <ottomata>	 yes, joal i'm not quite understanding why tho; from what I can tell by disabling the thing, it should have used default hadoop perms, which inherit from parent.
[18:35:21] <ottomata>	 i'm reverting
[18:37:04] <ottomata>	 !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/webrequest/webrequest_source=text/year=2022/month=3/day=7 - after reverting - T291664
[18:37:05] <joal>	 ottomata: we'll know from your revert if this is it
[18:37:07] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:37:15] <ottomata>	 yeajh
[18:38:52] <joal>	 ottomata: there are some folders to update because of this
[18:38:55] <ottomata>	 yes
[18:39:24] <joal>	 let me list here the ones I can think of - webrequest (text, upload), pageview_actor, pageview, projectview
[18:40:24] <joal>	 virtual_pageview
[18:40:41] <joal>	 and I think that's it
[18:40:57] <joal>	 ok team, gone for tonight
[18:42:20] <ottomata>	 hmm, why  is the load-wf-text job failing
[18:42:24] <ottomata>	 i see that it previously wrote bad perms
[18:42:33] <ottomata>	 oh, maybe it was my hive restart
[18:45:49] <ottomata>	 !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/mediacounts/year=2022/month=3/day=7
[18:45:49] <ottomata>	 - T291664
[18:45:51] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[18:45:51] <stashbot>	 T291664: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664
[18:46:40] <ottomata>	 ? i can't rerun via hue anymore?
[18:46:45] <ottomata>	 JA009: org.apache.hadoop.yarn.exceptions.YarnException: org.apache.hadoop.security.AccessControlException: User otto does not have permission to submit application_1637058075222_691339 to queue production
[18:47:00] <joal>	 you shpouldn't run it as yourself ottomata 
[18:47:09] <joal>	 ou probably have ticked something wrong
[18:47:25] <ottomata>	 user is set to analytics
[18:47:50] <ottomata>	 trying again from coordinator level
[18:57:42] <ottomata>	 seems to work from coordinator level
[18:57:50] <ottomata>	 just not when viewing it from workflow level
[19:03:48] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Milimetric) One question about the new proposed development...
[19:13:30] <ottomata>	 !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/virtualpageview/hourly/year=2022/month=3/day=7 - revert of T291664
[19:13:32] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:13:32] <stashbot>	 T291664: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664
[19:14:45] <ottomata>	 !log sudo -u hdfs kerberos-run-command hdfs hdfs dfs -chgrp -R analytics-privatedata-users /wmf/data/wmf/*/hourly/year=2022/month=3/day=7 to make sure perms are fixed after revert  of T291664
[19:14:48] <stashbot>	 Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[19:20:21] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Ottomata) Yes, but hopefully the CI will not allow you to m...
[19:26:09] <wikibugs>	 10Analytics-Clusters, 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban: Set hive.warehouse.subdir.inherit.perms to false - https://phabricator.wikimedia.org/T291664 (10Ottomata) I had to revert this change, as it somehow broke group ownership for some of the files our oozie loading jobs cr...
[19:29:38] <wikibugs>	 10Data-Engineering: Analysis: incomplete webrequest records - https://phabricator.wikimedia.org/T303215 (10Milimetric)
[19:31:30] <wikibugs>	 10Data-Engineering: Analysis: incomplete webrequest records - https://phabricator.wikimedia.org/T303215 (10Milimetric)
[19:49:44] <wikibugs>	 10Analytics, 10Data-Engineering, 10Data-Engineering-Kanban, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10mpopov) If I remember right `npm test` fails if the updated...
[20:15:15] <milimetric>	 ottomata: when I run the run_dev_instance.sh script from airflow-dags, how come it only shows me one dag in the UI and I can't see other DAGs?
[20:28:31] <ottomata>	 what are you settting as your dags folder?
[20:36:10] <ottomata>	 milimetric: ^
[20:38:24] <milimetric>	 so I'm trying to find a good dev flow.  Right now, trying:
[20:38:39] <milimetric>	 * send merge request on a branch
[20:38:48] <milimetric>	 * checkout on stat box
[20:39:01] <milimetric>	 * ./run_dev_instance.sh airflow
[20:39:59] <milimetric>	 this morning talking with Marcel and others I understood that when I do this, it loads up all the jobs in that folder and pauses them by default.  So I was assuming that I could tunnel and unpause the job I'm working on, while tweaking the config
[20:40:34] <ottomata>	 ./run_dev_instance.sh airflow ?
[20:40:38] <mforns>	 makes sense
[20:40:41] <ottomata>	 i think you want maybe ./run_dev_instance.sh analytics
[20:40:41] <ottomata>	 ?
[20:40:43] <ottomata>	 or analytics_test
[20:40:44] <ottomata>	 ?
[20:40:52] <mforns>	 ah! yes
[20:41:47] <mforns>	 and also set your desired port with '-p <port_num>' this avoids that several devs collide in the same port
[20:45:03] <milimetric>	 sorry, I meant ./run_dev_instance analytics, that's what I'm doing.  I double checked 'cause I'm an airhead
[20:46:08] <milimetric>	 oh won't run_dev_instance quit if the port is taken? 
[20:46:37] <wikibugs>	 (03PS6) 10Ottomata: [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/767178 (owner: 10Joal)
[20:46:52] <ottomata>	 milimetric:  which stat box?
[20:46:59] <milimetric>	 stat1004
[20:47:11] <mforns>	 milimetric: yes, I guess it won't work if the port is taken
[20:47:31] <ottomata>	 milimetric:  that sounds like it should work to mme
[20:47:39] <mforns>	 milimetric: wanna cave?
[20:47:49] <milimetric>	 the dag I see is useragent_distribution
[20:47:50] <milimetric>	 sure!
[20:47:58] <wikibugs>	 (03CR) 10jerkins-bot: [V: 04-1] [WIP] Add prometheus metrics reporter [analytics/gobblin-wmf] - 10https://gerrit.wikimedia.org/r/767178 (owner: 10Joal)
[20:48:16] <milimetric>	 (in the cave)
[20:51:05] <mforns>	 ok omw
[20:54:56] <ottomata>	 milimetric:  just killed my 8080 webserver
[22:10:38] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) Hmph.  Okay I got some stuff working, although I'm not so sure anymore about calling `delete` on job startup.  I c...
[22:15:15] <wikibugs>	 10Analytics-Kanban, 10Data-Engineering, 10Data-Engineering-Kanban, 10Patch-For-Review: Send some existing Gobblin metrics to prometheus - https://phabricator.wikimedia.org/T294420 (10Ottomata) @fgiunchedi Q for you.  I think using `task_number` in the groupingKey will increase cardinality.  AFAIK the assoc...