[00:30:12] Data-Engineering: Some varnishkafka instances dropped traffic for a long time due to the wrong version of the package installed - https://phabricator.wikimedia.org/T300164 (kzimmerman) Thanks @Milimetric! Here's the main task we were using as we investigated pageviews: https://phabricator.wikimedia.org/T296875
[04:22:25] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_immediate on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_immediate https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[06:53:27] ottomata: thanks a lot for eventstreams <3
[07:01:12] I am wondering if eventlogging should be moved to kafka port 9093 though
[07:05:42] Analytics-Radar, Data-Engineering-Radar, Event-Platform, Patch-For-Review: Move Kafka Jumbo's TLS clients to the new bundle - https://phabricator.wikimedia.org/T296064 (elukey) From what I can see from netstat on Jumbo nodes, all the clients that may be affected by this transition have been porte...
[07:09:51] Data-Engineering, serviceops, Patch-For-Review: Move kafka clusters to fixed uid/gid - https://phabricator.wikimedia.org/T296982 (elukey) Resolved→Open Forgot a couple of things to do for a complete cleanup: 1) We should move deployment-prep's clusters as well to the fixed uid/gid. 2) The `p...
[08:04:17] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[08:13:52] Hi btullis - happy friday :)
[08:15:13] btullis: the memory-leak pattern on AQS new nodes is getting clearer and clearer - thanks a lot for investigating, finding and demonstrating the solution :)
[09:19:16] Thanks joal. <3 I'll speak to eevan.s to find out what his preferred approach would be to rolling out 3.11.11 more widely.
[09:45:26] Data-Engineering, Data-Engineering-Kanban, Airflow, Epic: Low Risk Oozie Migration: Mediawiki History Dumps - https://phabricator.wikimedia.org/T300344 (JAllemandou)
[09:47:51] joal: As it is, I'd be inclined to depool the one aqs_next host (aqs1011) at some point today, so we don't risk an outage over the weekend.
[09:52:10] btullis: hm, I hear your point, but I also would love to confirm the pattern for a few days (heap of the 1010 node is somewhat different, but doesn't lower as much as I'd like) - Do you wish we depool and wait next week, or we keep it this weekend?
[09:56:19] I'm happy to take advice from others, so I'm just going to comment on the ticket, leave it pooled, and see if Eric has an opinion later today. When we ran it last time it lasted over the two week holidays before it crashed on one instance, so the risk of leaving it this weekend is quite low I think.
[09:58:09] We may want to take another heap dump too, in case we think that there is still a memory leak of some kind even with 3.11.11
[09:58:15] btullis: I think it is fine to leave it running, maybe you can send an email to ops@ informing people about the experiment and telling to depool in case something feels wrong
[09:58:33] elukey: Thanks. Good idea.
[09:58:36] (I will not be online sunday/monday so I cannot even say that I'll depool if I see something weird)
[09:58:50] (I can only check tomorrow :D)
[09:59:51] Even you need time away from the keyboard Luca :D
[10:01:21] https://usercontent.irccloud-cdn.com/file/bYNbcXgz/image.png
[10:02:08] It's interesting that even the two instances on aqs1010 (the upgraded node) are so different from each other.
[10:12:21] I'm out and about both Saturday and Sunday daytime, which is why I was maybe a bit twitchy about it, but I will be here on Monday.
[10:16:36] Quarry, Discovery, VPS-Projects, Wikidata, and 3 others: Setup sparqly service at https://sparqly.wmflabs.org/ (like Quarry) - https://phabricator.wikimedia.org/T104762 (So9q) >>! In T104762#3805254, @Lucas_Werkmeister_WMDE wrote: >>>! In T104762#2635939, @Multichill wrote: >> With the current S...
[10:22:48] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) I'd say that the pattern is clearer again this morning, with both aqs1010 instances exhibiti...
[10:35:54] (PS6) Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[10:46:54] (PS1) DCausse: rdf_streaming_updater/reconcile: fix schema id [schemas/event/secondary] - https://gerrit.wikimedia.org/r/757879
[10:51:18] (PS7) Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[11:09:49] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) As per the previous update: T299703#7648663 I have started using stat1008 for development, but I am doing all work as my own unprivileged user account. No root. N...
[11:34:39] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) The kafka topic names are as follows, from: https://github.com/linkedin/datahub/blob/master/docker/kafka-setup/Dockerfile ` ENV METADATA_AUDIT_EVENT_NAME="Metadat...
[12:02:16] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) I have pre-configured opensearch, but ran into one small issue that I hope will not cause a problem at this stage. The setup script has two parts: * Install an...
[12:23:15] Analytics, Analytics-Wikistats: Confusing filtering on "Active editors by country" topic - https://phabricator.wikimedia.org/T300365 (Pcoombe)
[12:42:25] Analytics, Analytics-Wikistats, Data-Engineering: Wikistats New Feature - https://phabricator.wikimedia.org/T299820 (Pcoombe) Open→Stalled You need to provide at least some information on what new feature you are requesting. Please see https://www.mediawiki.org/wiki/How_to_report_a_bug for gu...
[12:43:38] Analytics, Analytics-Wikistats: Wikistats New Feature - regional and group evaluations - https://phabricator.wikimedia.org/T264512 (Pcoombe)
[13:17:02] elukey: fyi the pki cloud environment should have the same intermediates as production now
[13:17:08] let me know if you hit an issue
[13:32:29] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) The datahost-gms service is up and running.
[13:40:59] (PS8) Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[13:57:12] jbond: <3
[14:20:03] (PS9) Joal: Integrate SparkSQLNoCLIDriver and HiveToCassandra [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934)
[14:33:40] (CR) Joal: [V: +1] "Finally! this has been tested on the cluster and can be merged after +1s :)" [analytics/refinery/source] - https://gerrit.wikimedia.org/r/757495 (https://phabricator.wikimedia.org/T297934) (owner: Joal)
[14:39:22] btullis: if you have a few minutes I wish to share with you my latest ideas of Spark as a presto replacement
[14:39:34] Sure thing. bc?
[14:41:56] Actually, maybe I should get a coffee. 45 past?
[14:43:01] Arf btullis - I just got a call from a neighbour asking for help, and after that I'm going for kids - either later after kids or on Monday :)
[14:43:36] sorry for the counter-plan :S
[14:43:49] And, gone for now, back after kids
[14:50:18] No worries. Look forward to it whenever :-)
[16:13:26] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (Eevans) a:codebug
[16:30:27] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (Eevans) After some discussion, we decided to rewrite/refactor our existing (unit-style) tests as integration tests (using Go). I've stubbed out tests for the //per-...
[16:32:51] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (Eevans) p:Triage→Medium
[16:45:36] Data-Engineering, Platform Engineering Roadmap, User-Eevans: Pageviews integration testing - https://phabricator.wikimedia.org/T299735 (Eevans)
[16:56:22] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) >>! In T298516#7659099, @BTullis wrote: [ ... ] > @Eevans - what course of action do you thin...
[17:01:50] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) > As a next-step, I propose to complete the upgrade to 3.11.11, and repeat the test for a co...
[17:22:09] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) I've now got all services up and running so datahub is accessible on port 9002 of stat1008 via SSH tunneling: `ssh -N -L9002:localhost:9002 stat1008.eqiad.wmnet`...
[17:24:41] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) To begin with, I have installed the `datahub` client in a conda environment with: ` source conda-activate-stacked pip install 'acryl-datahub[datahub-rest]' ` This...
[17:25:53] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 2 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (JAllemandou) Good for me too! Thanks @BTullis and @Eevans :)
[17:27:08] Hi btullis - I have some time now if you're around, otherwise next week :)
[17:27:27] Yes, let's go. Batcave?
[17:27:32] On my way!
[19:53:25] (CR) Ebernhardson: [C: +1] "Seems reasonable, but wondering if changing an existing (invalid) schema requires any manual steps as well" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/757879 (owner: DCausse)
[21:18:12] Analytics, Data-Engineering, Event-Platform: jsonschema-tools tests should fail if schema $id does not match title or path - https://phabricator.wikimedia.org/T300404 (Ottomata)
[21:19:02] (CR) Ottomata: "Hm, I'm surprised tests let this through! Filed a task to investigate why they did: https://phabricator.wikimedia.org/T300404" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/757879 (owner: DCausse)
[21:19:10] (CR) Ottomata: [C: +1] rdf_streaming_updater/reconcile: fix schema id [schemas/event/secondary] - https://gerrit.wikimedia.org/r/757879 (owner: DCausse)
[21:19:54] I'm coming soooo late on a Friday to this channel, and I may be sharing with you something that you (the data engineering team) already know. but I thought let me say it here just in case. :D see below.
[21:21:32] sooo, in https://arxiv.org/abs/2201.00812 surfaces the importance of the clickstream data that your team maintains really clearly. The paper argues that for the majority of the research questions that external (w.r.t. WMF) researchers hope to answer, synthetic data and clickstream data would be sufficient.
[21:22:09] Thank you for the work you have been doing to maintain this data, and responding to public needs that otherwise would quickly boil down to access to the webrequest logs.
[21:22:44] that's it. ;)
[21:23:37] wow cool! leila I will paste your comment on our team slack so that olja sees it too :)
[21:23:55] nice nice. thanks! :) (I didn't know how to do this in slack. :D)
[21:37:04] Thanks leila. That article looks fascinating! I'll give it a good read over the weekend.
[22:19:06] btullis: oh! nice. credit to all the authors. enjoy! :)
[22:36:39] <3
[22:40:58] it's nice, no?! <3
[23:40:03] "More broadly, this study provides an example for how clickstream-like data can generally enable research on user navigation on online platforms while protecting users' privacy." -- I like that bit! :)