[09:29:16] morning team.
[09:32:41] Morning!
[09:40:23] Just FYI I am looking into the failure of the `mediawiki-history-drop-snapshot.service` job.
[09:50:48] That job finished with an error: `ERROR Selected partitions extracted from table specs ({'snapshot=2021-12-06', 'snapshot=2021-11-29'}) does not match selected partitions extracted from data paths (set()). HDFS directories to check: []`
[09:51:03] Continuing to investigate.
[10:40:03] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) I have installed the hive connector and run the first ingestion. First I had to install pyhive in my conda environment by following the guidelines here: https://...
[14:02:22] I'm going to try re-running the service, but I don't hold out much hope that it won't fail in the same way.
[14:03:05] !log btullis@an-launcher1002:~$ sudo systemctl start mediawiki-history-drop-snapshot.service
[14:03:07] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:08:14] RECOVERY - Check unit status of mediawiki-history-drop-snapshot on an-launcher1002 is OK: OK: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:21:32] PROBLEM - Check unit status of mediawiki-history-drop-snapshot on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:41:59] joal and I have found that there are success flags missing from two recent snapshot directories.
[14:42:03] https://www.irccloud.com/pastebin/kKBMav9d/
[14:50:56] We created those files manually and restarted the service.
[14:51:00] https://www.irccloud.com/pastebin/Yhl8LdDw/
[14:51:14] !log btullis@an-launcher1002:~$ sudo systemctl start mediawiki-history-drop-snapshot.service
[14:51:16] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[15:03:16] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) I have: * installed the kafka plugin with: `pip install 'acryl-datahub[kafka]'` * created a basic recipe for importing topics from the kafka-jumbo cluster: ` sour...
[15:07:58] Data-Engineering, Data-Engineering-Kanban, Data-Catalog: Run Datahub on test cluster - https://phabricator.wikimedia.org/T299703 (BTullis) I have also ingested from both the analytics and public druid clusters. ` btullis@stat1008:~/src/datahub/ingestion$ cat druid.yml an-druid.yml source: type: "...
[15:13:08] RECOVERY - Check unit status of mediawiki-history-drop-snapshot on an-launcher1002 is OK: OK: Status of the systemd unit mediawiki-history-drop-snapshot https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[16:06:22] Data-Engineering, Data-Engineering-Kanban, Infrastructure-Foundations, SRE, and 3 others: Collect netflow data for internal traffic - https://phabricator.wikimedia.org/T263277 (JAllemandou) Open→Resolved
[16:06:36] Data-Engineering, Data-Engineering-Kanban, Metrics-Platform: Add user agent client hints to the `webrequest` table - https://phabricator.wikimedia.org/T299402 (JAllemandou) Open→Resolved
[16:06:40] Data-Engineering, Anti-Harassment, Metrics-Platform, Privacy Engineering, and 3 others: Measure user-agent client hints already sent in browsers requests - https://phabricator.wikimedia.org/T299397 (JAllemandou)
[16:14:24] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Run OpenMetadata in test cluster - https://phabricator.wikimedia.org/T300540 (Milimetric)
[16:36:39] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) For posterity sake: | 1010-b / 3.11.11 | | ---- | | {F34938213} | | 1014-a / 3.11.4 | | ---...
[16:44:05] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (JAllemandou) Yes ! The pattern difference is even more visible [[https://grafana-rw.wikimedia.org/exp...
[16:46:53] a-team eevans is going to proceed with the upgrade of the remaining aqs_next nodes to Cassandra 3.11.11
[16:47:07] ack btullis thanks for the ping :)
[16:47:20] thanks!
[16:47:45] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (BTullis) @Eevans is going to go ahead with the upgrade of the remaining nodes.
[17:01:32] Analytics, Data-Engineering, Event-Platform: jsonschema-tools tests should fail if schema $id does not match title or path - https://phabricator.wikimedia.org/T300404 (Ottomata) a:Ottomata→None
[17:02:32] Data-Engineering, Data-Engineering-Kanban: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (Ottomata) Cool, okay! Antoine, would you like to learn how to make a Puppet patch to do this? I can help show you how. (you can say no! :) )
[17:06:53] Data-Engineering, Data-Engineering-Kanban: Set SparkmaxPartitionBytes to 256MB - https://phabricator.wikimedia.org/T300299 (Ottomata) Actually, looking at code it might be not obvious to do it nicely. I think I can make this match the hadoop setting by default.
[17:28:59] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) The upgrade to 3.11.11 is complete.
[17:31:15] aqu1: (you are antoine, yes?)
[17:31:22] joal & aqu1 : https://gerrit.wikimedia.org/r/c/operations/puppet/+/758529
[17:31:27] look right?
[17:31:38] https://puppet-compiler.wmflabs.org/pcc-worker1003/33514/stat1004.eqiad.wmnet/index.html
[17:47:34] ottomata: it looks correct indeed :)
[17:49:53] (PS15) Phuedx: [WIP] Metrics Platform event schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[17:50:34] (CR) Phuedx: [WIP] Metrics Platform event schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/676392 (https://phabricator.wikimedia.org/T276379) (owner: Jason Linehan)
[18:08:56] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (JAllemandou) Thanks a lot @Eevans for the upgrade! I assume we'll leave it bake for a few days before...
[18:10:16] joal: I was just looking at that drop snapshot thing, just saw your email, sorry for the cross-talk, all good
[18:11:03] ack milimetric :) we need to talk about how we mitigate that with Airflow (no need for _SUCCESS files in Airflow with the hive sensor)
[18:11:49] yes, data retention in general is changing and will change again with Iceberg. I don't see any easy fix, just manually updating
[18:37:37] Analytics, Contributors-Team, MediaWiki-extensions-WikimediaEvents, mediawiki-extensions-eventlogging: Remove userAgent from Schema:PageContentSaveComplete - https://phabricator.wikimedia.org/T104863 (Aklapper) T123958#7015355 says "These schemas will probably be retired when schema migrate to ME...
[18:57:15] Analytics: Kerberos identity for bmansurov - https://phabricator.wikimedia.org/T300450 (leila) Approved. And welcome back bmansurov. :)
[19:06:15] Analytics-Clusters, Data-Engineering, Data-Engineering-Kanban, Cassandra, and 3 others: Investigate high levels of garbage collection on new AQS nodes - https://phabricator.wikimedia.org/T298516 (Eevans) >>! In T298516#7665016, @JAllemandou wrote: > Thanks a lot @Eevans for the upgrade! > I assum...
[19:22:38] (CR) Ebernhardson: [C: +2] rdf_streaming_updater/reconcile: fix schema id [schemas/event/secondary] - https://gerrit.wikimedia.org/r/757879 (owner: DCausse)
[19:23:21] (Merged) jenkins-bot: rdf_streaming_updater/reconcile: fix schema id [schemas/event/secondary] - https://gerrit.wikimedia.org/r/757879 (owner: DCausse)
[19:46:38] heya ottomata :] should the workflow_utils lib be accessible from analytics' airflow instance?
[19:47:05] I'm getting import errors when testing the dependency thing
[19:47:07] no i think i never deployed the .deb there
[19:47:14] lemme see
[19:49:53] hm no it is there
[19:50:20] yes it is accessible there mforns
[19:50:27] what's going wrong?
[19:50:31] oh, ok, then will troubleshoot!
[19:50:46] It says no module named workflow_utils
[19:50:47] 19:49:48 [@an-launcher1002:/home/otto] $ /usr/lib/airflow/bin/ipython
[19:50:47] In [1]: import workflow_utils
[19:50:48] worked for me
[19:51:00] ha!
[19:51:07] ok, will troubleshoot! thanks :]
[19:51:16] k
[19:58:42] ottomata: my bad, I was using an old conda env for the dev instance, recreated it and it worked. thanks!
[19:58:50] aye coo
[21:53:24] Analytics, Contributors-Team, MediaWiki-extensions-WikimediaEvents, mediawiki-extensions-eventlogging: Remove userAgent from Schema:PageContentSaveComplete - https://phabricator.wikimedia.org/T104863 (Krinkle) Open→Resolved a:Nuria According to the talk page for this Schema on Meta-Wi...
[21:57:38] Analytics, SRE, SRE-Access-Requests: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (DannyH)
[21:58:19] Analytics, SRE, SRE-Access-Requests: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (DannyH) I hope that I've done this correctly; please let me know if I've made a mistake. Thanks!
[22:22:11] Analytics, SRE, SRE-Access-Requests: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (RhinosF1) Adding Andrew & Olja as they normally approve for this group. @DannyH: it looks good. @Ladsgroup is on clinic duty this week and will pick it up for you! Please get yo...
[22:42:48] Analytics, SRE, SRE-Access-Requests: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (Ottomata) Approved!
[22:43:26] Analytics, SRE, SRE-Access-Requests: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (Ottomata) Looks like Danny will not need shell access, just ssh-keyless group membership.
[23:48:00] Analytics, SRE, SRE-Access-Requests: Add dhorn to analytics-privatedata-users - https://phabricator.wikimedia.org/T300579 (CDunn) Approved