[00:05:56] Data-Engineering, Data-Persistence, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans)
[00:06:49] Data-Engineering, Data-Persistence, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (Eevans) p:Triage→Medium
[03:22:25] Data-Engineering, Platform Engineering Roadmap: Audit/review pageviews test cases - https://phabricator.wikimedia.org/T305502 (BPirkle) I did an initial inventory of the current production service test file ([[ https://gerrit.wikimedia.org/r/plugins/gitiles/analytics/aqs/+/refs/heads/master/test/features...
[05:10:26] Data-Engineering: PySpark is unable to find Hive tables - https://phabricator.wikimedia.org/T305457 (bmansurov) Open→Resolved Thank you, both!
[07:23:05] (CR) Awight: "This change is ready for review." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/778207 (https://phabricator.wikimedia.org/T305028) (owner: Awight)
[07:24:35] (CR) jerkins-bot: [V: -1] Make some fields optional [schemas/event/secondary] - https://gerrit.wikimedia.org/r/778207 (https://phabricator.wikimedia.org/T305028) (owner: Awight)
[07:55:12] (VarnishkafkaNoMessages) firing: ...
[07:55:12] varnishkafka for instance cp3050:9132 is not logging cache_text requests from eventlogging - https://wikitech.wikimedia.org/wiki/Analytics/Systems/Varnishkafka - https://grafana.wikimedia.org/d/000000253/varnishkafka?orgId=1&var-datasource=esams%20prometheus/ops&var-source=eventlogging&var-cp_cluster=cache_text&var-instance=cp3050:9132&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DVarnishkafkaNoMessages
[08:16:46] (Abandoned) Awight: Make some fields optional [schemas/event/secondary] - https://gerrit.wikimedia.org/r/778207 (https://phabricator.wikimedia.org/T305028) (owner: Awight)
[08:17:34] ^ Maybe I'm not doing this right, but it feels very awkward that I can't make required legacy fields optional in a new schema.
[08:20:51] ... and since the value "0" had a meaning, I'll have to send a magic number like "-1" in these deprecated fields, which is something better avoided.
[08:27:46] RECOVERY - eventgate-analytics-external validation error rate too high on alert1001 is OK: (C)2 gt (W)1 gt 0.89 https://wikitech.wikimedia.org/wiki/Event_Platform/EventGate https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-analytics-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos
[08:46:38] (PS1) Btullis: Add wmf-certificates to each datahub container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/778224 (https://phabricator.wikimedia.org/T301453)
[09:10:50] (CR) Btullis: [C: +2] Add wmf-certificates to each datahub container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/778224 (https://phabricator.wikimedia.org/T301453) (owner: Btullis)
[09:23:56] (Merged) jenkins-bot: Add wmf-certificates to each datahub container [analytics/datahub] (wmf) - https://gerrit.wikimedia.org/r/778224 (https://phabricator.wikimedia.org/T301453) (owner: Btullis)
[09:36:16] btullis, razzi: Hey, do you have a prescription about the permission level for archive files (in /wmf/data/archive)? The permissions in this directory are inconsistent.
[09:36:16] The current Oozie job https://github.com/wikimedia/analytics-refinery/tree/master/oozie/util/archive_job_output sets the defaults to
[09:36:16] - 644 for file permission
[09:36:16] - and 022 for dir umask
[09:36:16] Here is an example of a result:
[09:36:17] /wmf/data/archive/unique_devices/per_project_family/2022/2022-03/unique_devices_per_project_family_daily-2022-02-01.gz
[09:36:18] I am currently keeping those defaults in the Airflow job. Shall I?
[09:52:55] (CR) Gehel: [C: -1] "See comment inline and feel free to ping me for more discussion." (2 comments) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/772027 (https://phabricator.wikimedia.org/T300029) (owner: Aqu)
[09:54:00] joal, ottomata: feel free to ping me about the comments in the CR above. I have concerns about thread safety.
[09:54:49] Note that those concerns are not strictly related to the change in this CR, I can remove my -1 if you think it makes more sense to address those in a different place / time
[09:55:26] aqu: cc '
[09:55:52] Thanks gehel for the comments - makes a lot of sense
[09:57:51] let me know if you want more discussion on the subject! Concurrency is hard!
[09:59:54] Thanks for the comments. I am wondering what kind of incoherence may occur. Also, a locking mechanism would defeat the purpose of caching here, which is why I didn't go this way. I will check Guava.
[10:00:54] (CR) Gehel: [C: -1] Fix: Prevent empty normalized host (1 comment) [analytics/refinery/source] - https://gerrit.wikimedia.org/r/772027 (https://phabricator.wikimedia.org/T300029) (owner: Aqu)
[10:01:26] aqu: note that ConcurrentHashMap is efficient in terms of locking (I added a comment about that as well).
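Editor's sketch, not code from the CR: the thread-safety concern raised here can be illustrated with a small synchronized LRU cache. ConcurrentHashMap keeps no eviction order, so this sketch instead wraps an access-ordered LinkedHashMap in Collections.synchronizedMap. In access-order mode even get() reorders entries, which is why reads need the same lock as writes. The class name SynchronizedLruCache and the maxEntries parameter are hypothetical, chosen only for illustration; the actual cache in analytics/refinery/source may be built differently.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not the class under review in the CR.
final class SynchronizedLruCache<K, V> {
    private final Map<K, V> map;

    SynchronizedLruCache(final int maxEntries) {
        // accessOrder=true: get() moves the entry to the end of the iteration order,
        // so a read is a structural modification and must be synchronized too.
        LinkedHashMap<K, V> lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                // Evict the least recently used entry once the cap is exceeded.
                return size() > maxEntries;
            }
        };
        // Every call on the wrapper holds a single lock on the underlying map.
        this.map = Collections.synchronizedMap(lru);
    }

    V get(K key) {
        return map.get(key);
    }

    void put(K key, V value) {
        map.put(key, value);
    }

    int size() {
        return map.size();
    }
}
```

With a capacity of 2, putting "a" and "b", reading "a", and then putting "c" evicts "b", because "b" is then the least recently used entry. The single lock serializes all access, which is the cost aqu is worried about; whether that cost matters for this cache is not something the log settles.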
[10:01:48] It would lose the LRU part of that cache, which is an issue in this case
[10:02:37] not synchronizing for both reads and writes in case of concurrent access is almost always an error
[10:09:57] otto: What are the plans concerning Airflow executors? Currently, localExecutors have access to the jars ( /srv/deployment/analytics/refinery/artifacts/refinery-job-shaded.jar). Can I expect the same if we switch to Kubernetes?
[10:14:14] If you want to dig into the details of the Java Memory Model, have a look at https://shipilev.net/blog/2016/close-encounters-of-jmm-kind/
[13:36:41] aqu: Regarding "Can I expect the same if we switch to Kubernetes?" - I would say at this point - probably, but we can think about it again nearer the time.
[13:37:30] We're some way away from running airflow executors in k8s (I think) so an HDFS path is fine for now.
[13:39:38] btullis: Thanks. You mean a local path?
[13:40:49] OH! I see what you mean now. :-)
[13:40:56] Airflow is going to exec something like: `java -cp /local/path/to/myjar-shaded.jar:/also/local/hadoop.jar MyClass --param=1`
[13:41:22] Under which user account does it run?
[13:41:30] analytics
[13:43:00] Currently on an-launcher1002 the refinery-job jar is here.
[13:43:56] OK, so the analytics user *does have* access to the shared HDFS path where the jars are deployed. I'm not immediately sure whether it is best to use the local path or the HDFS path.
[13:46:17] As in: `btullis@an-launcher1002:~$ sudo -u analytics kerberos-run-command analytics hdfs dfs -ls /wmf/refinery/current/artifacts/`
[13:49:24] Yes, it does. I am using the local file in order to avoid downloading the refinery-job-shaded (110MB) from HDFS each time this little piece of code is running.
[13:49:45] However, come to think of it, other people are probably better placed to answer the question about what to do for the best. Certainly a local path won't work under k8s, but that's not an issue yet.
[13:50:09] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:51:07] OK, got it. You're probably fine to proceed with a local file then, as long as we bear it in mind for the future.
[14:00:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:47:36] (CR) AGueyte: Add event_ipinfo_version to ipinfo_interaction schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417) (owner: Tchanders)
[14:48:57] Data-Engineering, Cassandra, User-Eevans: Properly add aqsloader user (w/ secrets) - https://phabricator.wikimedia.org/T305600 (LSobanski)
[15:05:33] (PS2) Tchanders: Add event_ipinfo_version to ipinfo_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417)
[15:05:49] (CR) Tchanders: Add event_ipinfo_version to ipinfo_interaction schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417) (owner: Tchanders)
[15:06:09] (CR) jerkins-bot: [V: -1] Add event_ipinfo_version to ipinfo_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417) (owner: Tchanders)
[15:59:23] mforns: I think those fixes could be cherrypicked to airflow-dags/main : https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/41/diffs?commit_id=3aa34b75922d16520d2a5f896a35def5ee872115 What do you think?
[16:08:40] aqu looking!
[16:42:30] aqu, thanks a lot for fixing that. I don't see how tests were passing for me.. change looks good to me! I left a couple comments regarding the use of macros, LMK your opinions :]
[16:44:47] not sure if the comments get attached to the actual commit in GitLab...
[17:10:24] (PS3) Tchanders: Add event_ipinfo_version to ipinfo_interaction schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/777816 (https://phabricator.wikimedia.org/T296417)
[17:13:07] Hey ottomata we introduced a typo into one of our schemas and are not sure what to do about it https://gerrit.wikimedia.org/r/c/schemas/event/secondary/+/777876
[17:13:11] Can you please advise?
[17:13:35] CI doesn't like us changing enums even if they were typos.. do we have to live with the typo for all eternity?
[18:19:42] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Automate kerberos credential creation and management to ease the creation of testing infrastructure - https://phabricator.wikimedia.org/T292389 (odimitrijevic) @elukey @Majavah Following up on this task, is the merging...
[18:44:29] Data-Engineering, Superset: Superset Timeout Logging - https://phabricator.wikimedia.org/T294772 (odimitrijevic)
[18:56:51] Analytics, Patch-For-Review: Decide whether to migrate from Presto to Trino - https://phabricator.wikimedia.org/T266640 (odimitrijevic)
[18:56:53] Data-Engineering, Data-Engineering-Kanban, Superset, Epic: Presto/Superset User Experience Improvement - https://phabricator.wikimedia.org/T294259 (odimitrijevic)
[18:57:53] Data-Engineering: Upgrade Presto to access UDF library improvements - https://phabricator.wikimedia.org/T295589 (odimitrijevic)
[18:57:55] Data-Engineering: Try to improve the LDAP integration for Superset user account creation - https://phabricator.wikimedia.org/T297120 (odimitrijevic)
[18:57:57] Data-Engineering, Data-Engineering-Kanban, Superset, Epic: Presto/Superset User Experience Improvement - https://phabricator.wikimedia.org/T294259 (odimitrijevic)
[18:57:59] Data-Engineering, Superset: Superset Timeout Logging - https://phabricator.wikimedia.org/T294772 (odimitrijevic)
[18:58:01] Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (odimitrijevic)
[18:59:11] Data-Engineering, Data-Engineering-Kanban, Superset, Epic: Presto/Superset User Experience Improvement - https://phabricator.wikimedia.org/T294259 (odimitrijevic) Open→Resolved a:odimitrijevic Closing epic given that we are focusing on other priorities. There are outstanding tasks whi...
[19:01:22] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Requirements - https://phabricator.wikimedia.org/T294258 (odimitrijevic) Requirements and evaluation have been posted on wikitech: https://wikitech.wikimedia.org/wiki/Data_Catalog_Application_Evaluation/Rubric
[19:01:32] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Epic: Data Catalog Requirements - https://phabricator.wikimedia.org/T294258 (odimitrijevic) Open→Resolved a:odimitrijevic
[19:15:45] Data-Engineering, Data-Engineering-Kanban, Product-Analytics, Superset: Help with data that's not appearing on charts - https://phabricator.wikimedia.org/T301895 (odimitrijevic) a:BTullis→None @Iflorez Is this still a problem?
[19:16:28] Data-Engineering-Radar, Data-Services, cloud-services-team (Kanban): Upgrade clouddb* hosts to Bullseye - https://phabricator.wikimedia.org/T299480 (razzi) I'll do this next week. To my knowledge these hosts are pretty much the same as the dbstore hosts I did this week for https://phabricator.wikimed...
[19:24:53] Data-Engineering-Kanban, Airflow, GitLab (CI & Job Runners): Allow a shared, protected runner for the data-engineering group in GitLab - https://phabricator.wikimedia.org/T295045 (BTullis) Pausing this task, since we are not currently working on it. I think that it would still be useful to have a mee...
[19:27:11] Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (BTullis) Pausing this task since the database migration has been de-prioritized in favour of other, more pressing...
[20:45:58] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Configure LDAP authentication for the DataHub frontend - https://phabricator.wikimedia.org/T301462 (BTullis) Got the first LDAP enabled login working on the prototype (stat1008) as well as a CR to enable it for the MVP. {F...
[22:09:56] Data-Engineering, Data-Engineering-Kanban, Data-Catalog, Patch-For-Review: Configure LDAP authentication for the DataHub frontend - https://phabricator.wikimedia.org/T301462 (BTullis) I have tried really hard to get the following filter to work to restrict access to the nda or wmf groups, but it...
[23:14:48] Analytics-Radar, SRE, Traffic-Icebox, User-jbond: Fix geoip updaters for new MaxMind hashed keys by 2019-08-15 - https://phabricator.wikimedia.org/T228533 (Dzahn) @BBlack This sounds like a duplicate of T303464 (and/or /T302864) to me. Maybe you can just merge it.