[06:25:51] ok we have a mess this morning
[06:30:14] Hmm joal anything I can help with? I'm still up for a bit
[06:30:32] Does it have to do with the change I merged? :X
[06:31:17] Hi razzi - The hdfs-cleaner deployed yesterday didn't work as expected and deleted files needed for gobblin to work normally
[06:31:32] razzi: nothing you could have found by looking at code -
[06:31:47] razzi: the code misbehaved for a reason I don't yet understand
[06:32:21] But now we have every single topic pulled from gobblin missing data for hour 02 UTC
[06:32:38] I'm gonna devise a fix plan
[06:37:10] if you need help I am around :)
[06:37:30] Thanks a lot razzi and elukey
[06:37:44] I'm gonna devise the plan, and ask you to review
[06:39:10] Ok thanks elukey I'll be back in ~9 hours after much needed sleep! :)
[06:39:33] ack razzi - have a good night :)
[06:43:44] elukey: if you have a minute - https://etherpad.wikimedia.org/p/analytics-gobblin-mess
[06:46:38] before starting - do we need to stop/cleanup the timer that deleted data as a precaution
[06:46:41] ?
[06:47:16] hm
[06:47:56] probably yes
[06:48:53] that is
[06:48:54] elukey@an-launcher1002:~$ sudo systemctl list-timers | grep hdfs-cleaner-gobblin
[06:48:57] Thu 2021-10-21 23:45:00 UTC 16h left Wed 2021-10-20 23:45:00 UTC 7h ago hdfs-cleaner-gobblin.timer hdfs-cleaner-gobblin.service
[06:49:01] right?
[06:49:17] so there is time, but if we don't remove it, tomorrow you'll have to do the same mess probably :D
[06:49:22] it is yes - it'll not run before tonight - let me send a patch to absent it from puppet
[06:49:53] ok for you elukey --^ ?
[06:50:43] joal: wait 10 sec I have one ready
[06:50:49] ack elukey
[06:54:32] https://gerrit.wikimedia.org/r/c/operations/puppet/+/732610
[06:58:55] ok all cleaned up
[06:59:17] I am probably not super familiar with Gobblin (and don't have a lot of caffeine flowing yet :D)
[06:59:27] but how do you retrieve the deleted state?
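[Editor's note] The `systemctl list-timers` row pasted above is awkward to read at a glance. A small parser (a sketch, assuming the default column layout NEXT / LEFT / LAST / PASSED / UNIT / ACTIVATES, as in the row shown) pulls out the next run time and the unit names:

```python
import re

def parse_timer_line(line):
    """Parse one `systemctl list-timers` row into (next_run, timer_unit, service_unit).

    Assumes the row layout seen above: an absolute NEXT timestamp, a LEFT
    column like "16h left", an absolute LAST timestamp, a PASSED column like
    "7h ago", then the timer unit and the service it activates.
    """
    m = re.match(
        r"(?P<next>\w{3} \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC)\s+"
        r"(?P<left>\S+ \S+)\s+"
        r"(?P<last>\w{3} \d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC)\s+"
        r"(?P<passed>\S+ \S+)\s+"
        r"(?P<timer>\S+\.timer)\s+(?P<service>\S+\.service)",
        line,
    )
    if not m:
        raise ValueError(f"unrecognised list-timers row: {line!r}")
    return m.group("next"), m.group("timer"), m.group("service")
```

With the row above, this shows the timer would not fire again for ~16 hours, which is why absenting it from puppet before tonight was enough.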
[06:59:27] ok - I'm not even sure how that thing broke :( I have ideas, but no confirmation
[06:59:59] The cleaner doesn't skip trash - the deleted states are in the trash (I am currently copying them to my folder)
[07:00:00] (if it is too long to explain please proceed, I don't want to derail/slow down)
[07:00:07] ahhhh good
[07:00:11] yes :)
[07:01:16] so you want to test the gobblin jobs with the deleted states in your dir, verify that all is sound, run the jobs and transfer the data to prod correctly
[07:01:22] so that webrequests etc. will be unblocked
[07:02:12] Almost - I wish to test jobs from states in my folder - correct
[07:02:32] Then I wish to run the tested jobs to prod-destination (once asserted correct) - yes
[07:03:13] finally we'll have to manually rerun a bunch of jobs - for webrequest it's easy, they failed on missing data, so a rerun should do - for events we need to re-refine
[07:03:29] +1 then looks sound
[07:03:43] ack elukey thanks
[07:04:00] Will proceed gently, trying not to over-mess :)
[07:04:19] do you need a rubber duck while you do it? Or do you prefer to do it on your own?
[07:04:54] I can always do with some of your help :) But I know you have other things to do :)
[07:25:09] here I am sorry
[07:25:18] I can join if needed :)
[07:26:16] thanks elukey - I'll ping you if I feel alone ;)
[07:26:44] ack
[08:34:49] I am also here now, in case I can help.
[08:36:17] ack btullis thanks a lot
[08:36:50] I'm moving gently - I have managed to get a working webrequest job - will check data somehow and will then unlock the jobs by copying data
[08:37:19] ack
[08:41:08] !log Rerun webrequest-load jobs for hour 2021-10-21T02:00
[08:41:11] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[08:44:43] I didn't know that HDFS trash was a thing until now, but I've just read: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster#recover_files_deleted_by_mistake_using_the_hdfs_CLI_rm_command?
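[Editor's note] The recovery above works because `hdfs dfs -rm` (without `-skipTrash`) moves files into a per-user trash directory rather than deleting them. A sketch of the path mapping, assuming the default HDFS trash layout of `/user/<user>/.Trash/Current/<original absolute path>`:

```python
def hdfs_trash_path(original_path, user):
    """Return where a file deleted with `hdfs dfs -rm` (without -skipTrash)
    is expected to land, assuming the default HDFS trash layout:
    /user/<user>/.Trash/Current/<original absolute path>.

    Note: a trash checkpoint later renames Current/ to a timestamped
    directory, so this covers only the recent-deletion case.
    """
    if not original_path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return f"/user/{user}/.Trash/Current{original_path}"
```

Recovery is then an `hdfs dfs -cp` (or `-mv`) from that trash path back to the original location, which is how the gobblin state files were copied out above. (The path argument here is illustrative; the actual gobblin state paths are not shown in the log.)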
[10:09:51] (03Abandoned) 10Jhernandez: POC: Using a to show the dbs [analytics/quarry/web] - 10https://gerrit.wikimedia.org/r/668544 (owner: 10Jhernandez)
[10:27:20] Spark 3.2 is out - and there are some super cool improvements :)
[10:35:41] !log Re-refine netflow data after gobblin pulled data fix
[10:35:44] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[10:44:30] joal: I want to get turnilo and superset permissions for ejoseph - is there some kind of a ticket template/process for that? I forgot how it looked for me
[10:45:18] zpapierski: IIRC he needs an LDAP account, and the analytics-privatedata-user group
[10:45:58] LDAP he has, so I understand I need to get him the analytics-privatedata-user group - do you remember how it's done?
[10:47:09] zpapierski: Needs to be done through a ticket to SRE IIRC - docs are here: https://wikitech.wikimedia.org/wiki/Analytics/Data_access
[10:47:20] ah, perfect - thanks
[11:26:22] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[11:51:52] 10Analytics, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Sending events to `maps.tiles_change` stream is failing - https://phabricator.wikimedia.org/T294011 (10Jgiannelos)
[11:52:41] 10Analytics, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Sending events to `maps.tiles_change` stream is failing - https://phabricator.wikimedia.org/T294011 (10Jgiannelos)
[12:39:25] (03CR) 10DCausse: Spark JsonSchemaConverter - additionalProperties with schema is always a MapType (031 comment) [analytics/refinery/source] - 10https://gerrit.wikimedia.org/r/629406 (https://phabricator.wikimedia.org/T263466) (owner: 10Ottomata)
[12:51:46] (03PS9) 10DCausse: Add fragment/mediawiki/revision/slot [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195)
[12:52:26] (03CR) 10jerkins-bot: [V: 04-1] Add fragment/mediawiki/revision/slot [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195) (owner: 10DCausse)
[12:55:00] (03PS10) 10DCausse: Add fragment/mediawiki/revision/slot [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195)
[13:00:19] (03CR) 10DCausse: Add fragment/mediawiki/revision/slot (031 comment) [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195) (owner: 10DCausse)
[13:17:06] PROBLEM - Check unit status of gobblin-event_default on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit gobblin-event_default https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:19:13] joal: anything that I can do to help with this or the previous gobblin issue? --^
[13:19:38] o/
[13:19:47] joal just saw your note, still checking email, etc. let me know if i can help too!
[13:19:52] hm
[13:20:13] Ok I think I know what has happened
[13:20:17] batcave?
[13:21:12] sure
[13:39:08] 10Analytics-Radar, 10Fundraising-Backlog, 10Product-Analytics, 10Wikipedia-iOS-App-Backlog, and 2 others: Understand impact of Apple's Relay Service - https://phabricator.wikimedia.org/T289795 (10TheDJ) >>! In T289795#7382434, @GeneralNotability wrote: > a list of current egress points can be found at http...
[13:39:31] !log btullis@an-launcher1002:~$ sudo systemctl restart gobblin-event_default
[13:39:33] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[13:50:14] RECOVERY - Check unit status of gobblin-event_default on an-launcher1002 is OK: OK: Status of the systemd unit gobblin-event_default https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[13:50:20] 10Analytics, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Sending events to `maps.tiles_change` stream is failing - https://phabricator.wikimedia.org/T294011 (10Ottomata) Ah right, sorry. eventgate-main only requests stream config on startup, so we need to just restart the service...will do...
[13:54:51] 10Analytics, 10Analytics-Kanban: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10BTullis)
[13:59:04] 10Analytics, 10Analytics-Kanban: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10BTullis) This request has now been granted. We have a new project on WMCS named `data-engineering`. I have added @Ottomata and @razzi as project admins. @elukey was also an ad...
[14:00:20] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:05:33] !log rerun refine_eventlogging_analytics refine_eventlogging_legacy and refine_event with -ignore-done-flag=true --since=2021-10-21T01:00:00 --until=2021-10-21T04:00:00 for backfill of missing data after gobblin problems
[14:05:35] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[14:11:53] I'm a little confused. I'm not sure how I'd set HADOOP_HEAPSIZE when using a library like PyHive. There doesn't seem to be a corresponding configuration option (https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties), but would I just set it as an environment variable, and when a Hive client is started by PyHive it would just use that?
[14:12:44] Hello, is there a good way to get ORES articletopic scores for all articles on a given wiki? I'm thinking about "for each article on the wiki that's in event_sanitized.mediawiki_revision_score, keep only the row with the highest rev_id", but perhaps there's already a way to get exactly that kind of information?
[14:12:58] (obviously, I could query ores.wikimedia.org directly, but for all articles, that'd... take a while)
[14:13:24] PROBLEM - Check unit status of produce_canary_events on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:17:45] urbanecm: good question, I don't know of a dump of ORES scores anywhere, elukey is there such a thing? If there is I'll add a line about it at https://wikitech.wikimedia.org/wiki/ORES
[14:18:37] as far as I know there is no such thing
[14:18:45] there's https://analytics.wikimedia.org/published/datasets/one-off/ores/scores_dumps/damaging_goodfaith_enwiki/, but I have no idea who created that, or how :)
[14:19:03] (and it's a different set of scores, although same table I guess)
[14:19:30] that one was requested as a one-off in a task, but it should contain the scores for the changes done in a certain period of time
[14:19:39] with changes I meant edit
[14:19:41] *edits
[14:20:18] since change prop asks for a score for every edit on all wikis, storing those in kafka
[14:25:38] 10Analytics: [Airflow] Implement DAG that syncs archiva packages to HDFS - https://phabricator.wikimedia.org/T294024 (10mforns)
[14:25:54] 10Analytics: [Airflow] Implement DAG that syncs archiva packages to HDFS - https://phabricator.wikimedia.org/T294024 (10mforns)
[14:25:56] 10Analytics, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10mforns)
[14:29:07] 10Analytics: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10mforns)
[14:29:20] 10Analytics: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10mforns)
[14:29:22] 10Analytics, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10mforns)
[14:35:11] 10Analytics, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Sending events to `maps.tiles_change` stream is failing - https://phabricator.wikimedia.org/T294011 (10Jgiannelos) 05Open→03Resolved Thanks, looks like it's working now. I got some canary events and I also published a couple of test events...
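[Editor's note] urbanecm's "keep only the row with the highest rev_id per article" idea amounts to a simple max-by reduction. A sketch with hypothetical field names (`page_id`, `rev_id`, `score` stand in for the relevant columns of event_sanitized.mediawiki_revision_score; in Hive/Spark this would usually be a window function partitioned by page and ordered by rev_id):

```python
def latest_scores(rows):
    """Keep, per page, only the score row with the highest rev_id.

    `rows` is an iterable of dicts with (hypothetical) keys
    'page_id', 'rev_id', 'score'. Returns {page_id: row}.
    """
    latest = {}
    for row in rows:
        page = row["page_id"]
        # A later revision wins; ties cannot occur since rev_ids are unique.
        if page not in latest or row["rev_id"] > latest[page]["rev_id"]:
            latest[page] = row
    return latest
```

This single pass is exactly what a `ROW_NUMBER() OVER (PARTITION BY page_id ORDER BY rev_id DESC) = 1` filter would compute on the cluster, which is the usual way to avoid hitting ores.wikimedia.org once per article.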
[14:35:28] RECOVERY - Check unit status of produce_canary_events on an-launcher1002 is OK: OK: Status of the systemd unit produce_canary_events https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[14:36:08] 10Analytics, 10Maps, 10Product-Infrastructure-Team-Backlog (Kanban): Sending events to `maps.tiles_change` stream is failing - https://phabricator.wikimedia.org/T294011 (10Ottomata) Done!
[14:59:53] (03CR) 10Ppchelko: [C: 04-1] "last nitpick." [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/731006 (https://phabricator.wikimedia.org/T293195) (owner: 10DCausse)
[15:02:35] Hi team, good morning from the MIDWEST USA!!
[15:02:39] (crowd cheers)
[15:03:11] so urbanecm I guess we should open up a task to create such a dump maybe, querying the API for *all* articles doesn't seem like a great idea
[15:03:16] morning razzi, lol
[15:03:45] razzi: you wanna help me troubleshoot something?
[15:03:57] I'm down
[15:04:03] ok, to the batcave!
[15:04:04] batcave?
[15:04:05] !!!
[15:04:14] My camera isn't working, gonna reboot real quick
[15:04:16] batcave's busy. to the TARDIS!
[15:09:26] (03CR) 10MNeisler: [C: 03+1] talk_page_event schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: 10DLynch)
[15:12:47] joal: all 3 refine reruns succeeded
[15:17:08] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Snapshot and Reload cassandra2 pageview_per_article data table from all 12 instances - https://phabricator.wikimedia.org/T291472 (10BTullis) There are still compactions running on the new cluster, although they have almost completed. We have decided to wai...
[15:20:42] ottomata: we need a package on debian
[15:20:47] on stat boxes
[15:20:52] libsasl
[15:21:53] razzi: https://packages.debian.org/buster/libsasl2-2 ?
[15:22:16] hmm
[15:22:17] razzi: if just stat boxes
[15:22:17] profile::analytics::cluster::packages::statistics
[15:22:20] add to that
[15:22:21] unclear, apparently that's already got it
[15:23:26] we need libsasl2-dev
[15:23:28] ottomata:
[15:23:33] for the headers
[15:23:37] aye
[15:23:41] ok yeah I'll add to that
[15:23:44] coo
[15:23:57] Also installed, at least on stat1004
[15:24:00] https://www.irccloud.com/pastebin/wIzwX3GF/
[15:24:30] btullis: yep, we need dev headers
[15:24:36] libsasl2-dev
[15:24:44] Oops, sorry, I missed the -dev on my command. Apologies
[15:25:05] hm, so why does pip install fail with:
[15:25:08] https://www.irccloud.com/pastebin/EUZymed8/
[15:25:14] Strangely though I see ensure_packages libsasl2-dev
[15:25:17] # For pyhive
[15:25:18] 'libsasl2-dev',
[15:25:19] Ah, no I didn't, I just pasted the wrong thing.
[15:25:22] https://www.irccloud.com/pastebin/B2DmeHI1/
[15:25:35] yeah idk why it isn't finding the header
[15:25:43] right... so why's it failing to pip install sasl or pip install pyhive[hive]
[15:25:48] razzi: you are in anaconda
[15:25:58] aha!
[15:26:14] there is all the mess outlined in https://phabricator.wikimedia.org/T292699
[15:26:31] I think that you'd need to install libsasl via conda
[15:26:38] woah
[15:27:40] mmm should be `conda install -c conda-forge cyrus-sasl `
[15:27:42] razzi: --^
[15:28:19] or not, I don't see the headers in the webpage related to it, maybe there is another
[15:28:49] well let's try it first :)
[15:29:02] elukey: what's the deal with cyrus? I never heard of it
[15:29:25] it is an implementation of sasl IIRC
[15:29:30] cool, trying
[15:29:59] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Patch-For-Review, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (10BTullis) I have confirmed the presence of the first rule that has been added, by using an SSH tunnel and checkin...
[15:30:00] ah wait it was milimetric asking, sorry :D
[15:30:41] in theory though simply installing will not work miriam
[15:30:43] err milimetric
[15:31:02] right, it didn't :)
[15:31:14] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Patch-For-Review, and 2 others: Migrate analytics cluster alerts from Icinga to AlertManager - https://phabricator.wikimedia.org/T293399 (10BTullis)
[15:31:15] I think you'd also need export CPPFLAGS="${CPPFLAGS} -isystem ${CONDA_PREFIX}/include"
[15:32:07] worked elukey, thank you!
[15:32:15] I'm gonna add a little section to the docs and point to your task
[15:32:26] Who's Miriam? 🙂
[15:33:43] milimetric: nice! In theory when we package the new anaconda-wmf deb we'll get rid of the extra export CPP...
[15:34:44] btullis: Miriam Redi! (ciao Miriam!)
[15:35:33] ciao ciao elukey and btullis!!
[15:35:43] k, added this: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Anaconda#Installing_packages_into_your_user_conda_environment
[15:36:08] Hello. Nice to meet you Miriam.
[16:00:15] 10Analytics, 10Product-Analytics, 10Structured-Data-Backlog: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (10JAllemandou) Moving back to incoming as there is demand from @cchen to prioritize.
[16:00:43] (03CR) 10Ottomata: talk_page_event schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: 10DLynch)
[16:03:32] mforns: standup?
[16:03:35] razzi: standuo?
[16:03:42] uop!
[16:05:57] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (10Ottomata) a:03Ottomata
[16:24:15] 10Analytics, 10Analytics-Kanban: Check home/HDFS leftovers of mholloway-shell - https://phabricator.wikimedia.org/T291353 (10odimitrijevic) @jlinehan @dr0ptp4kt putting this on your radar again.
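[Editor's note] The fix that unblocked `pip install pyhive[hive]` above was twofold: install the SASL headers into the conda env (`conda install -c conda-forge cyrus-sasl`), then make the compiler look there via the `export CPPFLAGS` line elukey suggested. A sketch of that environment tweak as a helper for driving pip from Python (the path values in the test are made up; only CONDA_PREFIX being set inside an activated env is assumed):

```python
import os

def conda_build_env(environ=None):
    """Return a copy of the environment with CPPFLAGS extended so that
    C extensions built by pip (e.g. the `sasl` dependency of pyhive)
    find headers installed into the active conda environment, mirroring
    `export CPPFLAGS="${CPPFLAGS} -isystem ${CONDA_PREFIX}/include"`.
    """
    env = dict(environ if environ is not None else os.environ)
    prefix = env.get("CONDA_PREFIX")
    if not prefix:
        raise RuntimeError("no active conda environment (CONDA_PREFIX unset)")
    extra = f"-isystem {prefix}/include"
    # Preserve any flags the user already exported.
    env["CPPFLAGS"] = (env.get("CPPFLAGS", "") + " " + extra).strip()
    return env
```

Passing this dict as `env=` to a `subprocess.run(["pip", "install", "sasl"], env=...)` call reproduces the manual `export` from the shell session.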
[16:30:24] joal: In answer to your question, yes the server that appears to be very slow compacting *is* one of the two that was restarted recently.
[16:30:27] https://www.irccloud.com/pastebin/8kJoUR5w/
[16:34:04] hm
[16:34:42] btullis: would you mind triple checking this server's log? I wonder if it could be blocked or anything like the other one
[16:37:37] It's still producing log messages, but slowly. No recent stack traces.
[16:37:48] ack btullis thanks
[16:47:58] (03CR) 10DLynch: talk_page_event schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: 10DLynch)
[16:50:02] (03CR) 10DLynch: talk_page_event schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: 10DLynch)
[17:01:12] (03CR) 10DLynch: talk_page_event schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: 10DLynch)
[17:08:57] 10Analytics, 10Product-Analytics, 10Structured-Data-Backlog: Create a Commons equivalent of the wikidata_entity table in the Data Lake - https://phabricator.wikimedia.org/T258834 (10JAllemandou) p:05Medium→03Triage
[17:09:22] (03PS5) 10DLynch: talk_page_event schema [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076)
[17:12:23] (03CR) 10Ottomata: talk_page_event schema (031 comment) [schemas/event/secondary] - 10https://gerrit.wikimedia.org/r/731333 (https://phabricator.wikimedia.org/T286076) (owner: 10DLynch)
[17:14:38] 10Analytics, 10Analytics-Kanban: HDFS check topology alert is currently broken - https://phabricator.wikimedia.org/T292846 (10BTullis) 05Open→03Resolved
[17:15:39] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Automate kerberos credential creation and management to ease the creation of testing infrastructure - https://phabricator.wikimedia.org/T292389 (10BTullis)
[17:16:16] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Patch-For-Review: Standardize the stats system user uid - https://phabricator.wikimedia.org/T291384 (10Ottomata)
[17:16:29] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Data-Engineering, and 2 others: Decommission EventLogging backend components by migrating to MEP - https://phabricator.wikimedia.org/T238230 (10Ottomata)
[17:16:34] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Data-Engineering, and 3 others: Migrate legacy metawiki schemas to Event Platform - https://phabricator.wikimedia.org/T259163 (10Ottomata)
[17:16:41] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Patch-For-Review: Refactor analytics-meta MariaDB layout to multi instance with failover - https://phabricator.wikimedia.org/T284150 (10Ottomata)
[17:16:48] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Improve Refine bad data handling - https://phabricator.wikimedia.org/T289003 (10Ottomata)
[17:16:50] 10Analytics, 10Data-Engineering: Create aggregate alarms for Hadoop daemons running on worker nodes - https://phabricator.wikimedia.org/T287027 (10BTullis)
[17:17:03] 10Analytics, 10Analytics-EventLogging, 10Analytics-Kanban, 10Data-Engineering, and 3 others: Determine which remaining legacy EventLogging schemas need to be migrated or decommissioned - https://phabricator.wikimedia.org/T282131 (10Ottomata)
[17:17:31] 10Analytics, 10Analytics-Kanban: hdfs directory for analytics-research - https://phabricator.wikimedia.org/T290918 (10Ottomata) 05Open→03Resolved
[17:17:33] 10Analytics, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10Ottomata)
[17:17:50] 10Analytics, 10Data-Engineering: Create aggregate alarms for Hadoop daemons running on worker nodes - https://phabricator.wikimedia.org/T287027 (10BTullis) This will be done as part of {T293399}
[17:18:05] 10Analytics, 10Platform Team Workboards (Image Suggestion API): Airflow collaborations - https://phabricator.wikimedia.org/T282033 (10Ottomata)
[17:18:07] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: SPIKE - Will Hadoop 3 container support help us for Airflow deployment pipelines? - https://phabricator.wikimedia.org/T288247 (10Ottomata) 05Open→03Resolved
[17:18:32] 10Analytics: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10razzi) My personal opinion is that it should be called airflow-config so that the file directory structure will look like `airflow-config/dags/my_cool_dag` etc.
[17:18:35] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Move the Analytics/DE testing infrastructure to Pontoon - https://phabricator.wikimedia.org/T292388 (10BTullis)
[17:18:43] 10Analytics, 10Analytics-Kanban, 10Discovery-Search, 10Patch-For-Review: Publish both shaded and unshaded artifacts from analytics refinery - https://phabricator.wikimedia.org/T217967 (10Ottomata) 05Open→03Resolved
[17:19:01] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: [Airflow] Create repository for Airflow DAGs - https://phabricator.wikimedia.org/T294026 (10odimitrijevic) p:05Triage→03High a:03mforns
[17:19:31] 10Analytics: [Airflow] Implement DAG that syncs archiva packages to HDFS - https://phabricator.wikimedia.org/T294024 (10odimitrijevic) p:05Triage→03High
[17:20:04] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: High volume mediawiki analytics events camus import is lagging - https://phabricator.wikimedia.org/T233718 (10Ottomata)
[17:20:12] 10Analytics: We should get an alarm for partitions that have no data for topics that have data influx at all times, most of the mediawiki.* - https://phabricator.wikimedia.org/T250699 (10Ottomata) 05Open→03Resolved canary events + monitoring exist.
[17:20:15] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform, 10serviceops: Enable envoy tls proxy logging from eventgate - https://phabricator.wikimedia.org/T291856 (10Ottomata) 05Open→03Resolved
[17:20:42] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Refine drops $schema field values - https://phabricator.wikimedia.org/T255818 (10Ottomata)
[17:21:27] 10Analytics, 10Analytics-Kanban, 10Data-Engineering, 10Event-Platform, 10Platform Team Initiatives (Modern Event Platform (TEC2)): Allow disabling/enabling configured streams via wgEventStreams config - https://phabricator.wikimedia.org/T259712 (10Ottomata)
[17:22:12] 10Analytics, 10Analytics-SWAP: Users should be able to read their jupyter instance logs - https://phabricator.wikimedia.org/T198764 (10Ottomata)
[17:22:17] 10Analytics, 10Analytics-Kanban: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (10JAllemandou)
[17:22:19] 10Analytics, 10Analytics-Kanban, 10Patch-For-Review: Jupyter notebook logs should appear in Logstash - https://phabricator.wikimedia.org/T288348 (10Ottomata)
[17:22:30] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Repair and reload all cassandra-2 data tables but the 2 big ones - https://phabricator.wikimedia.org/T291469 (10JAllemandou) 05In progress→03Resolved Resolving!
[17:22:59] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Check AQS with cassandra (serving + data) - https://phabricator.wikimedia.org/T290068 (10JAllemandou)
[17:23:21] 10Analytics-Kanban: Analytics Hardware for Fiscal Year 2020/2021 - https://phabricator.wikimedia.org/T255145 (10Ottomata)
[17:23:24] 10Analytics-Clusters, 10Analytics-Kanban, 10Patch-For-Review: Set up an-web1001 and decommission thorium - https://phabricator.wikimedia.org/T285355 (10Ottomata) 05Open→03Resolved
[17:23:45] 10Analytics, 10Analytics-Kanban, 10Event-Platform, 10Metrics-Platform, and 2 others: wgEventStreams (EventStreamConfig) should support per wiki overrides - https://phabricator.wikimedia.org/T277193 (10Ottomata) 05Open→03Resolved
[17:25:30] 10Analytics, 10Data-Engineering, 10Event-Platform, 10Patch-For-Review: Users should run explicit commands to materialize schema versions, rather than using magic git hooks - https://phabricator.wikimedia.org/T290074 (10Ottomata)
[17:25:51] 10Analytics, 10SRE, 10SRE Observability (FY2021/2022-Q2): statsd and gunicorn metrics for superset - https://phabricator.wikimedia.org/T293761 (10odimitrijevic) p:05Triage→03Medium
[17:26:56] 10Analytics, 10Data-Engineering: Make it possible to use anaconda + stacked conda envs for Airflow executors - https://phabricator.wikimedia.org/T288271 (10Ottomata)
[17:26:58] 10Analytics: [Airflow] Implement DAG that syncs archiva packages to HDFS - https://phabricator.wikimedia.org/T294024 (10Ottomata)
[17:28:14] 10Analytics, 10Product-Analytics, 10wmfdata-python: Upstream relevant parts of wmfdata-python into refinery - https://phabricator.wikimedia.org/T293700 (10Ottomata) Related: {T286743}
[17:28:24] 10Analytics: Use corosync and pacemaker for presto coordinator active/standby configuration - https://phabricator.wikimedia.org/T287967 (10BTullis)
[17:28:30] joal found the timestamp kafka thing:
[17:28:31] https://phabricator.wikimedia.org/T282887
[17:33:19] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Write document about making Superset fast enough - https://phabricator.wikimedia.org/T294046 (10JAllemandou)
[17:34:09] ottomata: We should try to make this happen --^ !
[17:34:12] 10Analytics, 10Analytics-Kanban, 10Product-Analytics, 10wmfdata-python: wmfdata-python's Hive query output includes logspam - https://phabricator.wikimedia.org/T275233 (10Milimetric) https://github.com/wikimedia/wmfdata-python/pull/23
[17:36:47] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Write document about making Superset fast enough - https://phabricator.wikimedia.org/T294046 (10JAllemandou) a:03JAllemandou
[17:39:38] 10Analytics: Reduce superset timeouts problem - https://phabricator.wikimedia.org/T294048 (10JAllemandou)
[17:40:18] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Write document about making Superset fast enough - https://phabricator.wikimedia.org/T294046 (10JAllemandou)
[17:40:21] 10Analytics: Reduce superset timeouts problem - https://phabricator.wikimedia.org/T294048 (10JAllemandou)
[17:41:01] 10Analytics: Reduce superset timeouts problem - https://phabricator.wikimedia.org/T294048 (10JAllemandou) p:05Triage→03High
[17:42:57] 10Analytics, 10Analytics-Kanban: Purge gobblin files - https://phabricator.wikimedia.org/T287084 (10JAllemandou) Back to "In Progress" to assess whether the deletion script is stable enough and doesn't break Gobblin on a regular basis.
[17:45:47] 10Analytics: Fix gobblin not writing _IMPORTED flags when runs don't overlap hours - https://phabricator.wikimedia.org/T286343 (10JAllemandou) a:05JAllemandou→03None
[17:58:48] 10Analytics, 10DC-Ops, 10SRE, 10ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) a:03Jclark-ctr
[19:16:49] (03PS1) 10Ottomata: Fix maps.tiles_change schema required field [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/732767 (https://phabricator.wikimedia.org/T293366)
[19:17:58] (03CR) 10Ottomata: [C: 03+2] Fix maps.tiles_change schema required field [schemas/event/primary] - 10https://gerrit.wikimedia.org/r/732767 (https://phabricator.wikimedia.org/T293366) (owner: 10Ottomata)
[19:20:43] 10Analytics, 10Discovery, 10Event-Platform, 10SRE, 10Platform Team Workboards (Clinic Duty Team): Avoid accepting Kafka messages with whacky timestamps - https://phabricator.wikimedia.org/T282887 (10Ottomata) This happened today, somehow there were recentchange events with timestamps from around 2007 in...
[22:48:36] 10Analytics, 10Analytics-Kanban, 10Data-Engineering: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (10razzi) Turns out Presto has a built-in query log in their internal table `system.runtime.queries`: {F34705387} Example usage (remember to kinit first): `razzi@stat1005:/srv/home/ra...
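[Editor's note] The T269832 comment above points at Presto's built-in `system.runtime.queries` table. A small helper composing a query against it (a sketch: the column names `query_id`, `state`, `created`, `query` follow the Presto docs for this table, but verify them against the deployed Presto version before relying on them):

```python
def recent_queries_sql(limit=20):
    """Compose a query against Presto's built-in system.runtime.queries
    table, which records queries the coordinator has seen, most recent
    first when ordered by the `created` timestamp column.
    """
    if limit < 1:
        raise ValueError("limit must be positive")
    return (
        "SELECT query_id, state, created, query "
        "FROM system.runtime.queries "
        f"ORDER BY created DESC LIMIT {limit}"
    )
```

The resulting string can be run through whatever Presto client is at hand (e.g. the presto CLI on a stat box, after kinit as the task comment notes).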