[00:01:42] (PS3) Jdlrobson: Add `init` as a valid enum for action field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738)
[00:01:59] (CR) Jdlrobson: "New patch up." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[00:17:42] Analytics-Radar, Product-Analytics, Editing-team (Tracking): How often do people try to edit on mobile devices, using the desktop site, at the English Wikipedia? - https://phabricator.wikimedia.org/T288972 (ppelberg)
[00:17:52] Analytics-Data-Quality, Analytics-EventLogging, Analytics-Radar, Editing-team, and 3 others: WikiEditor records all edits as platform = desktop in EventLogging - https://phabricator.wikimedia.org/T249944 (ppelberg)
[00:58:33] (PS1) Jdlrobson: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777)
[00:59:06] (CR) jerkins-bot: [V: -1] Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[01:14:25] (PS2) Jdlrobson: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777)
[01:14:59] (CR) Jdlrobson: "The docs on https://wikitech.wikimedia.org/wiki/Event_Platform/Instrumentation_How_To#Evolving were super helpful." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[02:10:40] (CR) Jdlrobson: "Who is responsible for merging this one?" [schemas/event/primary] - https://gerrit.wikimedia.org/r/700244 (owner: Gergő Tisza)
[03:32:06] Analytics: Superset annotation text overlaps illegibly - https://phabricator.wikimedia.org/T279738 (razzi) Ah sorry @nettrom_WMF for the premature close; I forgot that annotations were something you had to add; I was confusing them with the hover as you look at a chart. Thanks for the examples
[04:30:30] PROBLEM - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is CRITICAL: CRITICAL: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[07:23:15] !log `apt-get clean` on stat1006 to free some space (root partition full)
[07:23:18] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[07:33:08] ah nice, /tmp is ~60G
[07:33:57] 11G /tmp/blockmgr-618da692-6ff1-4485-8638-3aa4cb9cdd61
[07:33:57] 37G /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8
[07:34:10] elukey@stat1006:/$ ls -ld /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8
[07:34:13] drwxr-xr-x 66 iflorez wikidev 4096 Nov 8 19:44 /tmp/blockmgr-20fe4b2b-31fb-4a85-b5b1-bebe254120f8
[07:34:20] probably spark related?
[07:35:30] it seems to be the spark.local.dir
[07:35:54] maybe on stat100x the option could be moved to /srv/spark-tmp?
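A minimal sketch of the relocation floated above, assuming the option were set per-session (in practice it would more likely be set cluster-wide, e.g. in spark-defaults.conf on the stat100x hosts via puppet); the path /srv/spark-tmp is the one proposed in the discussion, everything else is illustrative:

```python
from pyspark.sql import SparkSession

# spark.local.dir is where Spark writes its shuffle / block-manager
# scratch data (the /tmp/blockmgr-* directories listed above). Pointing
# it at a dedicated partition keeps the root filesystem from filling up.
# It must be set before the SparkContext starts, hence on the builder.
spark = (
    SparkSession.builder
    .appName("example")
    .config("spark.local.dir", "/srv/spark-tmp")  # path proposed above
    .getOrCreate()
)
```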
[07:43:33] created a task
[07:47:37] I have also created https://gerrit.wikimedia.org/r/c/labs/tools/wikibugs2/+/737611 to instruct wikibugs to display Data-Engineering-related tags in here
[08:06:40] Good morning elukey - thanks a lot for this --^ <3
[08:08:46] bonjour :)
[08:10:27] elukey: Irene sent a question to Slack yesterday about a query timing out - I'm assuming the spark-tmp space issue is related
[08:13:21] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (WMDE-leszek) oh dear, your favorite facepalm meme could go here. Thanks @awight for pointing o...
[09:10:16] (Abandoned) David Caro: tox: Add python to the allowlist_externals [analytics/quarry/web] - https://gerrit.wikimedia.org/r/711134 (owner: David Caro)
[09:17:05] Analytics: Add user accounts to LDAP group `analytics-privatedata-users` - https://phabricator.wikimedia.org/T295352 (kai.nissen)
[09:35:04] mforns: it looks like the delay didn't work, maybe the sanitize job takes longer than 2 hours? cc btullis
[09:35:37] Analytics: Add user accounts to LDAP group `analytics-privatedata-users` - https://phabricator.wikimedia.org/T295352 (kai.nissen)
[09:35:45] milimetric: Yes, I wondered that. Thanks. Looking now.
[09:36:29] btullis: let me know if you wish to brainstorm
[09:39:12] joal: Thanks. I probably will. Still working through the bits of the puzzle for a bit first.
[10:00:30] I need to make sense of these `since` and `until` values.
[10:00:35] https://www.irccloud.com/pastebin/l6kxmdoY/
[10:02:29] btullis: https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/refine/Refine.scala#L168
[10:02:34] btullis: they are a number of hours
[10:02:44] The `refine_event_sanitized_analytics_delayed` job says start refining 46 days ago and stop 45 days ago.
[10:03:30] The `monitor_refine_event_sanitized_analytics_delayed` job says monitor from 47.08 days ago to 45 days ago.
[10:04:45] 1) Why only this small window and 2) Why the discrepancy between the two?
[10:06:24] btullis: 2) probably to cover for time passed between jobs? Not sure though
[10:06:44] btullis: 1) why small for the "monitor" you mean?
[10:09:45] joal: Thanks, what I mean is: why does either job only consider refining and checking a small window of only a day or so, around 45 days ago? What happens to the data at age 45 days?
[10:13:04] OK, I can see this from here: https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Backfilling#Backfilling_sanitization
[10:13:04] > Note that EL sanitization does already a second pass 45 days after data collection. So if the data that you want to backfill is not older than 45 days, you don't need to backfill it (will be done automatically after 45 days), unless it's urgent!
[10:13:20] btullis: I'm pasting the lines I found in puppet
[10:14:20] btullis: https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/analytics/refinery/job/refine_sanitize.pp#L86
[10:15:18] btullis: the default sanitization job happens just after the refine job, and then a second pass is made 45 days after, to cover for possible changes in upstream data (and also to backfill newly sanitized old schemas automatically)
[10:20:42] OK, thanks. This is coming together now, slowly.
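A quick sketch of how those since/until hour offsets map onto the windows quoted above; the figures just restate the day offsets from the conversation (the values actually configured in puppet are not shown in this log):

```python
from datetime import datetime, timedelta

# since/until are offsets in hours back from "now" (per Refine.scala above).
now = datetime.utcnow()

# refine_event_sanitized_analytics_delayed: 46 days ago -> 45 days ago
refine_since = now - timedelta(hours=46 * 24)      # 1104 hours ago
refine_until = now - timedelta(hours=45 * 24)      # 1080 hours ago

# monitor_refine_event_sanitized_analytics_delayed: 47.08 days ago ->
# 45 days ago, i.e. a slightly wider window than the job it watches.
monitor_since = now - timedelta(hours=47.08 * 24)  # ~1130 hours ago
monitor_until = now - timedelta(hours=45 * 24)
```

With both windows anchored ~45 days back and recomputed from "now" on each run, each day's run examines a roughly one-day slice (and, for the monitor, a 2.08-day slice) that slides forward daily.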
So this would explain why for the last 4 days we have been seeing the same pattern of hours showing as un-refined, but on consecutive days. The 2.08 day window is sliding forward each time.
[10:21:15] Tomorrow we would expect to receive an email with the first line (not the first chronologically, but the first lexically) of:
[10:21:22] `event_sanitized`.`searchsatisfaction` /wmf/data/event_sanitized/searchsatisfaction/year=2021/month=9/day=25/hour=10
[10:21:36] something similar to that btullis indeed --^
[10:22:45] and our problem is a misalignment between the refine_delayed job, happening every day and backfilling data from a newly sanitized old schema, and the refine_monitor_delayed job, checking that the data has been refined
[10:24:18] Indeed it is not present, but some hours are present for that day. They were refined at around 06:32 today.
[10:24:23] https://www.irccloud.com/pastebin/xKzG0V1c/
[10:25:23] btullis: let's batcave for simplicity?
[10:25:39] Yes. See you there.
[10:27:01] btullis: I have updated https://gerrit.wikimedia.org/r/c/analytics/wikistats2/+/737386
[11:46:04] CristianCantoro: Thanks. Actually, I don't yet have +2 rights on that repo. I think I need to get someone to add me to the Analytics group in Gerrit. 🙂
[11:54:38] This is the change that joal and I think is going to stop the failure of the delayed jobs: https://gerrit.wikimedia.org/r/737650
[11:55:47] folks I am moving kafka-test to the settings pre-PKI rollout (so new truststore, new admins, etc..)
[11:55:59] In summary, the monitor jobs were running *before* the refine jobs, since they took a default value for `monitor_interval` instead of taking an explicit value.
[11:56:31] elukey: ack - Thanks and keep us posted. Really keen to see how this works out.
[11:57:42] > stop the failure of the delayed job
[11:57:42] This should read:
[11:57:42] > stop the failure email from the delayed sanitize job monitor
[12:23:26] !log btullis@an-launcher1002:~$ sudo systemctl reset-failed monitor_refine_event_sanitized_analytics_delayed
[12:23:28] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[12:25:28] kafka test running with the new truststore!
[12:25:34] RECOVERY - Check unit status of monitor_refine_event_sanitized_analytics_delayed on an-launcher1002 is OK: OK: Status of the systemd unit monitor_refine_event_sanitized_analytics_delayed https://wikitech.wikimedia.org/wiki/Analytics/Systems/Managing_systemd_timers
[12:25:38] I filed a similar change for MirrorMaker
[12:52:00] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (daniel) After some discussion with Tim, it seems to me like the most realistic way to do this...
[13:05:01] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Ladsgroup) My problem is that we don't know why we want a solution in this direction. The unde...
[13:56:45] can someone with root access on stat1007 look at the journal for wmde-analytics-daily-early.service and tell me if there are any errors in the recent runs?
[13:56:55] I don’t have permission to access the journal and https://wikitech.wikimedia.org/wiki/WMDE/Analytics#Logs says I should ask in here :)
[13:59:52] milimetric, btullis :C I don't think the sanitization job takes more than 2 hours, because otherwise it would accumulate... it runs every hour.
[14:06:34] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Ottomata) Re CDC idea, pros and cons, see {T120242}, let's keep that discussion there. @danie...
[14:08:32] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Ottomata) @WMDE-leszek, if we can rely on event streams to source MW data for non-MW services....
[14:15:24] (CR) Ottomata: "Nice! glad it helped." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[14:16:55] (CR) Ottomata: [C: +1] "Might be nice to add a changelog.md or something to document what changed between versions." [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[14:22:43] mforns: Thanks. The `event_sanitize_[analytics|main]_immediate` jobs run every hour.
[14:23:21] The `event_sanitize_[analytics|main]_delayed` jobs run once per day.
[14:24:38] not sure if Lucas_WMDE's logs were served elsewhere, but if not, maybe btullis would be able to help? 🙂
[14:25:06] The main issue seems to be not in fact how long the `*_refine_delayed` jobs took to run, but that the associated monitor jobs were running *before* the refine jobs.
[14:25:30] Lucas_WMDE: urbanecm: Yes, I will. Hang on a sec...
[14:25:48] sure, just didn't want that to be lost in scrollback :))
[14:28:22] o/
[14:28:34] Lucas_WMDE: `PHP Fatal error: Maximum execution time of 3600 seconds exceeded in /srv/analytics-wmde/graphite/src/scripts/lib/WikimediaDb.php on line 68`
[14:28:41] :/
[14:28:49] is there a traceback?
[14:29:44] Interestingly, it doesn't show up with a failure code from systemctl, so it doesn't show up as a failure in Icinga.
[14:31:27] yeah, those scripts are pretty hacky and I don't think we propagate the exit code
[14:33:34] is that the only error?
[14:34:00] Lucas_WMDE: https://phabricator.wikimedia.org/P17712 Nothing confidential is there?
[14:34:21] probably not, checking
[14:34:59] looks good, thanks
[14:35:14] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (Ottomata)
[14:35:53] I can get you older logs as well, if you like, or we could set up something to write this to Logstash or similar if you think you'd like long-term access.
[14:39:19] ottomata: Do you want to catch up on the DB stuff, since I missed quite a bit yesterday?
[14:40:21] btullis: sure
[14:40:22] bc?
[14:40:25] I was wondering about using `pt-table-checksum` and `pt-table-sync` to try to bring an-coord1002 back in line with an-coord1001. https://www.percona.com/blog/2015/08/12/mysql-replication-primer-with-pt-table-checksum-and-pt-table-sync/
[14:40:32] See you there.
[14:41:29] btullis: thanks, I think that’s enough for now
[14:42:04] can analytics-privatedata-users access the journal?
(several WMDE colleagues seem to be in that group already)
[14:46:09] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Patch-For-Review: Refactor analytics-meta MariaDB layout to use an-db100[12] - https://phabricator.wikimedia.org/T284150 (Ottomata) a: Ottomata→BTullis
[14:46:19] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (Ottomata) a: Ottomata→BTullis
[14:47:52] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (Ottomata) p: Triage→High
[14:51:46] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (Ottomata) @jcrespo Oh, perhaps this is a question for you?
[14:51:49] Analytics, Data-Engineering, Data-Engineering-Kanban: Results have expired error in Hue - https://phabricator.wikimedia.org/T294144 (BTullis) @EYener - Superset is really the //state of the art// for us at the moment in terms of a distributed SQL UI and is where the majority of our efforts will be in...
[14:52:23] Analytics, Data-Engineering, Data-Engineering-Kanban: Results have expired error in Hue - https://phabricator.wikimedia.org/T294144 (BTullis) p: Triage→Medium
[14:54:05] joal: o/
[14:54:07] how goes?!
[14:54:32] Lucas_WMDE: i think not, but we should be able to make the journal go to a log file that could be accessed
[14:54:52] not sure that's worth the effort right now
[14:54:58] we should probably rethink how those scripts run anyways
[14:55:12] e.g. so that they also actually send an exit code to systemd if they fail
[14:55:42] one unit per script would probably be cleanest, instead of the timer just starting a shell script that runs script1 & script2 & script3 & etc & wait
[14:56:28] https://serverfault.com/a/814913 indicates just setting the storage to persistent should do the trick? Dunno if that's true, but if it is, it might actually be not that hard.
[14:57:29] that should solve the logs getting lost on reboot, if that's a problem
[14:57:29] hm, maybe!
[14:58:46] (another solution might be running script2 &> /some/dir/wmde/can/access/script2.log and manually logrotating, but the serverfault answer looks to be easier [at least without actually trying it])
[14:58:53] Hi ottomata - leaving for kids in minutes - mostly meetings today :S
[14:59:00] ok, lemme know if i can help with gobblin stuff
[14:59:31] sure ottomata - probably tomorrow given I spend time with razzi later tonight
[14:59:46] kay!
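On the exit-code point above ([14:55]): a sketch of a wrapper that would propagate failures to systemd; the script names are placeholders, and the current setup may differ in details not shown in this log:

```python
import subprocess
import sys

# Hypothetical wrapper. The "script1 & script2 & ... & wait" shell
# pattern exits 0 even when a script fails; returning the worst child
# exit status instead lets the systemd unit (and therefore Icinga)
# actually see the failure.
scripts = ["./script1", "./script2", "./script3"]  # placeholders
procs = [subprocess.Popen([script]) for script in scripts]
sys.exit(max(proc.wait() for proc in procs))
```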
[15:01:59] (CR) Phuedx: [C: +2] Add `init` as a valid enum for action field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[15:02:43] (Merged) jenkins-bot: Add `init` as a valid enum for action field [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[15:04:52] (CR) Phuedx: Add `init` as a valid enum for action field (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[15:32:08] Analytics, Data-Engineering, Data-Engineering-Kanban: Results have expired error in Hue - https://phabricator.wikimedia.org/T294144 (EYener) Thanks @BTullis - we do use Superset's SQL Lab, which is okay at showing a schema once you know the dataset you want to query, but not all we would like a SQL U...
[15:47:25] (PS2) Ottomata: Refine - don't remove records during deduplication if ids are null [analytics/refinery/source] - https://gerrit.wikimedia.org/r/735444 (https://phabricator.wikimedia.org/T294361)
[15:50:13] (CR) Ottomata: "Perhaps!" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/736933 (https://phabricator.wikimedia.org/T294738) (owner: Jdlrobson)
[15:51:25] joal: i updated https://gerrit.wikimedia.org/r/c/analytics/refinery/source/+/735444, it changes the behavior but i think it is more correct and also much simpler
[15:56:57] btullis: down to pair in a few?
[15:57:38] Yes, give me five minutes, then I'm there...
[16:09:54] razzi: I'm in the hangout, whenever you're ready.
[16:10:13] cool, brt
[16:30:06] !log set superset presto timeout to 170s: {"connect_args":{"session_props":{"query_max_run_time":"170s"}}} for T294771
[16:30:09] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:30:10] T294771: Increase Superset Timeout - https://phabricator.wikimedia.org/T294771
[16:30:22] !log set superset presto version to 0.246 in ui
[16:30:24] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:34:14] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Cparle) The replies above suggest that this ticket is indeed about events as **the** source of...
[16:34:24] ottomata: actually, I have to create a task for that!
[16:38:25] ottomata: https://phabricator.wikimedia.org/T295380
[16:39:51] thanks mforns
[16:52:16] mforns: what from the airflow-dags repo needs to be in the airflow PYTHONPATH?
[16:52:17] !log restart presto server on an-coord1001 to apply change for T292087
[16:52:19] Logged the message at https://www.mediawiki.org/wiki/Analytics/Server_Admin_Log
[16:52:20] T292087: Setup Presto UI in production - https://phabricator.wikimedia.org/T292087
[16:52:20] def dags_folder
[16:52:28] that will happen
[16:52:30] yes
[16:52:31] but, for common stuff?
[16:52:39] will it be just a single folder?
[16:52:58] there are several ways we could structure this
[16:53:02] also, we'll put custom operators and dag templates within a shared folder at the top level of the repo
[16:53:11] yes
[16:53:15] like 'common' or something?
[16:53:19] or wmf-common
[16:53:23] wmf_common
[16:53:24] i dunno
[16:53:26] and I imagine each team will want to have their own space for that too
[16:53:29] something like that
[16:53:31] hmm
[16:53:40] airflow will need to know about it, we could use the plugins_folder
[16:53:52] i was considering making symlinks from the airflow base dir into the deployment target
[16:54:04] as razzi mentioned, it would be cool that the common library is named in a way that imports look intuitive and elegant
[16:54:17] e.g. /srv/airflow-research/dags => /srv/deployment/airflow-dags/research-dags
[16:54:24] aye
[16:54:34] but for other common imports
[16:54:43] i'm wondering if a single symlink for e.g. plugins_folder would be enough?
[16:54:44] maybe not.
[16:54:51] i guess...
[16:55:05] /srv/airflow-research/plugins => /srv/deployment/airflow-dags/plugins
[16:55:12] in airflow2 plugins and libraries are separate
[16:55:26] before that, all was defined in plugins, but not any more
[16:55:37] yes, but plugins is just in PYTHONPATH automatically
[16:55:44] ah! understood
[16:55:45] alternatively, we could make sure PYTHONPATH is correct
[16:56:07] i think i'd like to make it so all the code is available within e.g. /srv/airflow-research
[16:56:14] so users don't have to think too hard about /srv/deployment/...
[16:56:25] I see
[16:56:40] so i'd symlink things, but we don't want to manage too many symlinks
[16:56:43] maybe, max 3
[16:56:48] 2 better
[16:57:03] dags, plugins, wmf_common ?
[16:59:07] hmm
[17:14:19] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (Ottomata) > The replies above suggest that this ticket is indeed about events as the source of...
[17:21:51] (PS3) Jdlrobson: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777)
[17:21:53] (CR) Jdlrobson: Restore ReadingDepth schema (3 comments) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[17:22:13] (CR) Razzi: [C: +1] Fix webchat link from Freenode to Libera [analytics/wikistats2] - https://gerrit.wikimedia.org/r/737386 (owner: CristianCantoro)
[17:24:53] razzi: o/ the refactor that you are talking about is likely to end up in spicerack, and it seems a little out of scope. We can talk with Riccardo about it
[17:25:17] maybe a little helper that returns different formats etc..
[17:25:27] Indeed, we can merge your patch as-is and add an upstream issue
[17:26:00] Ideally the ergonomics are such that the obvious thing (self.reason) does the right thing
[17:26:47] yep yep, we can discuss with SRE what's best
[17:35:12] Analytics, Data-Engineering, Data-Engineering-Kanban, User-razzi: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (razzi)
[17:59:50] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (BTullis) Following some direction from #data-persistence we have decided to use the `transfer.py` tool with th...
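Returning to the Airflow layout discussed above ([16:52]–[16:59]): a sketch of the symlink scheme being considered, assuming all names are the candidates floated in the conversation rather than final choices:

```python
import os

# Hypothetical per-instance layout: three symlinks from the instance
# directory into the scap deployment target.
deploy = "/srv/deployment/airflow-dags"
instance = "/srv/airflow-research"

links = {
    os.path.join(instance, "dags"): os.path.join(deploy, "research-dags"),
    # Airflow adds plugins_folder to PYTHONPATH automatically
    os.path.join(instance, "plugins"): os.path.join(deploy, "plugins"),
    # shared operators / dag templates ("wmf_common" is one candidate name)
    os.path.join(instance, "wmf_common"): os.path.join(deploy, "wmf_common"),
}

for link, target in links.items():
    if not os.path.islink(link):
        os.symlink(target, link)
```

This keeps everything a user needs visible under /srv/airflow-research while the actual code lives in the deployment target, per the discussion above.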
[18:01:34] (CR) Ottomata: Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[18:29:08] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (BTullis) It looks like this is the case. I can't copy a simple file between these two servers using `transfer....
[18:33:43] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (Ottomata) > Maybe I should just create the local snapshot on an-coord1001 I don't think there is room!
[18:34:07] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban: Recreate analytics-meta replica on db1108 from master on an-coord1001 - https://phabricator.wikimedia.org/T295312 (Ottomata) But, we should make transfer.py work between the VLANs anyway.
[18:49:14] (CR) Jdlrobson: Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[18:50:22] (CR) Jdlrobson: Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[18:53:29] (CR) Ottomata: [C: +1] Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[19:13:31] (CR) Jdlrobson: Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[19:31:32] does anyone have a link to phabricator instructions on how to create rules for automated tag removal/addition?
[19:33:08] olja_: https://www.mediawiki.org/wiki/Phabricator/Help/Herald_Rules
[19:37:48] majavah: ty
[19:47:03] yoohoo ottomata !
[19:47:16] Can you make me a gerrit repository for the presto query logger?
[19:47:40] Analytics, Analytics-EventLogging, Epic: Epic: Engineer manages throughput of events for his schema - https://phabricator.wikimedia.org/T75941 (odimitrijevic) Open→Declined Closing due to inactivity.
[19:48:12] Analytics, Analytics-EventLogging, Epic: Epic: ProductManager visualizes EL data - https://phabricator.wikimedia.org/T75068 (odimitrijevic) Open→Declined Closing due to inactivity
[19:50:39] Analytics, Analytics-Features: Request feature - build and populate database using a LocalSettings.php file - https://phabricator.wikimedia.org/T106340 (odimitrijevic) Open→Declined
[20:12:48] (CR) Ottomata: [C: +1] Restore ReadingDepth schema (1 comment) [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[20:37:55] Analytics, Data-Engineering, Data-Engineering-Kanban, Epic: Alluxio for Improved Superset Query Performance - https://phabricator.wikimedia.org/T288252 (razzi)
[20:37:57] Analytics, Analytics-Kanban, Data-Engineering, Data-Engineering-Kanban, Epic: Analytics Presto improvements - https://phabricator.wikimedia.org/T266639 (razzi)
[20:38:03] Analytics, Data-Engineering, Data-Engineering-Kanban, User-razzi: Add a presto query logger - https://phabricator.wikimedia.org/T269832 (razzi) Open→In progress
[20:49:42] (CR) Clare Ming: [C: +1] "Tested this locally with WME patch 735690 (with a slight change from WME patch 737530)" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[20:53:33] (CR) Ottomata: [C: +1] "Oh, as discussed in some other recent tickets, perhaps this should be named analytics/mediawiki/web_ui_reading_depth? (Unless this is al" [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[21:07:23] mforns: still there?
[21:19:09] (PS4) Jdlrobson: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777)
[21:19:53] (CR) Ottomata: [C: +1] Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[21:20:34] (CR) jerkins-bot: [V: -1] Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[21:26:27] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Jclark-ctr) Host was still in rack D6 U7; verified location and relocated to B1 U29, port 20, cable ID #1935
[21:27:30] Analytics-Clusters, DC-Ops, SRE, ops-eqiad: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (Jclark-ctr) a: Jclark-ctr→Cmjohnson
[21:39:35] (PS5) Jdlrobson: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777)
[22:13:51] heya ottomata, still here
[22:15:13] heya
[22:15:15] :)
[22:15:19] so we need to create more repos
[22:15:25] :((((
[22:15:27] we'll need a 'scap' repo for each airflow instance
[22:15:43] usually, teams won't ever have to think about it
[22:15:45] oh, wow...
[22:16:00] we have this for refinery...but we only have one 'deployment' of refinery
[22:16:26] we did this years ago for eventlogging/eventbus vs eventlogging/analytics
[22:16:33] I see
[22:16:41] different deployments of the python eventlogging service using the same repo
[22:17:04] the only real difference in the scap repos will be the scap.cfg file and the list of ssh targets to deploy to
[22:17:10] i wish this could all be managed by puppet...but it isn't
[22:17:36] I see, well if it's necessary, let's!
[22:18:18] Can't GitLab CI ever be an alternative?
[22:18:19] naming?
[22:18:24] to scap?
[22:18:29] yes
[22:18:42] i asked tyler today, i think there is no plan to support GitLab for deployments like this
[22:18:47] they are focusing on deployment pipeline stuff
[22:18:50] e.g. k8s
[22:18:53] and docker images
[22:19:02] if we could use that we would
[22:19:02] but
[22:19:12] kerberos barrier...i think i want to start saying: kerbarrier.
[22:19:13] haha
[22:19:20] pipelinelib is the thing that will mostly replace scap3, but it also requires switching to k8s and the whole helmfile mess
[22:19:23] xDDD
[22:19:31] bd808: i'd love to do that if we good
[22:19:32] but
[22:19:35] kerbarrier
[22:19:53] k8s does not work with kerberos, which means k8s cannot talk to Hadoop
[22:20:13] 'we good'(?) meant to type 'we could'
[22:20:15] mforns:
[22:20:16] naming
[22:20:18] was considering
[22:20:21] data-engineering/airflow-dags-scap-analytics
[22:20:22] and
[22:20:27] data-engineering/airflow-dags-scap-research
[22:20:28] etc.
[22:20:34] or, we could put the scap repos in those team namespaces?
[22:20:35] but.. isn't gmodena already deploying dags to their airflow instance with CI?
[22:20:43] isn't kerberos auth at the application level and not the runtime environment level?
[22:20:51] (PS6) Jdlrobson: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777)
[22:21:01] bd808: i don't fully understand it, but i think it has something to do with per-node auth.
[22:21:07] and nodes in k8s are ephemeral
[22:21:27] (CR) Clare Ming: [C: +2] Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[22:22:28] (Merged) jenkins-bot: Restore ReadingDepth schema [schemas/event/secondary] - https://gerrit.wikimedia.org/r/737527 (https://phabricator.wikimedia.org/T294777) (owner: Jdlrobson)
[22:22:40] anyway ottomata, I completely trust your view; re. naming: data-engineering/airflow-dags-scap- sounds good to me!
[22:23:03] i don't love these team namespaces in repository names.................but oh well
[22:23:16] hm..
[22:23:20] (data-engineering)
[22:23:50] (i'm ok with the scap- bit, although more specifically it is scap-)
[22:24:15] ottomata: *nod* I suppose you would need something like https://streamsets.com/blog/automating-kerberos-keytab-generation-for-kubernetes-based-deployments/ or some other in-cluster privileged operator to provision credentials.
[22:24:35] yeah...something.
[22:24:44] aha, makes sense. Nevertheless, since those repos are not going to be used a lot, the name matters less I'd say
[22:24:50] yeah
[22:25:04] mforns: i don't seem to be able to create repos in data-engineering
[22:25:06] can you?
[22:25:16] I think so!
[22:25:23] buh how come I can't?!?!
[22:25:26] i wanna!
[22:25:35] I can create them, but I need to add you as an admin!
[22:25:37] wait...
[22:27:54] ottomata: ok, can you try now?
[22:28:40] yes!
[22:28:46] hmmm
[22:28:52] people/wmf-team-data-engineering
[22:28:53] ?
[22:29:16] i see that in my Groups dropdown for Project URL
[22:29:19] is that right?
[22:29:27] yes, you belong to that group, but that group belongs to the data-engineering group, and is the admin of that namespace
[22:29:38] ok, but will that be the project URL?
[22:29:40] no, right?
[22:29:46] I think to create a repo under data-engineering
[22:29:58] you have to... wait, remembering
[22:30:45] brennen just moved things I think. " < brennen> !log gitlab: finished moving releng, security, data-engineering projects to /repos (T292094)"
[22:30:46] T292094: Limit GitLab shared runners to trusted contributors - https://phabricator.wikimedia.org/T292094
[22:32:17] bd808: ...not sure what that means but okayyyyy
[22:32:19] :)
[22:33:49] ottomata: go here and click on New project: https://gitlab.wikimedia.org/repos/data-engineering
[22:34:02] nice!
[22:35:33] worked, thank you
[22:36:38] Analytics, Data-Engineering, Event-Platform, Platform Engineering, tech-decision-forum: MediaWiki Events as Source of Truth - Decision Statement Overview - https://phabricator.wikimedia.org/T291120 (awight) >>! In T291120#7492406, @Ottomata wrote: > In either case, others have often argued th...
[22:36:46] That URL is a bit hidden, I can only access it by going to an existing project and clicking the breadcrumbs
[22:37:34] so so, i guess this is the PR model?
[22:37:38] i should fork and push and make a PR?
[22:40:00] mforns:
[22:40:00] https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-analytics/-/merge_requests/1
[22:40:02] :)
[22:40:10] ok i gotta run for the eve, will continue tomorrow, making some progress!
[22:41:04] :] yes, it's called a merge request, but it's the same as a PR. I think you don't need to create a fork though, you can just push a branch to the main repo
[22:41:17] see you!!!
[23:32:15] This is my idea for dealing with the Kerberos/K8s issue: mount keytabs into pods using volume mounts.
[23:32:16] https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/#create-a-pod-that-has-access-to-the-secret-data-through-a-volume
[23:32:16] The same approach would apply if we were to use a Docker-based GitLab runner.
[23:32:16] We don't permanently store any keytabs in the containers, but they would be injected at runtime using volumes.
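A sketch of that idea using the official kubernetes Python client; the secret name, image, and mount path are all hypothetical:

```python
from kubernetes import client

# All names here are hypothetical. The point is that the keytab lives
# in a k8s Secret and is projected into the pod at runtime, so it is
# never baked into the container image.
keytab_volume = client.V1Volume(
    name="keytab",
    secret=client.V1SecretVolumeSource(secret_name="analytics-keytab"),
)

container = client.V1Container(
    name="hadoop-client",
    image="example/hadoop-client:latest",  # hypothetical image
    volume_mounts=[
        client.V1VolumeMount(
            name="keytab",
            mount_path="/etc/security/keytabs",  # kinit would read from here
            read_only=True,
        )
    ],
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="kerberized-task"),
    spec=client.V1PodSpec(containers=[container], volumes=[keytab_volume]),
)
```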