[09:58:47] <dcausse>	 errand + lunch
[09:58:50] <gehel>	 dcausse/ pfischer: I've created T347560 to grant deploy access to our apps. Would you have the list of applications to which you need access? Could you add it directly to the ticket?
[09:58:51] <stashbot>	 T347560: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560
[09:58:54] <gehel>	 lunch 2!
[10:51:06] <pfischer>	 dcausse: sure.
[10:51:47] <pfischer>	 dcausse: if you are looking for some post-lunch-reviews, I’ve got you covered ;-) https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests
[12:11:29] <dcausse>	 looking :)
[12:25:57] <dcausse>	 gehel: actually I think we all can deploy k8s apps already, Peter, Erik and myself are already in the deployment group
[12:26:44] <dcausse>	 unless T347560 discuss about something else than helmfile apply on the k8s namespaces we'll own (wdqs updater and sup)
[12:26:44] <stashbot>	 T347560: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560
[12:27:18] <dcausse>	 I don't think there are per-namespace perms (other than admin_ng I think)
[12:37:01] <gehel>	 Oh, that might be me not understanding the deployment process on wikikube.
[12:37:45] <gehel>	 I was talking about that with ebernhardson and he was under the impression that he required SRE to step in to deploy the helm changes he is working on 
[12:38:00] <gehel>	 So maybe we don't need anything at all.
[12:38:24] <gehel>	 Or I might have misunderstood what Erik was talking about 
[12:38:31] <dcausse>	 some of it requires ops perms (admin_ng)
[12:38:42] <dcausse>	 but I don't think we should have perms for that one
[12:39:05] <dcausse>	 for doing normal "app deploys" we should be good
[12:39:53] <dcausse>	 Janis confirmed that as long as we're in the "deployment" group we can deploy "services", that means all apps in wikikube
[12:41:34] <dcausse>	 perhaps this is about having +2 perms on deployment-charts?
[12:41:44] <dcausse>	 I have +2 so I think Erik should have too
[12:43:09] <dcausse>	 pfischer: mind checking if you have +2 perms on deployment-charts (e.g. can you see the CR+2 button on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959066)
[13:19:29] <inflatador>	 <o/
[13:24:26] <dcausse>	 o/
[13:24:30] <pfischer>	 o/
[13:28:58] <pfischer>	 dcausse: I only have +1 rights
[13:32:57] <pfischer>	 dcausse: BTW: thank you for your reviews! I rebased/amended the remaining PRs, the config one exposes the options for —pipeline.[max-]parallelism
[13:33:23] <dcausse>	 yes thanks, was digging into this
[13:40:23] <inflatador>	 dcausse planning on incrementing flink-app chart number per https://phabricator.wikimedia.org/T347521 , LMK if this would cause problems
[13:42:04] <dcausse>	 inflatador: not sure I understand, is there a specific change you want to expose?
[13:42:54] <inflatador>	 dcausse no, trying to get helmfile to redeploy `mw-page-content-change-enrich` app, right now an apply does nothing
[13:43:23] <inflatador>	 so we're not touching our experiment in this case, this is for the prod version of mw-page-content-change-enrich
[13:44:39] <dcausse>	 this is weird, are there debug flags to helmfile?
[13:45:07] <dcausse>	 I have no objection updating the chart version but I don't see how this could help?
[13:45:46] <dcausse>	 there are no resources in k8s so no chart version to compare
[13:47:17] <inflatador>	 I'm not sure it would help either...helmfile's help doesn't list a verbose option. it has "context" but that doesn't do anything 
[13:49:28] <dcausse>	 I see network policies with one updated 28h ago
[13:49:42] <dcausse>	 should have been deleted with helmfile destroy I guess
[13:50:06] <inflatador>	 the staging version of the app was broken yesterday, but I was able to destroy + apply and fix it
[13:50:19] <inflatador>	 the prod codfw version is healthy too
[13:50:49] <dcausse>	 can you run destroy on eqiad to see if it removes the network policies?
[13:51:30] <inflatador>	 Just did it
[13:52:17] <dcausse>	 ok one is gone
[13:52:28] <inflatador>	 I see a diff now, that looks promising
[13:52:32] <dcausse>	 so it seems to partially deploy things
[13:52:45] <inflatador>	 Applying now
[13:52:56] <dcausse>	 the full diff?
[13:53:14] <dcausse>	 I see the flinkdeployment
[13:53:46] <inflatador>	 watching it now
[13:54:40] <inflatador>	 how did you see it was in a partially deployed state?
[13:55:43] <dcausse>	 helmfile is supposed to populate k8s resources, deployment, networkpolicy and plenty of other things I don't know
[13:56:59] <dcausse>	 when you see things in "kubectl get networkpolicy" but nothing in others it seems to me that something was partially deployed or partially undeployed
[13:58:04] <dr0ptp4kt>	 heads up, i think the wdqs big dump from the 18th has a glitch in it. i'm checking if the newer one is any better. i'll file a ticket later. fortunately it didn't happen in the first segment of the file so importing with loadData.sh could at least happen on one segment :P
[13:58:20] <inflatador>	 It appears to be healthy now. Still not exactly sure what happened, but thanks again for your help!
[13:58:49] <dcausse>	 inflatador: not sure to understand why we ended up in this state... a bit concerning :/
[13:59:02] <dcausse>	 dr0ptp4kt: thanks for the heads up!
[13:59:38] <inflatador>	 Y, I need to go back and read what g-modena did for troubleshooting. My guess is that it was already broken and he was doing destroy + apply cycles
[13:59:44] <dr0ptp4kt>	 inflatador: do we happen to have a .jnl i can scp down? the cloudflare r2 download keeps breaking up, which is somewhere in the internet that's not my 1 Gbps ISP as best I can tell. i'm worried about saturating the link but i think i could rate limit myself if that would help
[14:00:45] <inflatador>	 dr0ptp4kt are you using add-shore's JNL file from R2?
[14:01:50] <dr0ptp4kt>	 that's the one i'm trying to download - R2 seems to be dropping the connection. the downloader isn't able to resume the download in the middle, either (i haven't checked their range handling, but i'm guessing there's _something_ on their end that struggles with resumes - not surprising because often that needs to live in full memory at the edge in many cache configurations)
[14:01:55] <dr0ptp4kt>	 gotta go, meeting...
[14:03:42] <inflatador>	 dr0ptp4kt  np, if it's an r2.dev address it will be rate-limited. You have to put an actual DNS name up there...not sure if he did that, but I can get you a fresh JNL file and put it up on my R2 domain. No idea if it will fix the download problems, but will give it a shot
[14:06:03] <inflatador>	 nope, that's not it...he is using his domain now
[14:07:22] <inflatador>	 I'll try and download it on my end. Probably won't help, but will be a good exercise if we have to open a support ticket with CF
[14:19:53] <inflatador>	 using axel to download, probably be a few hours if it actually completes
[14:21:23] <inflatador>	 Hmm, I guess the JNL file is uncompressed? Unless it has some built-in compression I'm not aware of
[14:30:15] * inflatador is getting an underwhelming 240Mbps download speed from CF on my 1Gbps connection
[14:40:40] <ebernhardson>	 \o
[14:49:29] <dcausse>	 o/
[14:51:13] <inflatador>	 <o/
[14:58:54] <dr0ptp4kt>	 inflatador: for me the .jnl is stopping after 100 GB - 200 GB (don't think it's OS RAM or whatever). comically, on the machine (win11 desktop) i was first bit by the hard drive going to sleep after inactivity. but i updated that setting (strange that an in-flight download to a drive is considered inactive, but whatever!) and know it's not that - because i did couple 8 hour downloads of the .ttl.gz files overnight a couple times now
[14:59:32] <dr0ptp4kt>	 my mbp doesn't have enough space and you know how those can be with going to sleep despite one's insistence in Settings
[15:00:33] <dr0ptp4kt>	 i do have that nice external 2 gb ssd now, though, in addition to the internal sata for the desktop, so i'm on the path to hacked up raid. i guess i have a 2gb platter disk that's free :P
[15:01:17] <inflatador>	 I have too many barely-used external disk drives . Using a 5TB spinning disk to hold the dump file
[15:01:29] <inflatador>	 I'm at ~100GB so we'll see if mine falls apart too
[15:01:47] <dr0ptp4kt>	 i think the jnl is only "compressed" in the sense of blazegraph's internal data structures; i don't think that it's gzipped, no, although the web server probably applies some of that in practice (although maybe not, i mean that thing is huge, it would have to reliably pipe to do it, blah blah blah)
[15:03:10] <gehel>	 Trey314159, inflatador: retrospective in https://meet.google.com/eki-rafx-cxi
[15:03:21] <dr0ptp4kt>	 inflatador, are you using a browser, curl, or wget or some other? i'm thinking maybe i should try to download it while i'm on the mac today, and pray for the machine to not put itself to sleep despite being on good electricity charger.
[15:03:48] <dr0ptp4kt>	 (oops, sorry to interrupt retro; my thought for me is to go to those later as i'm more active in the wdqs stuff)
[15:04:01] <gehel>	 Trey314159: sorry, did not remember you're out :/
[16:06:18] <inflatador>	 Workout, back in ~40
[16:08:06] <ryankemper>	 dr0ptp4kt: for macos I like to use this app to prevent sleeping for stuff like that: https://apps.apple.com/us/app/amphetamine/id937984704?mt=12
[16:08:38] * ryankemper had to be careful to avoid uttering the phrase "I use amphetamine" :P
[16:11:52] <dr0ptp4kt>	 i've become so paranoid about installing apps, but maybe just maybe that one would be worth it. and it seems like the eng behind it is for real? https://www.theverge.com/2021/1/2/22210295/apple-developer-amphetamine-app-violate-drug-app-store-rules 
[16:47:16] <inflatador>	 LOL...there's also "caffeinate" which I believe it a MacOS built-in
[16:47:37] <inflatador>	 also back
[17:02:14] * ebernhardson notes that chatgpt is useless for puppet since it doesn't understand that hashes became immutable since it was trained :P
[17:05:29] <inflatador>	 LOL
[17:05:53] <inflatador>	 that reminds me...need to merge that partman recipe
[17:06:16] <ebernhardson>	 maybe i should stop +1'ing those, i look and it seems plausible then mor.itz sees all the problems :p
[17:09:13] <inflatador>	 I think it's fine, as long as the blast radius is limited to us...just my opinion though
[17:11:05] <ebernhardson>	 random thought on chatgpt...with a human i can show them the right way and feel like helped something...but with chatgpt it just gives you wrong answers and doesn't consider your corrections except in the hyper-local context
[17:12:57] <inflatador>	 I haven't tried clicking the thumbs-down button yet
[17:15:16] <inflatador>	 OK, reimaging cloudelastic1007...we'll see if the recipe works this time
[17:17:39] <inflatador>	 Lunch, back in ~1h
[17:25:06] <dr0ptp4kt>	 well, the munge.sh on the newer big wikidata dump is at 28m records now and still chugging along, whereas the previous one went boom at 14m, so that's a start.
[17:31:52] <dcausse>	 dr0ptp4kt: if you still have the error of the previous error that would be helpful, we seem to have munged & loaded the 20230918 dump in hdfs (haven't double checked that the data is ok tho, just saw the partition)
[17:32:04] <dcausse>	 s/previous error/previous run/
[17:40:21] <dcausse>	 dinner
[17:43:28] <dr0ptp4kt>	 🙏
[18:12:10] <inflatador>	 back
[18:23:05] <inflatador>	 FWiW, gzipping the main wikidata JNL file takes ~4 hours using pigz at maximum compression rate, and we end up with a ~400 GB file compared to 1.2 TB uncompressed 
[18:29:16] * ebernhardson mutters are the awkwardness of helm templates
[18:29:34] <ebernhardson>	 you can't do something like {{ range [.foo, .bar] }}, because that would be too easy :P
[18:29:35] <inflatador>	 d-causse do you have any tickets/patches up yet for pipeline filtering (what we talked about with our experimental app affecting kafka-main)?
[18:33:53] <ryankemper>	 inflatador: gehel: in https://meet.google.com/eki-rafx-cxi?authuser=1
[18:34:51] <gehel>	 joining
[18:36:45] <dr0ptp4kt>	 ah, after a very quick scan of the reload, i suspect i should thow --skolemize at the thing for the file from the 18th perhaps, leaning on a hunch looking at the reload cookbook. i haven't walked the call graph of munging yet, but i can pick back up from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/dags/import_ttl.py to see if there's anything noteworthy where import
[18:36:58] <gehel>	 ebernhardson: David was saying that you should already have deployment rights to wikikube by being in the wmf deployment group. Not sure how all that works, but could you have a look?
[18:37:00] <dr0ptp4kt>	 would succeed in one place for the dag but not work so nicely for the linear scan
[18:37:24] <dcausse>	 inflatador: yes this is https://phabricator.wikimedia.org/T347515, made a quick patch to workaround the issue at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/961858
[18:37:43] <ebernhardson>	 gehel: i will poke around and see
[18:37:58] <ebernhardson>	 apparently i don't know how it works either :)
[18:38:40] * ebernhardson is continuously surprised how helm can be so popular...and yet not even give line numbers that align with the code we write in its error messages...
[18:38:48] <dcausse>	 the thing to check is also +2 perms on deployment-charts, Peter does not seem to have them but he probably should
[18:39:05] <dr0ptp4kt>	 inflatador just saw your note on the compression. that's not bad at all! how is your download looking btw, did it suceed/fail? (sorry if i missed a note)
[18:39:32] <ebernhardson>	 dcausse: huh, apparently i do have +2 there and didnt notice
[18:40:25] <dcausse>	 not sure where they come from I assumed that this is because we're in the deployment group but no since Peter is in this group but does not have them 
[18:41:42] <dcausse>	 ebernhardson: if you to test/play with helm deployments fell free to use the rdf-streaming-updater in staging
[20:06:15] <inflatador>	 dr0ptp4kt still going, I'm at 68%, axel says ~3h left
[20:14:49] <inflatador>	 d-causse thanks for the link, didn't wanna ping ya in the middle of your night. LMK if/when the code is ready to deploy, happy to help if necessary
[20:14:56] <inflatador>	 break, back in ~20
[20:19:49] <ebernhardson>	 inflatador: i merged davids patch, would need to run the processes to release the jar (and docker img?), and then update the staging release to provide the new arg
[20:57:46] <inflatador>	 ebernhardson cool, I'll take a look
[21:03:42] <inflatador>	 OK, jar she's building
[21:19:09] * dr0ptp4kt slightly jealous of inflatador's stable connection
[21:35:10] <inflatador>	 don't jinx me ;P Still about 90 minutes to go
[21:39:40] <inflatador>	 OK, the jar build is done. Let's see if I can remember how to build a new docker img
[21:40:19] <inflatador>	 https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater
[21:51:28] <inflatador>	 hmm, looks like I don't have +2 perms on https://gerrit.wikimedia.org/g/operations/docker-images/production-images
[22:03:46] <ebernhardson>	 inflatador: hmm
[22:04:26] <inflatador>	 I swear I've published an image there before. Maybe you have to use `docker-pkg` and it does all the git stuff for you 
[22:04:33] <ebernhardson>	 you certainly used to: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/908486
[22:05:48] <ebernhardson>	 ACL's say inherit from operations/puppet and allow ldap/ops
[22:05:57] <ebernhardson>	 so...you really should have access
[22:06:28] <inflatador>	 I guess it's a GUI thing
[22:06:49] <ebernhardson>	 it has +2 in the submit review modal, but not +2 on the main page?
[22:06:49] <inflatador>	 if I hit the "reply" button it seems I still have the perms
[22:06:53] <inflatador>	 Y
[22:07:01] <ebernhardson>	 i've seen that before...but not sure what causes it
[22:07:38] <inflatador>	 Oh well, I'm going to shove off for the day...will build the flink and the new rdf-streaming-updater images tomorrow
[22:08:56] <ebernhardson>	 actually thinking about it, reasonable chance the difference is V+2
[22:09:15] <ebernhardson>	 would have to test, but i think it only gives the CR+2 button if there is a V+2 vote already
[22:09:23] <inflatador>	 oh and there probably needs to be a patch to production-images for the rdf-streaming-updater stuff too...and if/how https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater is part of the process...
[22:31:37] * ebernhardson wants higher-order functions in helm...not going to happen :P
[22:35:24] * ebernhardson was wrong, helm has a map function. hooray!
[22:38:15] <inflatador>	 ooooh, that's news to me too
[22:38:37] <ebernhardson>	 still determining if it does what i need, maybe :)
[23:01:01] <ebernhardson>	 no, it has nothing useful. I just need to write the same code twice :P