[09:58:47] errand + lunch
[09:58:50] dcausse / pfischer: I've created T347560 to grant deploy access to our apps. Would you have the list of applications to which you need access? Could you add it directly to the ticket?
[09:58:51] T347560: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560
[09:58:54] lunch 2!
[10:51:06] dcausse: sure.
[10:51:47] dcausse: if you are looking for some post-lunch reviews, I've got you covered ;-) https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests
[12:11:29] looking :)
[12:25:57] gehel: actually I think we can all deploy k8s apps already; Peter, Erik and myself are already in the deployment group
[12:26:44] unless T347560 is about something other than helmfile apply on the k8s namespaces we'll own (wdqs updater and sup)
[12:26:44] T347560: Deployment access for Search Platform SWE on Flink WDQS and Search pipelines - https://phabricator.wikimedia.org/T347560
[12:27:18] I don't think there are per-namespace perms (other than admin_ng, I think)
[12:37:01] Oh, that might be me not understanding the deployment process on wikikube.
[12:37:45] I was talking about that with ebernhardson and he was under the impression that he required SRE to step in to deploy the helm changes he is working on
[12:38:00] So maybe we don't need anything at all.
[12:38:24] Or I might have misunderstood what Erik was talking about
[12:38:31] some of it requires ops perms (admin_ng)
[12:38:42] but I don't think we should have perms for that one
[12:39:05] for doing normal "app deploys" we should be good
[12:39:53] Janis confirmed that as long as we're in the "deployment" group we can deploy "services", which means all apps in wikikube
[12:41:34] perhaps this is about having +2 perms on deployment-charts?
[12:41:44] I have +2, so I think Erik should have it too
[12:43:09] pfischer: mind checking if you have +2 perms on deployment-charts (e.g. can you see the CR+2 button on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959066)?
[13:19:29] o/
[13:24:30] o/
[13:28:58] dcausse: I only have +1 rights
[13:32:57] dcausse: BTW: thank you for your reviews! I rebased/amended the remaining PRs; the config one exposes the options for --pipeline.[max-]parallelism
[13:33:23] yes thanks, was digging into this
[13:40:23] dcausse: planning on incrementing the flink-app chart number per https://phabricator.wikimedia.org/T347521, LMK if this would cause problems
[13:42:04] inflatador: not sure I understand, is there a specific change you want to expose?
[13:42:54] dcausse: no, trying to get helmfile to redeploy the `mw-page-content-change-enrich` app; right now an apply does nothing
[13:43:23] so we're not touching our experiment in this case, this is for the prod version of mw-page-content-change-enrich
[13:44:39] this is weird, are there debug flags for helmfile?
[13:45:07] I have no objection to updating the chart version, but I don't see how this could help?
[13:45:46] there are no resources in k8s, so no chart version to compare
[13:47:17] I'm not sure it would help either... helmfile's help doesn't list a verbose option. It has "context" but that doesn't do anything
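A hedged sketch of getting more visibility in a "helmfile apply does nothing" state like the one above. helmfile has no --verbose flag, but it does take a global --log-level; the namespace and release names here are assumptions for illustration:

```bash
# Turn up helmfile's own logging while diffing against the cluster:
helmfile --log-level debug -e eqiad diff

# Ask helm directly whether it still tracks a release for the service
# (namespace name is hypothetical):
helm -n mw-page-content-change-enrich list --all

# Compare with what actually exists in the namespace:
kubectl -n mw-page-content-change-enrich get networkpolicy,deployments,flinkdeployments
```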
[13:49:28] I see network policies, with one updated 28h ago
[13:49:42] they should have been deleted with helmfile destroy, I guess
[13:50:06] the staging version of the app was broken yesterday, but I was able to destroy + apply and fix it
[13:50:19] the prod codfw version is healthy too
[13:50:49] can you run destroy on eqiad to see if it removes the network policies?
[13:51:30] Just did it
[13:52:17] ok, one is gone
[13:52:28] I see a diff now, that looks promising
[13:52:32] so it seems to partially deploy things
[13:52:45] Applying now
[13:52:56] the full diff?
[13:53:14] I see the flinkdeployment
[13:53:46] watching it now
[13:54:40] how did you see it was in a partially deployed state?
[13:55:43] helmfile is supposed to populate k8s resources: deployments, network policies, and plenty of other things I don't know
[13:56:59] when you see things in "kubectl get networkpolicy" but nothing in the others, it seems to me that something was partially deployed or partially undeployed
[13:58:04] heads up, i think the wdqs big dump from the 18th has a glitch in it. i'm checking if the newer one is any better. i'll file a ticket later. fortunately it didn't happen in the first segment of the file, so importing with loadData.sh could at least happen on one segment :P
[13:58:20] It appears to be healthy now. Still not exactly sure what happened, but thanks again for your help!
[13:58:49] inflatador: not sure I understand why we ended up in this state... a bit concerning :/
[13:59:02] dr0ptp4kt: thanks for the heads up!
[13:59:38] Y, I need to go back and read what g-modena did for troubleshooting. My guess is that it was already broken and he was doing destroy + apply cycles
[13:59:44] inflatador: do we happen to have a .jnl i can scp down? the cloudflare R2 download keeps breaking up, somewhere in the internet that's not my 1 Gbps ISP as best i can tell. i'm worried about saturating the link, but i think i could rate limit myself if that would help
[14:00:45] dr0ptp4kt: are you using add-shore's JNL file from R2?
[14:01:50] that's the one i'm trying to download - R2 seems to be dropping the connection. the downloader isn't able to resume the download in the middle, either (i haven't checked their range handling, but i'm guessing there's _something_ on their end that struggles with resumes - not surprising, because that often needs to live in full memory at the edge in many cache configurations)
[14:01:55] gotta go, meeting...
[14:03:42] dr0ptp4kt: np, if it's an r2.dev address it will be rate-limited. You have to put an actual DNS name on it... not sure if he did that, but I can get you a fresh JNL file and put it up on my R2 domain. No idea if it will fix the download problems, but I'll give it a shot
[14:06:03] nope, that's not it... he is using his own domain now
[14:07:22] I'll try to download it on my end. Probably won't help, but it will be a good exercise if we have to open a support ticket with CF
[14:19:53] using axel to download; probably be a few hours, if it actually completes
[14:21:23] Hmm, I guess the JNL file is uncompressed? Unless it has some built-in compression I'm not aware of
[14:30:15] * inflatador is getting an underwhelming 240 Mbps download speed from CF on my 1 Gbps connection
[14:40:40] \o
[14:49:29] o/
[14:51:13] inflatador: for me the .jnl is stopping after 100-200 GB (don't think it's OS RAM or whatever). comically, on the machine (win11 desktop) i was first bitten by the hard drive going to sleep after inactivity. but i updated that setting (strange that an in-flight download to a drive is considered inactive, but whatever!) and now i know it's not that - because i did a couple of 8-hour downloads of the .ttl.gz files overnight, a couple of times now
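A minimal sketch of a resume-friendly retry loop for a flaky transfer like this, assuming the server honors HTTP Range requests; the URL is a placeholder:

```bash
# curl -C - continues from wherever the partial file left off;
# keep retrying until the transfer finally completes (URL is hypothetical):
until curl -L -C - -o wikidata.jnl 'https://dumps.example.org/wikidata.jnl'; do
    echo "transfer dropped, retrying..." >&2
    sleep 10
done
```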
[14:59:32] my mbp doesn't have enough space, and you know how those can be about going to sleep despite one's insistence in Settings
[15:00:33] i do have that nice external 2 TB ssd now, though, in addition to the internal sata on the desktop, so i'm on the path to a hacked-up raid. i guess i have a 2 TB platter disk that's free :P
[15:01:17] I have too many barely-used external disk drives. Using a 5 TB spinning disk to hold the dump file
[15:01:29] I'm at ~100 GB, so we'll see if mine falls apart too
[15:01:47] i think the jnl is only "compressed" in the sense of blazegraph's internal data structures; i don't think it's gzipped, no, although the web server probably applies some of that in practice (although maybe not - i mean, that thing is huge, it would have to reliably pipe to do it, blah blah blah)
[15:03:10] Trey314159, inflatador: retrospective in https://meet.google.com/eki-rafx-cxi
[15:03:21] inflatador, are you using a browser, curl, wget, or something else? i'm thinking maybe i should try to download it while i'm on the mac today, and pray for the machine to not put itself to sleep despite being on the charger.
[15:03:48] (oops, sorry to interrupt retro; my thought is to go to those later, as i'm more active in the wdqs stuff)
[15:04:01] Trey314159: sorry, did not remember you're out :/
[16:06:18] Workout, back in ~40
[16:08:06] dr0ptp4kt: for macOS I like to use this app to prevent sleeping for stuff like that: https://apps.apple.com/us/app/amphetamine/id937984704?mt=12
[16:08:38] * ryankemper had to be careful to avoid uttering the phrase "I use amphetamine" :P
[16:11:52] i've become so paranoid about installing apps, but maybe, just maybe, that one would be worth it. and it seems like the eng behind it is for real? https://www.theverge.com/2021/1/2/22210295/apple-developer-amphetamine-app-violate-drug-app-store-rules
[16:47:16] LOL... there's also "caffeinate", which I believe is a macOS built-in
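It is indeed built in. A minimal sketch of wrapping a long download with it, flags per the stock macOS tool; the URL is a placeholder:

```bash
# caffeinate keeps the Mac awake for as long as the wrapped command runs:
#   -s  prevent system sleep (only honored on AC power)
#   -i  prevent idle sleep
#   -m  prevent disk sleep
caffeinate -s -i -m curl -L -C - -o wikidata.jnl 'https://dumps.example.org/wikidata.jnl'
```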
[17:31:52] dr0ptp4kt: if you still have the error of the previous error that would be helpful, we seem to have munged & loaded the 20230918 dump in hdfs (haven't double checked that the data is ok tho, just saw the partition) [17:32:04] s/previous error/previous run/ [17:40:21] dinner [17:43:28] 🙏 [18:12:10] back [18:23:05] FWiW, gzipping the main wikidata JNL file takes ~4 hours using pigz at maximum compression rate, and we end up with a ~400 GB file compared to 1.2 TB uncompressed [18:29:16] * ebernhardson mutters are the awkwardness of helm templates [18:29:34] you can't do something like {{ range [.foo, .bar] }}, because that would be too easy :P [18:29:35] d-causse do you have any tickets/patches up yet for pipeline filtering (what we talked about with our experimental app affecting kafka-main)? [18:33:53] inflatador: gehel: in https://meet.google.com/eki-rafx-cxi?authuser=1 [18:34:51] joining [18:36:45] ah, after a very quick scan of the reload, i suspect i should thow --skolemize at the thing for the file from the 18th perhaps, leaning on a hunch looking at the reload cookbook. i haven't walked the call graph of munging yet, but i can pick back up from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/dags/import_ttl.py to see if there's anything noteworthy where import [18:36:58] ebernhardson: David was saying that you should already have deployment rights to wikikube by being in the wmf deployment group. Not sure how all that works, but could you have a look? [18:37:00] would succeed in one place for the dag but not work so nicely for the linear scan [18:37:24] inflatador: yes this is https://phabricator.wikimedia.org/T347515, made a quick patch to workaround the issue at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/961858 [18:37:43] gehel: i will poke around and see [18:37:58] apparently i don't know how it works either :) [18:38:40] * ebernhardson is continuously surprised how helm can be so popular...and yet not even give line numbers that align with the code we write in its error messages... [18:38:48] the thing to check is also +2 perms on deployment-charts, Peter does not seem to have them but he probably should [18:39:05] inflatador just saw your note on the compression. that's not bad at all! how is your download looking btw, did it suceed/fail? (sorry if i missed a note) [18:39:32] dcausse: huh, apparently i do have +2 there and didnt notice [18:40:25] not sure where they come from I assumed that this is because we're in the deployment group but no since Peter is in this group but does not have them [18:41:42] ebernhardson: if you to test/play with helm deployments fell free to use the rdf-streaming-updater in staging [20:06:15] dr0ptp4kt still going, I'm at 68%, axel says ~3h left [20:14:49] d-causse thanks for the link, didn't wanna ping ya in the middle of your night. LMK if/when the code is ready to deploy, happy to help if necessary [20:14:56] break, back in ~20 [20:19:49] inflatador: i merged davids patch, would need to run the processes to release the jar (and docker img?), and then update the staging release to provide the new arg [20:57:46] ebernhardson cool, I'll take a look [21:03:42] OK, jar she's building [21:19:09] * dr0ptp4kt slightly jealous of inflatador's stable connection [21:35:10] don't jinx me ;P Still about 90 minutes to go [21:39:40] OK, the jar build is done. 
[18:29:16] * ebernhardson mutters at the awkwardness of helm templates
[18:29:34] you can't do something like {{ range [.foo, .bar] }}, because that would be too easy :P
[18:29:35] d-causse: do you have any tickets/patches up yet for pipeline filtering (what we talked about with our experimental app affecting kafka-main)?
[18:33:53] inflatador: gehel: in https://meet.google.com/eki-rafx-cxi?authuser=1
[18:34:51] joining
[18:36:45] ah, after a very quick scan of the reload, i suspect i should throw --skolemize at the thing for the file from the 18th, perhaps, leaning on a hunch from looking at the reload cookbook. i haven't walked the call graph of the munging yet, but i can pick back up from https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/blob/main/search/dags/import_ttl.py to see if there's anything noteworthy where import
[18:36:58] would succeed in one place for the dag but not work so nicely for the linear scan
[18:37:00] ebernhardson: David was saying that you should already have deployment rights to wikikube by being in the wmf deployment group. Not sure how all that works, but could you have a look?
[18:37:24] inflatador: yes, this is https://phabricator.wikimedia.org/T347515; made a quick patch to work around the issue at https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/961858
[18:37:43] gehel: i will poke around and see
[18:37:58] apparently i don't know how it works either :)
[18:38:40] * ebernhardson is continuously surprised that helm can be so popular... and yet not even give line numbers that align with the code we write in its error messages...
[18:38:48] the thing to check is also +2 perms on deployment-charts; Peter does not seem to have them but he probably should
[18:39:05] inflatador: just saw your note on the compression. that's not bad at all! how is your download looking btw, did it succeed/fail? (sorry if i missed a note)
[18:39:32] dcausse: huh, apparently i do have +2 there and didn't notice
[18:40:25] not sure where they come from; I assumed it was because we're in the deployment group, but no, since Peter is in this group but does not have them
[18:41:42] ebernhardson: if you want to test/play with helm deployments, feel free to use the rdf-streaming-updater in staging
[20:06:15] dr0ptp4kt: still going, I'm at 68%, axel says ~3h left
[20:14:49] d-causse: thanks for the link, didn't wanna ping ya in the middle of your night. LMK if/when the code is ready to deploy, happy to help if necessary
[20:14:56] break, back in ~20
[20:19:49] inflatador: i merged david's patch; we would need to run the processes to release the jar (and docker img?), and then update the staging release to provide the new arg
[20:57:46] ebernhardson: cool, I'll take a look
[21:03:42] OK, jar she's building
[21:19:09] * dr0ptp4kt is slightly jealous of inflatador's stable connection
[21:35:10] don't jinx me ;P Still about 90 minutes to go
[21:39:40] OK, the jar build is done. Let's see if I can remember how to build a new docker img
[21:40:19] https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater
[21:51:28] hmm, looks like I don't have +2 perms on https://gerrit.wikimedia.org/g/operations/docker-images/production-images
[22:03:46] inflatador: hmm
[22:04:26] I swear I've published an image there before. Maybe you have to use `docker-pkg` and it does all the git stuff for you
[22:04:33] you certainly used to: https://gerrit.wikimedia.org/r/c/operations/docker-images/production-images/+/908486
[22:05:48] the ACLs say inherit from operations/puppet and allow ldap/ops
[22:05:57] so... you really should have access
[22:06:28] I guess it's a GUI thing
[22:06:49] it has +2 in the submit review modal, but not +2 on the main page?
[22:06:49] if I hit the "reply" button it seems I still have the perms
[22:06:53] Y
[22:07:01] i've seen that before... but not sure what causes it
[22:07:38] Oh well, I'm going to shove off for the day... will build the flink and the new rdf-streaming-updater images tomorrow
[22:08:56] actually, thinking about it, there's a reasonable chance the difference is V+2
[22:09:15] would have to test, but i think it only gives the CR+2 button if there is a V+2 vote already
[22:09:23] oh, and there probably needs to be a patch to production-images for the rdf-streaming-updater stuff too... and if/how https://gitlab.wikimedia.org/repos/search-platform/flink-rdf-streaming-updater is part of the process...
[22:31:37] * ebernhardson wants higher-order functions in helm... not going to happen :P
[22:35:24] * ebernhardson was wrong, helm has a map function. hooray!
[22:38:15] ooooh, that's news to me too
[22:38:37] still determining if it does what i need, maybe :)
[23:01:01] no, it has nothing useful. I just need to write the same code twice :P
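For what it's worth, the Sprig functions bundled with Helm do include `list`, which gets close to the `{{ range [.foo, .bar] }}` wished for above. A minimal sketch, with hypothetical chart values `foo` and `bar`, dropped into a throwaway template file:

```bash
# Sprig's `list` builds an ad-hoc list that `range` can iterate over, which
# is about as close as Helm templates get to `range [.foo, .bar]`.
# .Values.foo / .Values.bar are hypothetical values for illustration.
cat <<'EOF' > templates/demo.yaml
{{- range list .Values.foo .Values.bar }}
item: {{ . }}
{{- end }}
EOF
```

Rendering with `helm template . --set foo=x --set bar=y` should then emit `item: x` and `item: y` from the one loop body.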