[01:36:17] what time are the office hours going to be?
[07:15:18] O/
[07:18:10] o/
[08:53:13] dcausse: I just saw https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/961144 - I assume that SUP should also default to the committed kafka topic offsets?
[08:54:16] pfischer: possibly, but with an auto-reset to the end of the topic?
[08:54:54] well, might depend on the pipeline actually
[08:55:22] the producer might prefer the end of the topic, the consumer the beginning?
[09:01:56] 👀
[09:47:17] Oh cool - you're using kafka to create your index? That's what I was going to show you: I made a java app that downloads the dumps, parses the wikitext directly, adds vectors to each document, and also adds a bunch of NER NLP fields to each doc, and everything is done through kafka and grpc. The NLP and embeddings are done via gRPC services and the pipelines are done through kafka. Then at the end, you can have any number of datastores consume the data.
[09:48:11] all the pipeline steps are protocol buffer messages too, which are validated through a kafka schema registry
[09:49:13] I was able to get my laptop to calculate about 100 vectors a second, although I have to validate the data a bit more to make sure it was truly thread-safe
[10:36:50] I'd love to learn more about how you set up the pipeline in kafka. I came up with a design, but not sure if it's a good one.
[10:44:25] kristianrickert: very curious to learn more about your setup, the office hours are 15:00-16:00 UTC today
[10:45:09] lunch
[13:17:10] o/
[13:28:58] Just saw your update on T326914, does that mean we have to update kafka or use an older version of flink?
[13:28:59] T326914: Migrate the WDQS streaming updater from FlinkKafkaConsumer/Producer to KafkaSource/Sink - https://phabricator.wikimedia.org/T326914
[13:34:36] dcausse: curious too....it works for us? can't totally recall why we used that version of kafka clients but it was the one that worked?
[13:36:27] inflatador, ottomata: actually it works, it's just our code base that was pulling an old version of kafka-clients (2.4.1 instead of the 3.2.3 version that the flink connector expects)
[13:37:15] it was "almost" compatible since most of it worked, except for some edge cases I suspect when recycling internal kafka producers
[13:37:59] will do one last quick test and ship a patch
[13:39:34] ottomata: the kafka-clients v2.4.1 certainly works, but what's not stable is how flink might use it in the context of transactional producers (EOS)
[13:39:50] flink wants 3.2.3
[13:40:22] sigh... zoom's crashing my browser tab every 5min :(
[13:40:39] LOL, I still haven't been able to log in with Zoom
[13:41:24] I sent an email to tech support, we'll see what happens I guess
[13:42:19] trying chrome :/
[13:43:20] I guess I'll try that too, struck out w/other browsers
[13:52:49] oh huh
[13:52:56] dcausse pfischer do y'all want to do pairing today, or are y'all doing wikiconnect?
[13:53:34] inflatador: listening to the wikiconnect session but happy to pair if you want
[13:55:58] dcausse that's OK, let's call it off for today
[13:56:06] sure
[13:56:14] ^^ pfischer
[13:56:31] if you need any code reviews or deploys, you know where to find me ;P
[13:56:49] thanks! :)
[14:02:31] inflatador: thanks, I'm currently attending the keynote
[14:02:56] aaandd...chrome doesn't work for me either /shrug
[14:07:25] If any of y'all can find a US call-in option in the zoom UI, maybe DM it to me along with the access code?
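A minimal sketch of the committed-offsets default discussed above ([08:53:13]-[08:55:22]), assuming the KafkaSource API from flink-connector-kafka; broker, topic, and group names here are illustrative, not the actual SUP configuration:

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

public class SupOffsetsSketch {
    /** Start from committed group offsets; fall back to the given strategy when none exist yet. */
    static KafkaSource<String> sourceFor(String topic, String groupId, OffsetResetStrategy fallback) {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // illustrative
                .setTopics(topic)
                .setGroupId(groupId)
                .setStartingOffsets(OffsetsInitializer.committedOffsets(fallback))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }

    public static void main(String[] args) {
        // Producer side of the pipeline: only new events matter, so fall back to the end of the topic.
        KafkaSource<String> producerSide =
                sourceFor("mediawiki.page-change", "sup-producer", OffsetResetStrategy.LATEST);
        // Consumer side: must not miss updates, so fall back to the beginning of the topic.
        KafkaSource<String> consumerSide =
                sourceFor("cirrus.update-pipeline", "sup-consumer", OffsetResetStrategy.EARLIEST);
    }
}
```

With `committedOffsets(...)` the fallback strategy only kicks in when the consumer group has no committed offset yet, which matches the "auto-reset to the end of the topic" idea for the producer and "the beginning" for the consumer.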
[14:08:10] inflatador: some pom to review: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/963322/ if you want :)
[14:09:07] inflatador: looking but there are multiple sessions
[14:09:22] ;)
[14:17:00] +1'd the pom patch
[14:19:10] dcausse: +2
[14:21:02] thanks!
[14:50:35] new partman recipe is ready for review: https://gerrit.wikimedia.org/r/c/operations/puppet/+/963328
[14:51:12] hah, I was just looking at phab
[14:52:28] I had the same PXE boot issues as you earlier ;)
[14:52:56] fascinating, I'm wondering how widespread it is
[14:53:09] did you have to change the 2nd NIC in the BIOS? That's what confused me
[14:53:19] as in the NIC BIOS, that is
[14:53:51] that was correct for me, what I had to set was the boot protocol from the ctrl-s boot menu
[14:54:12] it was set to none, I set it to PXE, and have no idea how it got to that state
[14:54:32] Y same here, but it was the 2nd NIC in CTRL-S
[14:57:17] ack
[14:58:15] you might need to run the provision cookbook, as all the BIOS config is nowadays automated and should not be done manually
[14:58:18] get in touch with dcops
[14:59:28] that part should be (theoretically) done by DC Ops already per T342538. But I did have to update the FW already
[14:59:28] T342538: Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538
[15:00:55] volans is there a way for me to tell if I need to run the provisioning cookbook (netbox?) or should I just hand it to DC Ops after reimage?
[15:01:42] I'm just saying that if the NIC's PXE setting was wrong, it would be better to run the provisioning cookbook unless the host has too old a version of iDRAC
[15:02:49] with the appropriate options
[15:02:57] OK, got it. I'm pretty sure they did run it before handoff but I can check the SAL. All the boxes were checked in that ticket, but I can confirm that the NIC BIOS was wrong, so maybe they missed something
[15:05:52] another partman error, boo
[15:06:44] `mdadm: cannot open /dev/sda3: No such file or directory`. hmmm
[15:10:06] inflatador: you should include the raid0.cfg bits too
[15:10:12] elastic*) echo partman/standard.cfg partman/raid0.cfg partman/raid0-2dev.cfg ;; \
[15:10:18] vs
[15:10:19] cloudelastic100[7-9]|cloudelastic1010) echo partman/standard.cfg partman/raid0-3dev.cfg ;; \
[15:10:39] note that partman/raid0.cfg is missing
[15:11:19] godog got it, patch forthcoming
[15:11:40] * inflatador misses kickstart ;(
[15:12:11] you should have seen what it looked like before the standard partman recipes
[15:14:25] I believe you...just looking at the wikitech page for partman says a lot about the quality of that tool ;). Seriously though, thanks for helping out w/that
[15:14:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/963334 is the new patch
[15:15:09] LGTM
[15:21:57] OK, here goes the reimage...
[16:00:14] So far so good...thanks go-dog and v-olans!
[16:02:15] workout, back in ~40
[16:32:41] getting `Nagios_host resource with title cloudelastic1007 not found yet` from the reimage. Assuming that's b/c the provisioning cookbook was never run. Handing back to DC Ops
[16:40:22] also back
[17:17:11] inflatador: would you have a sec for https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/963383 ?
[17:20:34] nvm, +2ed
[17:30:10] sorry, was afk
[17:31:31] np!
[17:31:43] phew... finally working again
[17:32:09] excellent!
[17:32:15] inflatador: you can resume your testing on the dse-k8s flink job whenever you want
[17:32:44] dcausse thanks! Heading to lunch soon, but will test when I get back
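To make godog's diagnosis at [15:10:06]-[15:10:39] concrete: the cloudelastic entry in netboot.cfg pulls in the per-device recipe but not the shared partman/raid0.cfg that the elastic entry has. The follow-up patch presumably ends up with something along these lines (a reconstruction from the lines quoted above, not the actual content of change 963334):

```
cloudelastic100[7-9]|cloudelastic1010) echo partman/standard.cfg partman/raid0.cfg partman/raid0-3dev.cfg ;; \
```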
[17:55:34] lunch, back in ~45
[18:25:17] I just got a break from work now
[18:25:29] are the open office hours still going on or did I miss out?
[18:25:41] kristianrickert: it was earlier :(
[18:26:03] bummer.. I'll just have more to show next month then
[18:26:21] kristianrickert: typically 8am-9am pacific time. am curious to see what you're doing with the augmented data
[18:26:37] but wildly curious about knowing more about how you're setting up the kafka pipeline
[18:26:53] so what I'm doing with the wiki data:
[18:27:12] it's all open source, although it's not particularly well documented: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater
[18:27:19] One problem I regularly have at work is an inability to quickly have a massive set of clean data. Also, we have a lot of data scientists at work
[18:27:51] and I designed a pipeline that parses the wiki data in a super clean way that makes it easy to insert grpc services during the document enrichment process
[18:28:14] i suppose that's similar to elasticsearch ingestion pipelines? We never used those though
[18:28:14] so if a data scientist tells me they have a new idea with text data, I can quickly test it with wikipedia data
[18:28:19] yes!
[18:28:22] very similar
[18:28:27] but I'm making it agnostic
[18:28:31] sure
[18:28:34] part of the problem I deal with
[18:28:42] bigwig managers who get gartner group reports
[18:29:00] and they go "why don't we just use ____" (fill in blank from what they read on wired.com)
[18:29:09] lol, yea of course :) The devil is in the details
[18:29:14] so I can just take the kafka sink data
[18:29:19] and put it in any engine I want
[18:29:31] and do an apples-to-apples comparison with the same data enrichment
[18:29:40] nice, sounds convenient
[18:29:44] I'm just about done with this btw
[18:30:02] I have it inserting and calculating vector embeddings for semantic search
[18:30:27] and doing NER on the body text for organizations, locations, persons, and dates
[18:30:51] it's a good example of how to create a gRPC service to integrate into the pipeline
[18:31:02] for ours there really isn't much of an enrichment step, we accept "primary" events from mediawiki about things like edits, deletes, etc. These don't really do augmentation, they simply make an api call to mediawiki to fetch whatever the elasticsearch doc representation is
[18:31:22] ahhh.... so that's another question I have
[18:31:32] right now I make a dump ID when I parse the data dump
[18:31:36] there are also secondary events that turn into updates, but those are specific to a single field we call weighted_tags which lets other teams add flags to pages that are then recalled during search (such as a has-link-recommendations tag)
[18:31:41] but there are a lot of ways to find deletes
[18:31:50] (go on - sorry )
[18:32:17] hmm, i'm not sure if you have access to the same event streams we do
[18:32:27] oh certainly not
[18:32:48] how i'm doing it - kafka just keeps the latest version of the kafka ID of the message
[18:32:53] deletes in particular are difficult because there is an edge case related to how things are suppressed (edits that get completely hidden)
[18:33:45] so, your thing with a dump id, i think i don't quite follow.
[18:33:47] so for now I just keep the same data from the previous message and update the dump date if it didn't change. But it'll re-process the enhancements if it's changed, because it just looks at the last updated date
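A rough sketch of the re-processing gate kristianrickert describes at [18:33:47]: re-run the expensive enrichment steps only when the revision (or last-updated value) for a page has moved. This is not his implementation; all names are made up, and the plain map stands in for whatever keyed state the real pipeline keeps (compacted topic, Flink keyed state, etc.):

```java
import java.util.HashMap;
import java.util.Map;

/** Decides whether a page needs the expensive NLP/embedding enrichment again. */
public class EnrichmentGate {
    // In-memory stand-in for durable keyed state; not concurrency-safe as written.
    private final Map<String, Long> lastSeenRevision = new HashMap<>();

    /** Returns true when enrichment should run for this page and revision. */
    public boolean needsReprocessing(String pageId, long revisionId) {
        Long previous = lastSeenRevision.get(pageId);
        if (previous != null && revisionId <= previous) {
            return false; // already enriched this revision (or a newer one): skip the expensive steps
        }
        lastSeenRevision.put(pageId, revisionId);
        return true;
    }
}
```

The caveat raised right afterwards still applies: revision ids miss content changes that come from template re-renders, so a gate like this is a cache, not a correctness guarantee.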
[18:34:06] I'm not 100% sure if it'll work to be honest - the general gist though
[18:34:14] is that you do a full dump
[18:34:18] could look at version numbers, rev ids basically, but those don't capture all possible content changes
[18:34:22] and I get all the IDs of that dump.
[18:34:23] yeah!
[18:34:27] ahhh
[18:34:30] there's the crux
[18:34:34] they capture most of the important changes though, the other changes come through template rerenders
[18:34:52] so what changes are in the template rerenders?
[18:35:08] if you edit a template that's on a million pages, a million pages will be rerendered and updated in our search
[18:35:20] I mean, this would be "good enough" - I can even do a lower-priority topic instead for those that have the same revision IDs I guess?
[18:35:29] ohhhhh
[18:35:47] so I'm parsing the wikitext format in the dump and turning it into plain text
[18:35:48] it will take time though :) That gets spread out over time through the job queues
[18:35:52] so I likely am not missing anything
[18:36:04] yea in that case you have the normal bits
[18:36:24] so that should work, and if not I'll still have caching keyed on the revision ID
[18:36:31] so reprocessing can still be fast
[18:37:07] Calculating the vectors is blazing fast though, it's the NLP step that is a bottleneck
[18:37:37] but since they're all grpc microservices, you can in theory scale it so kafka is the bottleneck.
[18:37:59] that's normal :) We recently deployed a change to our analysis chain that made indexing three times slower. It's easy to get expensive there :) That was basic analysis stuff though that we were able to rewrite into a custom elastic plugin
[18:38:26] it's just too easy to spend a bunch of compute processing text
[18:38:31] yeah the revision ID is going to help me a lot with reprocessing
[18:38:44] especially if it's one of those evil dynamic languages
[18:39:10] https://github.com/krickert/search_indexer
[18:39:13] this is the project
[18:39:17] in this narrow case it amounted to replacing a generic regex-based analysis chain with direct java code
[18:39:19] I gotta update the docs
[18:39:38] yeah, that's why I ran with grpc
[18:39:51] our data scientists do everything in python
[18:40:04] yup, pretty normal
[18:40:16] like the vector calculations - I can only get my machine at home to do like 10/second. In java I got it to do 200/second
[18:40:51] so the data scientists tell me how to do it in python and if it sucks, I just redo it in java and give them a grpc client
[18:40:56] nice, i wonder if the jvm managed to magic up some auto-vectorization or some such
[18:41:10] It's all in the thread model
[18:41:13] can be a huge difference if the compiler uses the actual advanced cpu features :)
[18:41:21] python threading is very shitty
[18:41:24] ya
[18:41:37] and java can use threads and the GPU really effectively
[18:41:46] either that or my code is buggy and a lie
[18:41:47] oh, if python was doing a single-threaded computation sure that would be terrible, i guess i was assuming python punted that out to a C module
[18:42:06] yeah, I just asked google bard to do it in java
[18:42:16] not kidding, and was shocked to see it was accurate and works
[18:42:24] same vectors as the python one I had
[18:42:41] that is surprising, i ask those things something a few times a day. But sometimes i wonder why, because it's terrible almost as often as it's good
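On the thread-model point at [18:41:10]-[18:41:37], the usual Java shape is to fan the embedding calls out over a fixed pool so throughput is bounded by the embedding service rather than by a single thread. A generic sketch; the `embed()` call is a stand-in for whatever gRPC client is actually used, not part of the project above:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ParallelEmbedder {
    private final ExecutorService pool =
            Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

    /** Placeholder for the real gRPC embedding call. */
    private float[] embed(String text) {
        return new float[384]; // placeholder dimensionality
    }

    /** Embeds all chunks in parallel, preserving input order. */
    public List<float[]> embedAll(List<String> chunks) throws InterruptedException {
        List<Callable<float[]>> tasks = chunks.stream()
                .map(chunk -> (Callable<float[]>) () -> embed(chunk))
                .collect(Collectors.toList());
        // invokeAll blocks until every chunk has been embedded.
        return pool.invokeAll(tasks).stream()
                .map(future -> {
                    try {
                        return future.get();
                    } catch (InterruptedException | ExecutionException e) {
                        throw new RuntimeException(e);
                    }
                })
                .collect(Collectors.toList());
    }
}
```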
[18:42:43] but i've not tested the 200/sec one yet with a large corpus
[18:42:57] oh i'm not a python dev
[18:43:07] so it's likely that I'm dumb with enhancing the code
[18:43:16] possibly :) I got the same feeling learning helm recently
[18:43:20] but I've been a java nerd since the 90s
[18:43:32] Yeah, solr has a helm chart project now
[18:43:34] I think they all do
[18:43:39] that's where everything is going
[18:43:45] my work doesn't allow helm
[18:44:01] we are in the process of migrating, i think mediawiki-on-k8s is serving like 10% of traffic now
[18:44:04] it's not 2 decades old and gartner group didn't come up with propaganda for it yet
[18:44:18] oh nice
[18:44:58] I figured out a few fun cases for the editing data I wanna play with
[18:44:59] for various definitions of nice :) Our helm tooling is far behind what we've had in puppet for the last decade (and to be fair, i'm more familiar with both puppet and our specific puppet)
[18:45:17] we used puppet when I was at etsy
[18:45:23] currently i'm duplicating a lot of random things around in places because helm doesn't have the right data available in the right formats
[18:46:13] well if it helps - I'm dealing with an IT department that doesn't even allow k8s because "kubernetes bad"
[18:46:31] you have AB testing in place there right?
[18:46:36] it might be :) But the promise of on-demand compute is too great
[18:46:55] you saying 10% made me assume that you likely do
[18:46:56] we haven't run any AB tests in a while, back when we were focused on relevance work we put together some bits and it's still deployed
[18:47:12] ahh that's separate, i think that 10% happens not within AB testing but somewhere in our traffic layers
[18:47:24] got it
[18:47:30] this is all insanely interesting
[18:47:35] sadly mediawiki doesn't have a unified AB testing platform, just metrics collection, so bucketing and all that is however you want to do it
[18:47:44] no one does :)
[18:51:45] back
[19:31:22] lunch
[20:05:24] Patch up for the new graph split hosts. It just stops/masks the updater so it doesn't muck with the data: https://gerrit.wikimedia.org/r/c/operations/puppet/+/963404
[20:06:17] I mainly just copied the puppet code we wrote for setting up NFS dumps (T325114), feel free to suggest changes
[20:06:18] T325114: Update wdqs/wcqs data reload cookbook to use NFS mounts instead of external site and autodetect kafka timestamp from dumps - https://phabricator.wikimedia.org/T325114
[20:07:54] will probably have to update the wdqs cookbooks too, looking at that now
[20:23:25] back
[20:24:04] * ebernhardson goes back to wondering what exactly about helm-linter fails when i don't define an `eqiad` environment
[20:41:12] ebernhardson still working on https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/959059 ? I was gonna give my +1 but can wait if you're still hackin' away
[20:41:39] inflatador: i think that one (and the parent one) are good to go
[20:42:08] the streaming updater one should also be close...the last step i'm doing now is to convince it that it's ok to only define the staging-eqiad environment (so we don't accidentally deploy with the prod config)
[20:42:26] cool, I'll refrain from my bikeshed-level comments ;P
[20:45:55] other random fun things...i was trying to understand the test suite so i could see why it's failing. It turns out if you change any files in the test suite, the test suite flips modes and runs all tests instead of just the tests for things that were changed
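Going back to the A/B-testing aside at [18:47:35]: without a unified platform, bucketing really is "however you want to do it", and a deterministic hash of (user, experiment) is one common minimal approach. The sketch below is purely illustrative, not how any Wikimedia experiment is actually bucketed:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class AbBucketer {
    /** Returns a stable, roughly uniform bucket in [0, buckets) for this user and experiment. */
    public static int bucketFor(String userToken, String experiment, int buckets) {
        CRC32 crc = new CRC32();
        crc.update((experiment + ":" + userToken).getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % buckets);
    }

    public static void main(String[] args) {
        // Example: 50/50 split for a hypothetical "new-ranker" experiment.
        int bucket = bucketFor("session-abc123", "new-ranker", 2);
        System.out.println(bucket == 0 ? "control" : "treatment");
    }
}
```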
[20:46:08] the implementation (ruby) files i mean
[20:47:54] ಠ╭╮ಠ
[20:51:01] I have to drop off my kids, but will merge once I get back in ~25m
[21:17:58] and the reason is...there is a list of envs they will look for buried in the test suite, and staging-eqiad isn't in that list :P
[21:18:50] staging-eqiad is in other areas though...curious. For example on the prod hosts the general-staging.yaml is just a symlink to general-staging-eqiad.yaml, and there is a matching staging-codfw. It seemed more sensible to be explicit
[21:19:00] * ebernhardson will just keep calling it staging then
[21:43:54] back, but it took a lot longer than I thought
[21:52:45] ebernhardson I gave +1 but wanted to wait until g-modena got back before merging. Just want to make sure his app (which uses the same chart w/native kubernetes HA as opposed to zk) wouldn't need updating
[21:53:46] if you're sure it won't, I'm fine w/merging nwo
[21:53:49] or now
[21:54:22] inflatador: his could be updated, but it would change some of the paths. The patch as-is adds an optional feature so nothing in his app will change
[21:54:48] that can be verified through the diff in CI output as well, only the chart version number changes
[22:00:46] ebernhardson understood, just merged the first two in the chain
[22:28:58] in theory, the helm chart for cirrus-streaming-updater should be ready for review again. With any luck maybe we can try and fail deploying it tomorrow or friday :)
[22:30:44] it's trimmed down to only the staging release, consumer pointed to relforge (need to create the index there, takes a few min), it will be reading from the staging kafka cluster so no clue if there are any events there
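For readers following the staging-eqiad thread at [20:24:04] and [21:17:58]-[21:19:00]: a helmfile that deliberately defines only a staging environment looks roughly like the sketch below. Environment and file names are assumptions for illustration, not the actual deployment-charts layout:

```yaml
# helmfile.yaml (illustrative, not the real deployment-charts file)
environments:
  staging-eqiad:                       # the only environment the chart defines on purpose,
    values:                            # so the prod config cannot be deployed by accident
      - values-staging-eqiad.yaml
  # eqiad / codfw intentionally omitted; the hard-coded list of environments
  # inside the test suite is what trips over this.
```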