[09:55:28] dcausse: I made an attempt at listing impacts on https://docs.google.com/document/d/1Hf_JEjCay55x9ZKlFmHYYqMx00sOjRDNXnIVnGyBgqs/edit as per our discussion. Feedback welcomed! [09:58:26] gehel: thanks, I'll take a look shortly [10:25:34] gehel: Erik made an effort to get me archiva deployment rights, currently this requires your approval. Could you have a look please: https://phabricator.wikimedia.org/T352475 ? [10:26:07] pfischer: done [10:45:13] gehel: Thanks! [11:03:56] lunch [11:18:13] dcausse: Seems like serviceops would prefer to treat `page_rerender` cautiously and go with the (per-broker) standard settings: 1 partition, 7 days retention. The reasoning is that there is no way to configure topics with configuration as code: https://phabricator.wikimedia.org/T351503#9378373 - I don’t want to be too pushy about it, although those settings (partitions + compaction) seem reasonable and valuable. What do you think? [12:12:46] Hi! Is there an Office Hour this week, or will that happen next year again after the holidays? It only lists the one in November here: https://etherpad.wikimedia.org/p/Search_Platform_Office_Hours [12:51:47] pfischer: I think that they worry about the added load on the kafka brokers so perhaps we'd have to prove that it's sustainable on another kafka cluster (jumbo?) before being eligible for kafka-main [12:53:29] for our case it does seem that we don't strictly need compaction yet so I feel that requiring this option on kafka-main might delay things on our side so perhaps better to start without compaction? [12:54:16] Kristbaum: there will be one, I haven't updated the page and sent the invite yet [12:56:20] gehel Ah thanks! [12:56:21] I also feel that compaction might be valuable on other topics so perhaps the other way would be to start another topic related to the benefits of compaction and ask event platform to lead this work? [12:56:37] s/topic/ticket/ [12:57:22] so that we don't block progress on the SUP but still have something in the backlog regarding compaction [13:38:23] gehel: have time for a quick chat? [13:43:26] dcausse: meet.google.com/txz-ovdz-zcx [13:53:45] pfischer: on the other hand I'm unsure why you struck out the increase-partitions part? I think that part is actually required by service-ops to better balance the topic? [14:12:37] dcausse: because of Luca's request “Shall we start with only a partition count change (if needed) and monitor the size of the Kafka topic after the first traffic increments?” https://phabricator.wikimedia.org/T351503#9378280 [14:15:00] o/ [14:15:37] dcausse: To me that falls under the same restriction of “we don’t want per topic config as long as we do not have tooling for it”. So maybe we have to create the need first. [14:15:56] o/ [14:17:04] pfischer: ok I thought that we'd need the increased partition count anyways but if they're ok with 1 I'm fine with it [14:17:49] the inability to have that config in puppet is not a blocker for partition count increase I think, there are many topics with multiple partitions [14:30:31] dcausse: Alright, I’ll ask for that change then. [14:31:17] Side note: brouberol has a few ideas on how to better manage kafka configuration in puppet. But that's going to wait until at least next Q.
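For context on the settings being weighed here, this is roughly how they map onto stock Kafka per-topic overrides. A sketch only: the broker address is a placeholder, older brokers may need --zookeeper instead of --bootstrap-server, and the WMF hosts wrap these tools in a `kafka` helper (see the commands run later in this log).

  # the standard settings serviceops suggests, written as explicit per-topic overrides:
  # time-based cleanup with 7 days of retention (604800000 ms)
  kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 \
    --add-config cleanup.policy=delete,retention.ms=604800000

  # the compacted variant being discussed: keep only the newest event per key
  kafka-configs.sh --bootstrap-server localhost:9092 --alter \
    --entity-type topics --entity-name eqiad.mediawiki.cirrussearch.page_rerender.v1 \
    --add-config cleanup.policy=compact

  # partition count is a separate, increase-only alter (5 is what eventually gets requested later in this log)
  kafka-topics.sh --bootstrap-server localhost:9092 --alter \
    --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5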
[14:32:08] Yes, we had a discussion on slack and he mentioned that https://wikimedia.slack.com/archives/C02291Z9YQY/p1701689593142679 [14:32:12] I think the hesitation from Luca is more a mix of both issues (we've never done it before and we're generally very conservative with kafka-main + we have no sane way to store per topic config) [14:33:01] Yes, totally understandable from a maintenance/responsibility perspective [14:49:25] there's a bit of a build vs buy question in my mind. It's simple enough to write a script that reads configuration and applies it to kafka, but there are also tools out there that do that. Oftentimes, they also do more, meaning I'm always going back and forth between building and adopting a 3p tool. [14:50:04] in any case, having configuration management for kafka topics is something I'd really like to see [15:12:47] gehel, dr0ptp4kt I should be done with the milestones explanation, feedback welcome [15:21:21] should we call a mtg with Luca/service ops to discuss further? [15:29:12] \o [15:29:35] .o/ [15:35:11] hmm, actually the permissions ticket might not be necessary anymore, i wanted to make sure we didn't forget and filed the ticket, but i think brian already did the thing (grant ldap group access) [15:36:45] o/ [15:40:05] spent some time with a local archiva instance talking to the labs-ldap instance, couldn't figure out how to make archiva work with ldap even with the prod archiva.xml configuration available. That system is weird :P [15:40:45] or really, it's just incredibly awkward and provides no help configuring it properly even though ldap is extremely flexible and needs lots of knobs turned to query the right data out [15:41:55] :/ [15:43:05] i suspect though that the roles system in prod is completely borked, my theory is it can find usernames but it can't find the group memberships, so only users that exist in the archiva database can upload jars [15:43:11] not users in the ldap db [15:43:53] so we have to manually add users to the archiva db? [15:44:12] oh boy ;P [15:44:30] that's my best theory after working with it friday, we either need to use the jenkins CI which has a password, or create some user(s) [15:44:33] oh, but hmm [15:44:50] actually, inflatador had additional menus in archiva though, which means it must be seeing his ops group membership :S [15:45:16] Guillaume also has the extra menus IIRC [15:45:49] hmm, so that implies it does work (partially), but i couldn't make it work locally :P [15:56:08] I'll be 8' late for triage [15:56:55] ebernhardson do I have additional menus in archiva? I can't remember checking [15:57:21] inflatador: oh, maybe i was remembering gehel. By extra i mean there is an admin menu to the left with ~15 or 20 options [15:57:31] what's the login to https://archiva.wikimedia.org ? Wikitech I'm guessing?
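The "build" side of that build-vs-buy question can be as small as a script that reads declared per-topic overrides and applies them with the stock CLI. A minimal sketch, assuming an invented topics.conf format of `<topic> <key=value,...>` per line; re-applying unchanged values is harmless, so it could run from CI without tracking state:

  #!/bin/bash
  # Apply declared per-topic config overrides (config-as-code sketch).
  set -euo pipefail
  while read -r topic configs; do
    kafka-configs.sh --bootstrap-server localhost:9092 --alter \
      --entity-type topics --entity-name "$topic" \
      --add-config "$configs"
  done < topics.conf

The "buy" alternatives bundle the same idea with dry runs and drift detection, which is where the back-and-forth comes from.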
[15:57:39] inflatador: ldap [15:57:51] * ebernhardson forgets where all ldap is used though :P [15:58:08] I'm not even sure I have those creds [15:58:17] everyone has an ldap password [15:58:57] ebernhardson ACK, I have it saved as "wikitech" [16:00:16] Here's what my archiva dashboard looks like: https://ewr1.vultrobjects.com/work/archiva.png [16:00:37] inflatador: yup, that's the full admin menu, so it's seeing your `ops` group membership [16:00:56] and curiously, archiva-deployers and ops are defined basically exactly the same way in the config, just with a different role attached [16:02:10] apt, missing triage [16:02:31] pfischer: triage meeting: https://meet.google.com/eki-rafx-cxi [16:41:25] ebernhardson: + inflatador: so am I getting this right: In addition to the LDAP cn, I have to show up in an archiva-internal DB? [16:42:49] pfischer: no, i don't think so. The archiva internal db isn't supposed to be the primary login, it's just that both the db and ldap are configured, users between the two are independent [16:43:22] pfischer: i was thinking that the problem is that archiva is not able to query the group membership out of LDAP, but with ops being correctly detected that's not it. Basically I turned up nothing useful :( [16:43:39] Okay, do we know any user that is able to deploy to archiva? Does it show upload logs to admins? [16:44:20] pfischer: best guess right now is to have ryan or brian run the deploy, or configure a job in jenkins and use the credentials there [16:44:30] since the ops group membership seems to be getting access [16:44:49] i suppose that shouldn't be too hard, it mostly amounts to having an appropriate java version installed and running a command or two [16:45:42] pfischer: also, i meant to put this in a review but somehow didn't submit it, it looked like the new bulk listener was only counting the bad requests but not bypassing the failure mechanism? I might have just missed that bit [16:48:19] ebernhardson: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/68/diffs#a79b9bb8aadef1a966b668db9b30b3b82b649dd7_0_104 inspect method just bumps counters. It does not fail if the response failed (partially) [16:49:31] pfischer: but the code being removed, BulkProcessorListener::afterBulk, was making sure the failures didn't cause the rest of the pipeline to fail [16:49:56] well, specific failures [16:52:02] Hm, that was necessary because we still invoked the bulk listener that was part of the elasticsearch-connector. With the patched version, this bulk listener delegates to the response inspector, which we replace (not wrap) [16:52:05] https://github.com/apache/flink-connector-elasticsearch/pull/83 [16:52:13] pfischer: oh! i didn't realize that [16:52:47] i guess I didn't notice the changes from david's initial proposal, will read this [16:54:43] Sorry, should have linked the upstream patch in my gitlab PR. So far, I didn’t get any feedback from the apache folks though (see https://issues.apache.org/jira/browse/FLINK-32028) [16:56:56] Are the whereabouts of the patched connector-base artefact blocking the metrics PR? (Would we just move one, once it lives in archiva?) [16:57:14] S/one/on/ [16:57:21] seems reasonable, i wonder if we need to do something to get noticed on the flink side, when things are split into many repos i never know if people are actually watching them [16:57:57] for example, with php, submitting a patch is likely to be ignored, have to ping the php-internals mailing list. Maybe flink has something similar?
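The "command or two" deploy mentioned above (16:44) would look roughly like the following for pushing an already-built jar to Archiva by hand. The artifact coordinates, repository id and URL path are placeholders; in practice the project's distributionManagement section plus credentials in ~/.m2/settings.xml would let a plain `mvn deploy` do the same thing.

  # placeholder coordinates; substitute the actual patched connector-base artifact
  mvn deploy:deploy-file \
    -Dfile=target/flink-connector-base-1.17.1-wmf1.jar \
    -DgroupId=org.apache.flink \
    -DartifactId=flink-connector-base \
    -Dversion=1.17.1-wmf1 \
    -Dpackaging=jar \
    -DrepositoryId=archiva.releases \
    -Durl=https://archiva.wikimedia.org/repository/releases/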
[16:58:43] pfischer: the exact location shouldn't block anything, as long as the build can source jars it should be fine [16:59:38] Good. On gitlab the PR got noticed by the original author of the JIRA ticket. But I’ll look for another channel to ping them. dcausse: you wrote to their mailing list some time ago, right? [17:05:34] pfischer: re kafka, from the ticket it looks like partitioning was done, and we are skipping the compaction for now? [17:05:50] Errand, back in ~40 [17:06:22] in that case we should be able to ship the refresh events today, and will try and separate out an estimate from the previous round for just these wikis so we can compare the estimate to the reality [17:09:43] scheduled for the window ~4h from now [17:37:56] hmm, disabling writes also disables the saneitizer. Fine for now, but might want to think about what's appropriate there [17:54:55] huh, i hadn't noticed this runner group in gitlab before: Cloud Runners, running in Digital Ocean K8s [17:56:40] back, but I'm gonna go ahead and take lunch as well...back in ~30 [18:18:18] hmm, looks like the test deployment will be something like 22% of the expected refresh events, probably reasonable for a ramping-up deploy or should we cut something to get more like 10%? [18:19:03] hard to tell... [18:19:59] that's around ~60 evt/s, seems reasonable to me [18:20:09] back [18:20:45] yea 60 doesn't seem too terrible. It would be ~14G per broker [18:22:05] did we increase the partition count? curious to see if flink auto discovered those [18:23:29] hmm, i suppose i can ask kafka directly, sec [18:24:21] doh, no it hasn't changed yet. I misunderstood the ticket :( In that case we really can't expand it yet [18:42:48] pfischer: missed your question, yes I'm subscribed to their ML, they also have a slack, perhaps try to ping a few folks there to get some attention? [18:58:59] gehel no rush, but I've got a puppet CR that would probably benefit from your feedback https://gerrit.wikimedia.org/r/c/operations/puppet/+/979983 [19:11:46] inflatador: looking [19:26:35] test test...having bouncer issues [19:26:45] inflatador: you made it back [19:27:44] for the moment ;) [19:28:22] changing pw, brb [19:29:58] randomly interesting, i pulled a count of cirrus links update jobs over the last week: https://phabricator.wikimedia.org/P54127 [19:32:09] i always forget how busy enwiktionary is [19:45:23] OK, that fun is over [19:45:42] hmm, deployed updated container for updater, can see new metrics in flink ui. i added the 5 buckets for rev_based_update to a graph, but they are all showing zeros so far. [19:46:02] awesome to your first sentence! [19:46:41] :) Will let it run for a bit, i think these graphs don't have any history but rather show the values since added to the ui. I suppose I'll let that run in the background while figuring out if this made it to prometheus [19:49:03] Appointment time! Back in ~90m [19:57:46] one graph went up to 1, so something is getting through :) I suspect something is missing though, the overall metrics report 568 events received, at least half since i added the graphs, but only 1 increment here [19:58:36] * ebernhardson will have to look at the patch closer [20:47:15] hmm, if these graphs are right the minimum available diskspace on a kafka-main broker is ~2TB. should be fine with 25% [20:49:13] ryankemper: if you have a minute, per https://phabricator.wikimedia.org/T351503#9378373 we can proceed with only repartitioning.
Need someone to apply https://wikitech.wikimedia.org/wiki/Kafka/Administration#Alter_topic_partitions_number to kafka-main in both DC's [20:50:54] topic is {eqiad,codfw}.mediawiki.cirrussearch.page_rerender.v1 [20:51:27] ebernhardson: yeah I can do that, what number should the partition # be set to? [20:52:49] ryankemper: oh, good question :) sec [20:53:40] ryankemper: looks like task is for 5 partitions, basically the same as the number of brokers [20:58:28] ebernhardson: ack, and I presume I'll be running this on one of the `kafka-main` hosts? [20:59:14] ryankemper: technically it doesn't matter, it talks over the network. But the actual command line tools only exist on kafka servers (and by default talk to localhost), so yes. [20:59:21] Commands will be as follows: [20:59:33] `kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5` [20:59:33] `kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5` [20:59:43] i believe so, yes [21:02:21] ebernhardson: okay, I'm ready to proceed if you are. We should be able to do something like `kafka configs --entity-type topics --describe | grep codfw.mediawiki.cirrussearch.page_rerender.v1` to validate the change afterwards [21:02:50] Oh I can avoid the grep by using `--entity-name` i think [21:03:35] i was using kafkacat -b kafka-main1005 -L | grep page_rerender [21:04:13] will also be curious to see if flink notices, i suppose could have tested locally but not super important here, can always restart flink [21:05:14] ebernhardson: okay, proceeding [21:07:12] ebernhardson: eqiad done. shall we check if flink-land sees the change? or should I just do codfw right away [21:08:25] ryankemper: go ahead and do both [21:08:47] ebernhardson: done [21:09:29] see the changes in kafka, flink doesn't seem to have noticed. will have k8s restart it [21:12:35] ebernhardson: remind me which k8s cluster those live in? Is it `staging-eqiad`? [21:12:53] I was thinking maybe `dse-k8s` but that doesn't seem to have a cirrus-streaming-updater ns [21:12:53] * ebernhardson curiously doesn't see restart options in helmfile :P [21:13:08] ryankemper: should be staging-eqiad [21:13:14] kube_env cirrus-streaming-updater staging [21:14:14] Cool, I'm looking at k8s right now. Usually deleting the pods is the k8s way of doing things but I'm guessing we're supposed to have a consumer pod per partition [21:14:24] So maybe we need to do a full teardown/reapply to get it to recognize that? [21:15:16] not really, because helm only defines the top level instances with hashes in their name, the taskmanager instances are "on demand" via the k8s operator [21:15:28] in theory those taskmanagers simply get created if flink decides it needs them [21:16:02] i suppose a full teardown/reapply can't hurt too much [21:16:15] but someday we will need better plans than that i imagine :) [21:16:31] definitely :P [21:16:51] oh, i need to make a change if we do that. we have `kafka-source-start-time: 2023-11-16T19:20:00Z` in the staging config, and that only does anything on a full destroy/restart [21:17:03] in which case, i guess we don't even need a destroy, just applying the change to remove that cli arg will restart it [21:19:00] ebernhardson: Does the helmfile define the operator?
Or how can we "restart" just the operator [21:19:06] (not sure if restart is really the right word here) [21:19:32] back [21:19:43] ryankemper: the operator is implemented by this repo, tbh i haven't looked too closely but there is a lot there: https://github.com/apache/flink-kubernetes-operator [21:20:32] ryankemper: and then we declare that we are using that by using the flink-app chart for our service [21:21:15] but, i guess we don't really restart the operator itself. TBH i don't know where exactly the operator even runs (is it in our containers, loaded into flink? or some control plane?) [21:22:05] but on our side we have jobmanagers, which talk to the operator, the jobmanager instances are the ones with hashes in their name and request taskmanager instances via the operator [21:22:25] so we basically define the jobmanagers, and then magic happens :) [21:24:03] My hunch is control plane ie that the operator is communicating with the apiserver [21:24:51] The operator runs in the system namespace [21:25:15] https://github.com/wikimedia/operations-deployment-charts/tree/master/helmfile.d/admin_ng/flink-operator [21:25:18] ahh, ok that makes sense [21:25:34] So maybe we just need a `kubectl delete pod/flink-app-producer-taskmanager-1-1`? [21:26:08] y'all were right, the operator basically listens for a namespace and (theoretically) handles the lifecycle of the flink pods [21:26:43] ryankemper: probably the jobmanager is the one that has to restart, it's the coordinator and would be deciding what topics to read. I'll restart it in 1 sec [21:27:01] * ebernhardson is in parallel doing the mw backport window now, since noone showed up to run it and i have a config patch [21:27:32] In theory you should be fine killing a pod, operator should make a new one for you. If I can help LMK, I got a lot of k8s reps doing the rdf-streaming-updater stuff [21:28:03] i've just applied the change to remove the extra cli arg, that ends up recreateing everything [21:28:33] oh doh, i wasn't thinking far enough. That only changed the consumer. So good removed the thing we don't need, but it didn't restart the producer :P [21:29:00] Shotgun approach FTW ;P [21:29:20] ryankemper: need to apply the same topic settings on both topics from codw kafka-main [21:29:22] *codfw [21:30:18] hmm, i guess would be curious what helm/k8s/flink operator do when deleting the pod. Could try deleting the flink-app-producer. It only writes to relforge so no harm [21:34:00] did helmfile consider your last changes as a no-op (for the producer, that is)? [21:35:07] inflatador: yes, it only restarted the consumer [21:35:39] so i deleted the producer, and it stood up a new one. [21:35:52] curiously the taskmanager didn't get restarted with it [21:36:31] ebernhardson maybe the task manager can just hang around and wait for jobs? [21:37:15] maybe, it's curious :) it does seem like there is no strict reason to need a new taskmanager instance depending on how things are implemented [21:37:28] the taskmanager just accepts tasks given to it and runs them [21:37:45] I dunno about the producer being considered a no-op...that sounds like a problem w/helmfile logic maybe? Like we might have to define something in two places on the chart? [21:38:36] curiously i don't see page_rerender mentioned anywhere in the logs :S but that might be something else. will ponder [21:39:07] np [21:45:16] ebernhardson: oh right I only ran the commands on kafka-main1001. So I need to do the same on kafka-main2* yeah? 
[21:45:27] ryankemper: yup, they are independent [21:47:02] ebernhardson: done
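Condensed, the verify-and-restart sequence from this exchange looks roughly like the following. The codfw broker hostname and the jobmanager label selector are assumptions; the rest is lifted from the commands above.

  # confirm the new partition count on both clusters (the topic metadata line reports it)
  kafkacat -b kafka-main1005 -L | grep page_rerender
  kafkacat -b kafka-main2005 -L | grep page_rerender   # codfw host name assumed

  # point kubectl/helmfile at the release, then nudge the app so the Kafka source
  # re-discovers partitions: either delete the jobmanager pod and let the
  # flink-kubernetes-operator recreate it...
  kube_env cirrus-streaming-updater staging
  kubectl delete pod -l component=jobmanager   # label assumed; pod names carry hashes

  # ...or re-apply via helmfile from the service's helmfile.d directory after a non-no-op change
  helmfile -e staging apply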