[07:41:12] <dcausse>	 gehel: I think we should reply to Léa (https://www.wikidata.org/wiki/Wikidata_talk:Events/Data_Modelling_Days_2023#c-Lea_Lacroix_%28WMDE%29-20231106080100-GLederrey_%28WMF%29-20231103164700)
[07:41:59] <dcausse>	 inflatador: nice! the rdf-streaming-updater@staging is listening to kafka-main not kafka-test so I think that the allowed kafla cluster should reflect this
[08:26:48] <gehel>	 dcausse: does the time proposed by Léa works for you (2pm CET)
[08:26:55] <dcausse>	 gehel: yes
[10:40:15] <dcausse>	 errand+lunch
[13:54:33] <pfischer>	 dcausse: I started to crunch numbers of running the flink aggregator against kafka-main (https://docs.google.com/spreadsheets/d/1Fp44MdLxUVlxi03MBD_64m0zQErny-9jUD5C6RGf_bU/edit#gid=1007644065) and noticed that we do not consume page_rerender. According to https://stream.wikimedia.org/v2/ui/#/ the stream does not exist yet. Do we need a ticket to move this forward?
[13:55:52] <dcausse>	 pfischer: it's enabled (https://meta.wikimedia.org/w/api.php?action=streamconfigs) but only for testwiki
[13:56:06] <dcausse>	 https://stream.wikimedia.org/v2/ui/#/ is only for publicly available stream IIRC
[13:57:21] <dcausse>	 the reason why I enabled this only for testwiki is that the page_rerender will be one of the "large topic" and we told serviceops that we will gradually ramp up on this one while we remove traffic from older CirrusSearch topic
[13:57:55] <dcausse>	 if we enable all wikis for page_rerender then we might use more space on kafka-main than what we told them
[14:00:20] <inflatador>	 <o/
[14:00:25] <dcausse>	 o/
[14:01:19] <dcausse>	 perhaps we can ask them if it's OK to enable it for all wikis, or possibly you could write a small wrapper that reads the existing CirrusSearchLinksUpdate topic
[14:13:05] <inflatador>	 rdf-streaming-updater is talking to Kafka, but we have some new errors. Might still be network-related, maybe HTTP routes https://logstash.wikimedia.org/goto/4292ec2974c3f3c8cbae562e88edebf3
[14:18:58] <inflatador>	 hmmm, maybe not? `java.lang.IllegalStateException: Cannot receive event after a delete event received`
[14:29:27] <inflatador>	 I see a lot of warnings like `"The configuration 'client.id.prefix' was supplied but isn't a known conf
[14:29:27] <inflatador>	 ig`
[14:36:57] <pfischer>	 inflatador: those warnings come from kafka, when different clients are created (read/write/admin) but all get the same set of options (to reduce number of config options to be passed in)
[14:37:47] <inflatador>	 pfischer ah, then probably not important
[14:40:32] <gehel>	 errand, back in 20'
[14:43:52] <dcausse>	 inflatador: your logstash query seems too broad and capture things that are not related to the wdqs updater
[14:44:50] <dcausse>	 e.g. java.lang.IllegalStateException: Cannot receive event after a delete event received is from the k8s-operator
[14:45:05] <dcausse>	 not related to kafka
[15:03:00] <inflatador>	 Y, I wonder if that is related to ZK data corruption...just a guess, still looking
[15:08:46] <inflatador>	 bug here with zero useful info ;( https://issues.apache.org/jira/browse/FLINK-32093
[15:15:29] <dcausse>	 seems like flink can't talk to k8s
[15:31:34] <inflatador>	 if it is k8s API access, maybe the flink-deploy user is missing perms that are implicit to other users...
[15:48:19] <dcausse>	 I don't see the egress rule to 10.64.16.203 (kubestagemaster1001.eqiad.wmnet)
[15:55:23] <inflatador>	 OK, let me give that a try
[15:58:22] <dcausse>	 they should be pulled automatically from .Values.kubernetesMasters (not sure where that is defined), the cirrus updater does not declare anything related to this...
[15:59:12] <dcausse>	 or perhaps that's the operator that is meant to create these network policies on the fly
[16:00:26] <dcausse>	 now I see them... nvm
[16:02:20] <dcausse>	 does not want to connect to zk now, missing egress rules again
[16:02:36] <inflatador>	 Y I think I had a YAML spacing issue, one sec
[16:04:51] <ebernhardson>	 \o
[16:05:20] <dcausse>	 o/
[16:05:47] <inflatador>	 <o/
[16:07:14] <ebernhardson>	 hmm, producer still crashed out :(  something with a blank wiki id
[16:10:03] <dcausse>	 ebernhardson: does this ring a bell these egress net policies from the flink pods to kubemaster api servers?
[16:10:33] <ebernhardson>	 dcausse: hmm, i don't think i remember doing anything with those specifically
[16:11:09] <dcausse>	 ok, still some helm magic we don't yet fully understand
[16:11:44] <ebernhardson>	 it should all be in the vendored bits
[16:11:58] <ebernhardson>	 so far i haven't seen any proper magic like in ruby, everything is strictly defined
[16:13:15] <dcausse>	 well for me .Values.kubernetesMasters is still mysterious :) can't find from where they come from
[16:14:35] <dcausse>	 ah they're only in admin_ng
[16:14:35] <ebernhardson>	 hmm, indeed thats not super obvious at first glance :)
[16:15:36] <dcausse>	 so it must be the k8s-operator pushing those policies to the client namespace?
[16:16:12] <ebernhardson>	 that seems plausible, they are certainly defined in admin_ng
[16:20:46] <ebernhardson>	 dcausse: yea that seems to be whats happening, looking at the cirrus version the rendered networkpolicy for talking to kubestagemaster1001 is from the flink-kubernetes-operator templates, which is only used in admin_ng
[16:22:19] <ebernhardson>	 dcausse: important to note admin_ng is a submoduled thing, unlike the regular services which are deployed independenalty. I don't entirely know what that means, but it does mean that the flink-kubernetes-operator is reading the staging-eqiad/values.yaml in admin_ng and thats where kubernetesMasters is comeing from
[16:23:39] <ebernhardson>	 as for how that doesn't end up in the deployed thing .... i guess some day i will have to figure out what flink-kubernetes-operator actually does
[16:31:50] <ebernhardson>	 dcausse: if i do a helmfile diff from brian's copy of deployment-charts, i see the appropriate ip in networkpolicy. but curiously it's only ipv4 and not ipv6 like the cirrus one, it also has less ports opened to kubestagemaster
[16:34:27] <ebernhardson>	 i dunno :(
[16:35:00] <pfischer>	 ebernhardson: I have my local copy of the producer running against kafka-main for 7.5h w/o any hickups
[16:36:23] <ebernhardson>	 pfischer: yea, i ran mine against 4 days of input from kafka-main and it seemed to work, i was optimistic this was going to keep running overnight :)
[16:38:50] <pfischer>	 Do you have a stacktrace?
[16:39:45] <ottomata>	 ^^ qq, are you guys sure you want to run local consumer tests from kafka-main?  topics should be replicated to kafka-jumbo...might be safer to do it from there?
[16:39:59] <ebernhardson>	 pfischer: https://phabricator.wikimedia.org/P53150
[16:40:37] <ebernhardson>	 ottomata: can if you prefer, this is a read-only bit
[16:40:54] <ottomata>	 true, but even so, if something goes wrong and you overload kafka main, you gonna break wikipedia :)
[16:41:12] * ebernhardson only broke kafka once :P
[16:41:21] <ebernhardson>	 but fair, i'll switch it over
[16:41:37] <ottomata>	 ty, better safe than sorry! :D
[16:42:30] <ebernhardson>	 pfischer: so the curious bit of that event is it looks completely empty
[16:42:53] <pfischer>	 Yes, only one timestamp looks like an actual value
[16:43:06] <ebernhardson>	 and i dont know why it has 2021-02-03T17:47:57.393Z as a timestamp, awfully specific but obviously not an expected value for anything
[16:43:37] <ebernhardson>	 the 2023-11-06 timestamp is probably the ingestion 
[16:43:49] <pfischer>	 Could be the original page creation time?
[16:44:51] <ebernhardson>	 Do we ever get that value from an event? I guess some events contain a serialization of the previous revision, but i don't know we used that for anything
[16:45:16] <gehel>	 ebernhardson: are we ok with the new search-lopders? Could you confirm on T346039?
[16:45:19] <pfischer>	 But they should not show up in our internal update events
[16:45:44] <ebernhardson>	 gehel: i haven't looked at all yet, will look into it
[16:46:16] <ebernhardson>	 graphs look plausible
[16:47:23] <stashbot>	 T346039: Migrate search-loader hosts to Bullseye or later - https://phabricator.wikimedia.org/T346039
[16:48:17] <pfischer>	 ebernhardson: the fact that it doesn’t event contain a cirrussearch_cluster_group should have prevented it from being transcoded into a Row before it makes it to the kafka sink
[16:54:54] <ebernhardson>	 hmm, maybe i can get the consumer offset the app is at and restart locally from there...
[16:55:51] <inflatador>	 explicitly adding the egress rules for kubestagemaster seems to have overridden other FW rules-going to look at RBAC stuff again
[16:56:41] <ebernhardson>	 inflatador: i'm also a bit suspicious of the RBAC thing and wonder if we are missing something somewhere, i suppose the suspicious part is cirrus-streaming-updater, and rdf-on-dse both didn't seem to need it. Seems like there is some detail we are missing
[16:57:04] <inflatador>	 yeah agreed
[16:58:05] <inflatador>	 The flink-deploy user is unique in RBAC...it has ostensibly has more permissions, but it seems like implicit permissions that other deploy users have might be missing
[16:59:29] <inflatador>	 but cirrus-streaming-updater uses flink-deploy too, right? Hmmm
[17:03:57] <ebernhardson>	 inflatador: random guess, any chance something funny in /etc/helmfile-defaults/private/main_services/rdf-streaming-updater/ ?
[17:04:34] <inflatador>	 Adding the egress rules for k8s staging API definitely helped. We're getting taskmanager pods now
[17:05:15] <ebernhardson>	 oh, actually i can read those files.  Nothing too suspicious :(
[17:06:25] <inflatador>	 Now I'm getting quota-related errors. Progress!
[17:06:35] <inflatador>	 Workout now but will take a look in ~40
[17:54:13] <inflatador>	 back
[18:03:20] <pfischer>	 ebernhardson: I tried to look up the consumer offsets for my local test consumer but the admin client only gives me empty results. Using kafka-consumer-groups.sh returns with onsumer group ‘EndoToEndIT’ does not exist. Since __consumer_offsets uses non-plain encoding, there’s no way to look them up using kafkacat. How do you get the offset your local test stopped at?
[18:07:33] <ebernhardson>	 pfischer: hmm, what i've done before is used the python library to connect a consumer to the group and print it's position. To work this needs to stop all other consumers
[18:08:04] <ebernhardson>	 sec
[18:13:09] <inflatador>	 hmm, looks like staging is still asking for production's amount of resource. I thought more specific (values-staging) would override less (values) but let me move the prod values out of base values
[18:14:14] <inflatador>	 ah, or just a YAML spacing issue ;(
[18:20:05] <ebernhardson>	 pfischer: something like this: https://phabricator.wikimedia.org/P53151
[18:28:31] <ebernhardson>	 there should certainly be a way to do that more elegantly, but i'm not sure how
[18:31:18] <inflatador>	 lunch, back in time for pairing
[18:57:36] <ebernhardson>	 not having any luck convincing my local to blow up like prod :(
[19:16:10] * ebernhardson goes ahead and destroy/recreate the prod app...going to pretend bad serialization got into the state and something weird happens :P
[19:17:55] <inflatador>	 back
[19:19:14] <ebernhardson>	 seems a plausible guess, it looks like the offsets stored in kafka were still used after resetting the app (makes sense), but it restarted from checkpoint 0 indicating it's not loading prior state
[19:20:21] <ebernhardson>	 i wonder if we should be using any kind of consumer throttling, or just letting it go when catching up. It's currently doing 3-5k events per something (i wish dashboards would say /sec or /min or whatever)
[19:20:21] <pfischer>	 Yes, they do get reused, I saw that locally too, you can forcefully override this via config option
[19:22:02] <pfischer>	 —kafka-source-start-offset.<topic-name[:partition]>=<offset>
[19:23:32] <pfischer>	 When resuming the back-pressure goes up to 100% for the window operator but after a few minutes it was back to 0%
[19:24:03] <ebernhardson>	 yea thats what it's doing now, because the updater was dead for a little while its ~11M events behind
[19:24:12] <ebernhardson>	 i suppose on the upside, it's filtering most of those out
[19:24:48] <ebernhardson>	 1.5M over the last 5 minutes, so it will run a bit longer
[19:25:00] <pfischer>	 Do you have a wiki filter active?
[19:25:07] <ebernhardson>	 this is in prod, it should
[19:25:29] <ebernhardson>	 so it's getting 5k/sec from kafka, but not nearly that many coming out into the merge phase
[19:26:23] <pfischer>	 Ah, I thought you’re running locally. I observed a reduction by 0.6 (window out / window in)
[19:28:02] <inflatador>	 staging commons seems to be making checkpoints, still waiting for state to transition though
[19:35:14] <ebernhardson>	 new errors! now the consumer falls over :)
[19:35:28] <inflatador>	 OK! commons is stable, wikidata won't deploy at all. Almost certainly a quota issue, checking that out
[19:35:54] <ebernhardson>	 shouldn't be too bad: Elasticsearch exception [type=class_cast_exception, reason=class java.util.ArrayList cannot be cast to class java.util.Map
[20:29:44] <gehel>	 ebernhardson: I'll be 3' late to our ITC (need a quick break)
[21:13:55] <pfischer>	 ebernhardson: regarding the comparison script: would you prefer a way of passing page IDs instead of bulking? Or would you start with a copy of the production index so it's not empty?
[21:16:25] <ebernhardson>	 pfischer: hmm, i don't quite follow
[21:21:41] <pfischer>	 ebernhardson: As it's implemented right now, the comparison script will iterate over all documents in index A and look them up/compare them with documents in index B
[21:22:51] <pfischer>	 The pipeline would only update/create documents for pages that have been change within the last 7 days
[21:23:46] <pfischer>	 So the pipeline-fed index will be smaller than the existing cirrus-fed one
[21:24:03] <ebernhardson>	 pfischer: ahh, i would 100% start with a copy of the prod index
[21:24:22] <ebernhardson>	 best case using elasticsearch snapshots to copy index from prod ->swift and then swift->relforge. Worst case by importing from dumps
[21:38:01] <inflatador>	 ryankemper heads up, restarting cloudelastic for T350703 . Have it running in a tmux window on cumin2002
[21:38:02] <stashbot>	 T350703: Restart Elasticsearch services for java 11 updates - https://phabricator.wikimedia.org/T350703
[21:38:20] * ebernhardson realizes now that the updater probably can't handle changing a redirect to a new target
[21:40:30] <ebernhardson>	 yea, the page change event doesn't tell us enough to remove the redirect from the old target
[21:41:34] <pfischer>	 ebernhardson: you mean, that we do not remove the redirect from the original target document?
[21:41:41] <ebernhardson>	 pfischer: right
[21:41:58] <ebernhardson>	 we don't know what the original target document was when it changes
[21:42:28] <ebernhardson>	 this is event i just recorded: https://phabricator.wikimedia.org/P53159  The old target was "Example" but it's only mentioned in the revision comments
[21:42:31] <pfischer>	 I think that's supposed to be solved by updates/fetches based on rerenders
[21:42:36] <ebernhardson>	 ahh, ok
[21:43:36] <ebernhardson>	 i guess i'll have to look closer into when those run, i'm not sure if they trigger from redirect changes. Most rerenders are templates, but i'm sure there are other things in there
[21:44:24] <pfischer>	 Right now, only test wiki would fire them anyways
[21:44:49] <ebernhardson>	 oh right i need to turn that on for it to do anything in my dev
[21:47:40] <pfischer>	 Alright, I'll call it a day. I'll look into that JSON mapping issue in the updater tomorrow unless you already fixed it by then. 😉
[21:47:57] <ebernhardson>	 pfischer: i have a fix almost ready to push, it was missing redirect noop hints
[21:54:52] <inflatador>	 ebernhardson do you remember where/how you allowed access to kubemaster IPs for cirrus-streaming-updater? I'm directly adding the IPs to values.yaml ATM and I feel there's a better way?
[21:55:24] <ebernhardson>	 inflatador: i didn't add them directly. They should be rendered from the admin_ng side
[21:56:45] <ebernhardson>	 inflatador: if my reading is right, the charts/flink-kubernetes-operator/templates/networkpolicy.yaml is where it should come from, the "flink-operator.k82-egress-rule" template. 
[21:57:20] <ebernhardson>	 at least, that template matches what i see in the cirrus-streaming-updater networkpolicy when i look for the kubemaster ip
[21:58:30] <inflatador>	 I see it for the operator, but confused why it doesn't work for the rdf-streaming-updater app itself. I couldn't get it to provision pods until I explicitly allowed the kubemaster IPs in the values.yaml for rdf-streaming-updater
[21:59:34] <ebernhardson>	 inflatador: it suggests to me something isn't set, but i'm not yet sure how it gets from the operator into the app
[22:00:14] <inflatador>	 Yeah, I see some charts that have it set, but not cirrus or rdf
[22:10:32] <ebernhardson>	 some reading suggests the injection of networkpolicy from admin to charts is via a webhook...but still not sure about how :P
[22:14:49] <ebernhardson>	 or not 
[22:17:04] <inflatador>	 all good...I'll hardcode for now. After this patch/migration there will be plenty of opportunities for cleanup ;)
[22:24:11] <ebernhardson>	 inflatador: i don't know how to solve, but I can say for certainty the admin_ng side is injecting a "flink-pod-k8s-api" networkpolicy that allows talking to k8s master in cirrus and somehow it's not being injected to the rdf one
[22:25:45] <ebernhardson>	 the only obvious thing from the templates themselves would be that either egress is disabled or kubernetesMasters is unset, but i dont see how those vary between the two deployments :(
[22:31:06] <inflatador>	 looks like kubernetesMasters is set in helmfile.d/admin_ng/values/staging-eqiad/values.yaml among other places, maybe I need to pull that in to the chart
[22:41:20] <inflatador>	 I think we're going to need a bigger RAM quota for RDF too, since we have separate releases for commons and wikidata now