[08:22:54] pfischer: o/, re using local disk: it might be possible, but then we have to tell k8s not to recreate a pod when the job restarts and/or use persistent volumes, and that part seems harder
[08:23:13] on the other hand it should be possible to download a checkpoint/savepoint and investigate locally
[10:53:35] lunch
[13:06:31] dcausse: thanks, I attached my local instance of the flink application to the kafka-main topics and can reproduce the errors described by Erik. Since I do not use S3/swift locally, the cause must be something different.
[13:07:36] pfischer: nice! curious to know what's the cause :)
[13:09:55] Me too, it's interesting to see the app run against production loads.
[14:15:07] o/
[14:29:55] pfischer: (and others): the SUP sync meeting tomorrow conflicts with the Airflow deep dive. Do we still need this sync? Should we skip tomorrow? Or cancel completely and call a meeting when there is a need?
[14:33:48] ebernhardson: I'm moving our 1:1 30 minutes later to make some space. Let me know if that's hurting your lunch break and I'll find some other time.
[15:08:16] gehel: The way I see it, there are currently no blockers that need a meeting like this to unblock them. We are allowed to use kafka-main + swift & zookeeper for state handling. inflatador: Do you see any blockers that we could use help with?
[16:00:20] ebernhardson: we're deploying rdf-streaming-updater w/ flink-operator for the 1st time. Getting perm errors around creating flinkdeployments: https://phabricator.wikimedia.org/P53141 . Did you have this problem with cirrus-streaming?
[16:01:28] inflatador: nope, no problems there
[16:01:37] pfischer: triage meeting: https://meet.google.com/eki-rafx-cxi
[16:22:53] blazegraph@wdqs1014 seems stuck with a very high threadcount and might need a restart
[16:31:11] * ebernhardson wishes phabricator understood UTC means i don't care about timezones
[16:34:30] rebooted wdqs1014, watching lag
[16:37:38] inflatador: maybe you need to apply on admin_ng, if you recently added rdf-streaming-updater to staging watch namespaces?
[16:56:12] pfischer: so i spent the better half of friday investigating that serialization issue... and other than attaching to kafka-main i don't feel like i got anywhere
[16:57:08] i can keep looking into it, my current suspicion is still somewhere along the lines of multiple threads conflicting, perhaps writing to the same MemorySegment object or some such
[16:58:31] break, back in ~30
[17:07:08] is the serialization issue always after the "rawFields" field?
[17:08:34] might be a stupid question, I guess the stack might not be sufficient to determine what it was reading
[17:12:15] dcausse: hmm, i'm not sure about the order there, it's hard to tell
[17:12:34] * ebernhardson feels like the elasticsearch serialization scheme, while more verbose, is quite a bit simpler :P
[17:13:30] dcausse: no, it's early on. first change type, event time, ingestion time. This is where it breaks
[17:13:52] another curious thing, it doesn't fail the same way every time. There are 3 or perhaps 4 failures it seems to almost randomly choose between
[17:14:00] yes... that's my impression too, whenever you try to make things serialize "automagically" you hit a wall unless you use really simple model classes
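As background to the "simple model classes" remark: Flink's PojoSerializer only applies to classes with a fairly rigid shape, roughly as sketched below. The class and field names here are invented for illustration and are not the actual update-event model.

```java
// A hypothetical event class shaped the way Flink's PojoSerializer expects;
// class and field names are made up for this sketch.
public class PageChangeEvent {
    // For Flink to treat this as a POJO: the class must be public, it must
    // have a public no-arg constructor, and every non-transient field must be
    // public or reachable through conventional getters/setters.
    public String changeType;
    public long eventTimeMs;
    public long ingestionTimeMs;

    // Required: public no-argument constructor.
    public PageChangeEvent() {}
}
```

Anything that falls outside this shape silently drops back to Kryo or needs a hand-written serializer, which is usually where the "automagic" approach starts to hurt.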
[17:14:49] fields are sorted alphabetically by the pojo serializer BTW
[17:14:55] i suppose on the upside, those failures it rotates through seem to repeat in the same way, so it's probably at least partially deterministic
[17:18:08] ebernhardson: another thing: I noticed canary events slipping through, see MR
[17:19:17] pfischer: curious, the event had a domain other than canary?
[17:19:55] Obviously, at least it showed up way too late in the graph
[17:20:29] i wonder how that happens :S
[17:20:51] i guess maybe it's because of the different input schemas, and they don't all define their canary the same way?
[17:22:36] hmm, the schema examples don't say canary directly, that must be applied somewhere else (perhaps the producer). Anyways it's probably fine, I'm just curious where the variation comes from
[17:22:58] I thought they are created dynamically by EventGate. Wondering is my continuous state since attaching to the proxied brokers
[17:23:26] i was under the impression it takes the example event from the schema and produces it, but i suppose i'm not sure
[17:43:34] back
[17:50:41] ottomata: I deployed admin_ng on staging, are you saying I might need to apply it to prod envs too?
[17:52:24] inflatador: no, if you are only deploying now in staging you don't need to do prod. mostly was asking if you did what you already did :)
[17:59:05] the flink-operator recognized that it's watching the rdf-streaming-updater namespace as of 5 minutes ago. Only thing I did was delete the networkpolicy associated with staging rdf-streaming-updater... that's caused problems in the past
[18:07:22] curious, i deleted the second event from UpdateEventConverters.fromPageChange and it stopped failing (but i also deleted like 15 other things :P time to put some of it back)
[18:10:54] so, if i change the second event to create a new TargetDocument, instead of reusing the one from sourceEvent, it seems to stop having serialization issues.
[18:11:17] oh
[18:12:22] i can't explain why though :P It seems reasonable to me and TargetDocument is a static child class so it shouldn't be holding weird state
[18:14:37] you mean if you make a copy at: ImmutableSet.of(sourceEvent.getTargetDocument())
[18:14:47] dcausse: yea, use toBuilder().build()
[18:14:57] hm...
[18:16:18] it then fails after about a minute about not finding fiwiki, but i think that's unrelated
[18:16:51] java serialization would try to map that to the same instance if they happen to be serialized in the same stream, I don't think the flink Pojo serializer has this kind of magic
[18:17:28] indeed, not sure I understand yet why having the same instance could cause weird issues here
[18:18:30] hmm, no, maybe i spoke too soon. it did run a few times for longer, but now it's back :S
[18:19:05] but commenting out events.add(targetEvent) seems to let it continue on
[18:22:44] * ebernhardson isn't really making heads or tails of this, just deleting things and seeing what happens :P
[18:25:49] ah, StreamSupport.stream(converter.apply(row).spliterator(), true)
[18:25:57] it's parallel
[18:26:46] * ebernhardson remembers a recent HN thread about java stream apis, and how parallel streams are mostly for creating bugs :P
[18:26:53] collector.collect might not like that
[18:26:59] :)
[18:27:34] so, if collector.collect isn't thread safe then both things could get serialized into the output
[18:27:40] you guessed right, but spotting that with the stream api is definitely hard
[18:27:41] overwriting each other
[18:27:42] yes
[18:28:50] so indeed setting that to false seems to be running now. woo!
[18:29:00] :)
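To make the suspected failure mode concrete, here is a minimal sketch of the pattern being discussed, assuming the per-row converter feeds a Flink Collector from inside a Stream pipeline. Only the StreamSupport call and collector.collect come from the chat; the surrounding class, names, and generics are hypothetical.

```java
import java.util.function.Function;
import java.util.stream.StreamSupport;

import org.apache.flink.util.Collector;

// Hypothetical wrapper around the row-to-events conversion discussed above.
class ConvertAndEmit<IN, OUT> {
    private final Function<IN, Iterable<OUT>> converter;

    ConvertAndEmit(Function<IN, Iterable<OUT>> converter) {
        this.converter = converter;
    }

    void emit(IN row, Collector<OUT> out) {
        // With the second argument set to true this becomes a parallel stream,
        // so out.collect() can be invoked from several fork-join threads at
        // once. Flink's Collector is not thread safe, so concurrent calls can
        // interleave the records being written, and the corruption only shows
        // up later as seemingly random deserialization failures.
        StreamSupport.stream(converter.apply(row).spliterator(), /* parallel = */ false)
            .forEach(out::collect);
    }
}
```

Flipping the flag to false keeps everything on the calling thread, which matches the fix that made the job run.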
[18:29:22] then it fails because it thinks fiwiki is a hostname instead of a domain name, something wrong in our namespace lookup but that's fine
[18:29:35] err, it thinks fiwiki is a domain name instead of a database name
[18:30:08] seems more like the kind of bug we expect :P
[18:30:55] indeed, this i can figure out in an hour or two i imagine :)
[18:31:05] OK, I think this will fix the k8s RBAC stuff for rdf-streaming-updater https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972005
[18:33:29] inflatador: seems reasonable but I have no clue how all this works...
[18:33:40] adding Andrew to the patch
[18:35:01] oh, that's because we had a special "deploy-flink" user
[18:35:23] got a +1 from j-ayme as well
[18:35:50] ottomata: if you have any feedback on ^^ let us know
[18:50:50] OK, rdf commons and wikidata releases are deploying in staging
[18:52:11] getting some jaas/auth errors
[18:52:52] `"ERROR","message":"Authentication failed", "ecs.version": "1.2.0","process.thread.name":"main-EventThread","log.logger":"org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState"}`
[18:53:07] It's also warning about ZK auth, but we don't have any so that should be OK
[19:02:13] ZK created the new znodes for flink-app-commons and flink-app-wikidata no problem
[19:20:41] hmmm https://support.huaweicloud.com/intl/en-us/dli_faq/dli_03_0165.html
[19:26:56] ran a few days of events through the producer on my laptop, doesn't fall over anymore :) unclear if things are correct, but better
[19:27:54] https://phabricator.wikimedia.org/P53141#215081 updated with the flink error... no clue what it's trying to auth to that's causing the failure
[19:28:18] guessing it's more kubernetes RBAC stuff
[19:43:05] inflatador: the only auth that I know of is swift, might be that the swift key is not in the right place (see how secrets are mapped in /etc/helmfile-defaults/private/main_services/rdf-streaming-updater/staging.yaml)
[19:43:48] dcausse: ACK, will take a look at that
[19:44:28] i looked over those errors, but it's not clear to me if that's a problem. Unfortunately it's common practice in java to log stack traces that aren't a problem. Is this under `kube_env rdf-streaming-updater dse-k8s-eqiad`, or how would i find the containers?
[19:44:47] ebernhardson: `kube_env rdf-streaming-updater staging`
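For reference, roughly how one might locate and inspect the containers in question from a deploy host. `kube_env` and the FlinkDeployment objects are as used above; the pod-name placeholder and the grep pattern are illustrative assumptions, not taken from the actual cluster.

```bash
# Rough sketch for poking at the staging deployment; adjust names as needed.
kube_env rdf-streaming-updater staging   # point kubectl at the staging namespace

kubectl get flinkdeployments             # operator-managed apps and their status (e.g. RECONCILING)
kubectl get pods                         # jobmanager/taskmanager pods created by the operator
kubectl logs <jobmanager-pod> | grep -iE 'auth|zookeeper|curator'   # hunt for the errors pasted above
kubectl get networkpolicy -o yaml        # confirm which egress destinations are actually allowed
```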
[19:45:40] flinkdeployments are hanging in `RECONCILING` status
[19:48:27] might need to add https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/flink-session-cluster/templates/_flink-conf.tpl#L28 to the flink-app chart
[19:48:29] hmm, it does seem to really be insisting on authenticating with zookeeper
[19:49:11] well, hmm
[19:49:48] inflatador: more the other way around, by making /etc/helmfile-defaults/private/main_services/rdf-streaming-updater/staging.yaml look like /etc/helmfile-defaults/private/dse-k8s_services/rdf-streaming-updater/dse-k8s-eqiad.yaml
[19:50:37] rename config.private.swift_api_key to app.flinkConfiguration.s3.secret-key
[19:50:43] inflatador: i don't see the zk hosts in `kubectl get -o yaml networkpolicy`
[19:50:49] I dunno about ZK, thinking we would have seen this in dse-k8s if that was the issue
[19:50:56] inflatador: need a section that defines the allowed zk clusters
[19:51:13] I can also see the znodes being created as soon as I deploy the chart
[19:53:16] inflatador: i think this is what you need: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972014
[19:55:09] ebernhardson: sorry for the confusion, the changes aren't merged. We do need that, but it's already active since we're deploying from a patchset that includes it
[19:55:14] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967229
[19:56:21] inflatador: hmm, where are you deploying from? Not seeing it in deployment-charts on deploy2002
[19:56:46] ebernhardson: cloned the patched version to my homedir on deploy2002
[19:59:21] hmmm
[20:00:16] inflatador: it's gotta be the networkpolicy, compare the output for cirrus updater vs rdf. The question would be why it isn't being used... is egress enabled?
[20:01:42] inflatador: you are missing this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/cirrus-streaming-updater/values-main.yaml#61
[20:04:18] ebernhardson: interesting. I'm kinda dubious since I see it making the znodes immediately after deploy, but on the other hand that should be explicit somewhere... let me amend my patch
[20:04:28] inflatador: that's probably the admin_ng side?
[20:06:35] ebernhardson: good point, just updated my patchset
[20:13:05] that was it! Many thanks ebernhardson
[20:13:19] wait.. maybe not
[20:13:23] we are getting further though
[20:14:02] OK, now we're getting S3 errors... time to follow d-causse's suggestion ;)
[20:32:31] ebernhardson: 1:1? Or is it a bad time?
[20:33:17] Happy to move to tomorrow if that's better. Might give us both time to get ready for ITC
[20:33:49] gehel: omw, just distracted
[22:45:46] ebernhardson: we're getting further, but now we're hitting timeouts to kafka. Does the rdf-streaming-updater try to read Kafka from both DCs? I was thinking it just reads the local Kafka
[22:47:36] ebernhardson: NM, I think we just have the wrong kafka settings for staging... needs kafka-test instead of kafka-eqiad
[22:56:16] nope, still acting like it can't connect... hmm
[23:07:43] Out for the day... will check on kafka connectivity tomorrow
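If the timeouts persist after switching to kafka-test, a crude probe from inside one of the pods can help tell a wrong broker list apart from egress blocked by the networkpolicy. Everything below the kube_env line is an assumption: pod name, broker host, and port are placeholders, and it presumes the Flink image ships bash (for the /dev/tcp trick).

```bash
# Hedged connectivity check from inside a taskmanager pod; replace the placeholders.
kube_env rdf-streaming-updater staging
kubectl exec <taskmanager-pod> -- bash -c \
  'timeout 5 bash -c "cat < /dev/null > /dev/tcp/<broker-host>/<port>" && echo reachable || echo blocked'
```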