[08:22:54] pfischer: o/, re using local disk: it might be possible, but then we have to tell k8s not to recreate a pod when the job restarts and/or use persistent volumes, and that part seems harder
[08:23:13] on the other hand it should be possible to download a checkpoint/savepoint and investigate locally
[10:53:35] lunch
[13:06:31] dcausse: thanks, I attached my local instance of the flink application to the kafka-main topics and can reproduce the errors described by Erik. Since I do not use S3/swift locally, the cause must be something different.
[13:07:36] pfischer: nice! curious to know what's the cause :)
[13:09:55] Me too, it's interesting to see the app run against production loads.
[14:15:07] o/
[14:29:55] pfischer: (and others): the SUP sync meeting tomorrow conflicts with the Airflow deep dive. Do we still need this sync? Should we skip tomorrow? Or cancel completely and call a meeting when there is a need?
[14:33:48] ebernhardson: I'm moving our 1:1 30 minutes later to make some space. Let me know if that's hurting your lunch break and I'll find some other time.
[15:08:16] gehel: The way I see it, there are currently no blockers that need a meeting like this to unblock them. We are allowed to use kafka-main + swift & zookeeper for state handling. inflatador: Do you see any blockers that we could use help with?
[16:00:20] ebernhardson: we're deploying rdf-streaming-updater w/ flink-operator for the 1st time. Getting perm errors around creating flinkdeployments: https://phabricator.wikimedia.org/P53141 . Did you have this problem with cirrus-streaming?
[16:01:28] inflatador: nope, no problems there
[16:01:37] pfischer: triage meeting: https://meet.google.com/eki-rafx-cxi
[16:22:53] blazegraph@wdqs1014 seems stuck with a very high threadcount and might need a restart
[16:31:11] * ebernhardson wishes phabricator understood UTC means i don't care about timezones
[16:34:30] rebooted wdqs1014, watching lag
[16:37:38] inflatador: maybe you need to apply on admin_ng, if you recently added rdf-streaming-updater to staging watch namespaces?
[16:56:12] pfischer: so i spent the better half of friday investigating that serialization issue... and other than attaching to kafka-main i don't feel like i got anywhere
[16:57:08] i can keep looking into it, my current suspicion is still somewhere along the lines of multiple threads conflicting, perhaps writing to the same MemorySegment object or some such
[16:58:31] break, back in ~30
[17:07:08] is the serialization issue always after the "rawFields" field?
[17:08:34] might be a stupid question, I guess the stack might not be sufficient to determine what it was reading
[17:12:15] dcausse: hmm, i'm not sure about the order there, it's hard to tell
[17:12:34] * ebernhardson feels like the elasticsearch serialization scheme, while more verbose, is quite a bit simpler :P
[17:13:30] dcausse: no, it's early on. first change type, event time, ingestion time. This is where it breaks
[17:13:52] another curious thing, it doesn't fail the same way every time. There are 3 or perhaps 4 failures it seems to almost randomly choose between
[17:14:00] yes... that's my impression too, whenever you try to make things serialize "automagically" you hit a wall unless you use really simple model classes
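As background to the "simple model classes" remark: Flink's PojoSerializer only applies to classes with a fairly rigid shape, roughly as sketched below. The class and field names here are invented for illustration and are not the actual update-event model.

```java
// A hypothetical event class shaped the way Flink's PojoSerializer expects;
// class and field names are made up for this sketch.
public class PageChangeEvent {
    // For Flink to treat this as a POJO: the class must be public, it must
    // have a public no-arg constructor, and every non-transient field must be
    // public or reachable through conventional getters/setters.
    public String changeType;
    public long eventTimeMs;
    public long ingestionTimeMs;

    // Required: public no-argument constructor.
    public PageChangeEvent() {}
}
```

Anything that falls outside this shape silently drops back to Kryo or needs a hand-written serializer, which is usually where the "automagic" approach starts to hurt.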
[17:14:49] fields are sorted alphabetically by the pojo serializer BTW
[17:14:55] i suppose on the upside, those failures it rotates through seem to repeat in the same way, so it's probably at least partially deterministic
[17:18:08] ebernhardson: another thing: I noticed canary events slipping through, see MR
[17:19:17] pfischer: curious, the event had a domain other than canary?
[17:19:55] Obviously, at least it showed up way too late in the graph
[17:20:29] i wonder how that happens :S
[17:20:51] i guess maybe it's because of the different input schemas, and they don't all define their canary the same way?
[17:22:36] hmm, the schema examples don't say canary directly, that must be applied somewhere else (perhaps the producer). Anyways it's probably fine, I'm just curious where the variation comes from
[17:22:58] I thought they are created dynamically by EventGate. Wondering is my continuous state since attaching to the proxied brokers
[17:23:26] i was under the impression it takes the example event from the schema and produces it, but i suppose i'm not sure
[17:43:34] back
[17:50:41] ottomata: I deployed admin_ng on staging, are you saying I might need to apply it to prod envs too?
[17:52:24] inflatador: no, if you are only deploying now in staging you don't need to do prod. mostly was asking if you did what you already did :)
[17:59:05] the flink-operator recognized that it's watching the rdf-streaming-updater namespace as of 5 minutes ago. Only thing I did was delete the networkpolicy associated with staging rdf-streaming-updater... that's caused problems in the past
[18:07:22] curious, i deleted the second event from UpdateEventConverters.fromPageChange and it stopped failing (but i also deleted like 15 other things :P time to put some of it back)
[18:10:54] so, if i change the second event to create a new TargetDocument, instead of reusing the one from sourceEvent, it seems to stop having serialization issues.
[18:11:17] oh
[18:12:22] i can't explain why though :P It seems reasonable to me and TargetDocument is a static child class so it shouldn't be holding weird state
[18:14:37] you mean if you make a copy at: ImmutableSet.of(sourceEvent.getTargetDocument())
[18:14:47] dcausse: yea, use toBuilder().build()
[18:14:57] hm...
[18:16:18] it then fails after about a minute about not finding fiwiki, but i think that's unrelated
[18:16:51] java serialization would try to map that to the same instance if they happen to be serialized in the same stream, I don't think the flink Pojo serializer has this kind of magic
[18:17:28] indeed, not sure I understand yet why having the same instance could cause weird issues here
[18:18:30] hmm, no, maybe i spoke too soon. it did run a few times for longer, but now it's back :S
[18:19:05] but commenting out events.add(targetEvent) seems to let it continue on
[18:22:44] * ebernhardson isn't really making heads or tails of this, just deleting things and seeing what happens :P
[18:25:49] ah, StreamSupport.stream(converter.apply(row).spliterator(), true)
[18:25:57] it's parallel
[18:26:46] * ebernhardson remembers a recent HN thread about java stream apis, and how parallel streams are mostly for creating bugs :P
[18:26:53] collector.collect might not like that
[18:26:59] :)
[18:27:34] so, if collector.collect isn't thread safe then both things could get serialized into the output
[18:27:40] you guessed right, but spotting that with the stream api is definitely hard
[18:27:41] overwriting each other
[18:27:42] yes
[18:28:50] so indeed setting that to false seems to be running now. woo!
[18:29:00] :)
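To make the suspected failure mode concrete, here is a minimal sketch of the pattern being discussed, assuming the per-row converter feeds a Flink Collector from inside a Stream pipeline. Only the StreamSupport call and collector.collect come from the chat; the surrounding class, names, and generics are hypothetical.

```java
import java.util.function.Function;
import java.util.stream.StreamSupport;

import org.apache.flink.util.Collector;

// Hypothetical wrapper around the row-to-events conversion discussed above.
class ConvertAndEmit<IN, OUT> {
    private final Function<IN, Iterable<OUT>> converter;

    ConvertAndEmit(Function<IN, Iterable<OUT>> converter) {
        this.converter = converter;
    }

    void emit(IN row, Collector<OUT> out) {
        // With the second argument set to true this becomes a parallel stream,
        // so out.collect() can be invoked from several fork-join threads at
        // once. Flink's Collector is not thread safe, so concurrent calls can
        // interleave the records being written, and the corruption only shows
        // up later as seemingly random deserialization failures.
        StreamSupport.stream(converter.apply(row).spliterator(), /* parallel = */ false)
            .forEach(out::collect);
    }
}
```

Flipping the flag to false keeps everything on the calling thread, which matches the fix that made the job run.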
[18:29:22] then it fails because it thinks fiwiki is a hostname instead of a domain name, something wrong in our namespace lookup but that's fine
[18:29:35] err, it thinks fiwiki is a domain name instead of a database name
[18:30:08] seems more like the kind of bug we expect :P
[18:30:55] indeed, this i can figure out in an hour or two i imagine :)
[18:31:05] OK, I think this will fix the k8s RBAC stuff for rdf-streaming-updater https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972005
[18:33:29] inflatador: seems reasonable but I have no clue how all this works...
[18:33:40] adding Andrew to the patch
[18:35:01] oh, that's because we had a special "deploy-flink" user
[18:35:23] got a +1 from j-ayme as well
[18:35:50] ottomata: if you have any feedback on ^^ let us know
[18:50:50] OK, rdf commons and wikidata releases are deploying in staging
[18:52:11] getting some jaas/auth errors
[18:52:52] `"ERROR","message":"Authentication failed", "ecs.version": "1.2.0","process.thread.name":"main-EventThread","log.logger":"org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState"}`
[18:53:07] It's also warning about ZK auth, but we don't have any so that should be OK
[19:02:13] ZK created the new znodes for flink-app-commons and flink-app-wikidata no problem
[19:20:41] hmmm https://support.huaweicloud.com/intl/en-us/dli_faq/dli_03_0165.html
[19:26:56] ran a few days of events through the producer on my laptop, doesn't fall over anymore :) unclear if things are correct, but better
[19:27:54] https://phabricator.wikimedia.org/P53141#215081 updated with the flink error... no clue what it's trying to auth to that's causing the failure
[19:28:18] guessing it's more kubernetes RBAC stuff
[19:43:05] inflatador: the only auth that I know of is swift, might be that the swift key is not in the right place (see how secrets are mapped in /etc/helmfile-defaults/private/main_services/rdf-streaming-updater/staging.yaml)
[19:43:48] dcausse: ACK, will take a look at that
[19:44:28] i looked over those errors, but it's not clear to me if that's a problem. Unfortunately it's common practice in java to log stack traces that aren't a problem. Is this under `kube_env rdf-streaming-updater dse-k8s-eqiad`, or how would i find the containers?
[19:44:47] ebernhardson: `kube_env rdf-streaming-updater staging`
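For reference, roughly how one might locate and inspect the containers in question from a deploy host. `kube_env` and the FlinkDeployment objects are as used above; the pod-name placeholder and the grep pattern are illustrative assumptions, not taken from the actual cluster.

```bash
# Rough sketch for poking at the staging deployment; adjust names as needed.
kube_env rdf-streaming-updater staging   # point kubectl at the staging namespace

kubectl get flinkdeployments             # operator-managed apps and their status (e.g. RECONCILING)
kubectl get pods                         # jobmanager/taskmanager pods created by the operator
kubectl logs <jobmanager-pod> | grep -iE 'auth|zookeeper|curator'   # hunt for the errors pasted above
kubectl get networkpolicy -o yaml        # confirm which egress destinations are actually allowed
```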
[19:45:40] flinkdeployments are hanging in `RECONCILING` status
[19:48:27] might need to add https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/flink-session-cluster/templates/_flink-conf.tpl#L28 to the flink-app chart
[19:48:29] hmm, it does seem to really be insisting on authenticating with zookeeper
[19:49:11] well, hmm
[19:49:48] inflatador: more the other way around, by making /etc/helmfile-defaults/private/main_services/rdf-streaming-updater/staging.yaml look like /etc/helmfile-defaults/private/dse-k8s_services/rdf-streaming-updater/dse-k8s-eqiad.yaml
[19:50:37] rename config.private.swift_api_key to app.flinkConfiguration.s3.secret-key
[19:50:43] inflatador: i don't see the zk hosts in `kubectl get -o yaml networkpolicy`
[19:50:49] I dunno about ZK, thinking we would have seen this in dse-k8s if that was the issue
[19:50:56] inflatador: need a section that defines the allowed zk clusters
[19:51:13] I can also see the znodes being created as soon as I deploy the chart
[19:53:16] inflatador: i think this is what you need: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/972014
[19:55:09] ebernhardson: sorry for the confusion, the changes aren't merged. We do need that, but it's already active since we're deploying from a patchset that includes it
[19:55:14] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967229
[19:56:21] inflatador: hmm, where are you deploying from? Not seeing it in deployment-charts on deploy2002
[19:56:46] ebernhardson: cloned the patched version to my homedir on deploy2002
[19:59:21] hmmm
[20:00:16] inflatador: it's gotta be the networkpolicy, compare the output for cirrus updater vs rdf. The question would be why it isn't being used... is egress enabled?
[20:01:42] inflatador: you are missing this: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/cirrus-streaming-updater/values-main.yaml#61
[20:04:18] ebernhardson: interesting. I'm kinda dubious since I see it making the znodes immediately after deploy, but on the other hand that should be explicit somewhere... let me amend my patch
[20:04:28] inflatador: that's probably the admin_ng side?
[20:06:35] ebernhardson: good point, just updated my patchset
[20:13:05] that was it! Many thanks ebernhardson
[20:13:19] wait.. maybe not
[20:13:23] we are getting further though
[20:14:02] OK, now we're getting S3 errors... time to follow d-causse's suggestion ;)
[20:32:31] ebernhardson: 1:1? Or is it a bad time?
[20:33:17] Happy to move to tomorrow if that's better. Might give us both time to get ready for ITC
[20:33:49] gehel: omw, just distracted
[22:45:46] ebernhardson: we're getting further, but now we're hitting timeouts to kafka. Does the rdf-streaming-updater try to read Kafka from both DCs? I was thinking it just reads the local Kafka
[22:47:36] ebernhardson: NM, I think we just have the wrong kafka settings for staging... needs kafka-test instead of kafka-eqiad
[22:56:16] nope, still acting like it can't connect... hmm
[23:07:43] Out for the day... will check on kafka connectivity tomorrow
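If the timeouts persist after switching to kafka-test, a crude probe from inside one of the pods can help tell a wrong broker list apart from egress blocked by the networkpolicy. Everything below the kube_env line is an assumption: pod name, broker host, and port are placeholders, and it presumes the Flink image ships bash (for the /dev/tcp trick).

```bash
# Hedged connectivity check from inside a taskmanager pod; replace the placeholders.
kube_env rdf-streaming-updater staging
kubectl exec <taskmanager-pod> -- bash -c \
  'timeout 5 bash -c "cat < /dev/null > /dev/tcp/<broker-host>/<port>" && echo reachable || echo blocked'
```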