[07:23:06] very possible that it's a classloading issue, reminds me of a weird bug I saw with scala and flink, the "None" constant had 2 different instances so a simple pattern matching case None => foo did not work
[07:23:47] possibly we bundle some flink core classes in the fat-jar and that is not ok
[08:39:48] ebernhardson: in case you use IntelliJ as IDE: it creates a history of snapshots of files you changed that goes beyond the regular undo history.
[10:09:00] lunch
[13:10:58] o/
[14:40:46] dcausse for the operator migration, any opinion on whether we should move the default flinkConfiguration out of values.yaml and into a _flink-conf.tpl like it is with the session cluster?
[14:42:39] to be clear, I guess most of the values come from the values.yaml, but there are some static configs like the rocksdb stuff in the _flink-conf.tpl
[14:43:27] inflatador: hm... I don't think we can use a template in a services folder, it'd have to go in the flink-app chart, and if it goes there it has to be very generic
[14:44:32] dcausse yeah, it would definitely go in the charts folder. Just wondering if it would be worth the effort for the small amount of hard-coded values (like rocksdb)? Or should we just put them in values.yaml
[14:46:06] inflatador: given that WDQS/WCQS is the only job using rocksdb I feel that it's better to add them in the top-level values.yaml file
[14:46:49] dcausse ACK, will do
[14:46:53] but if you see values that are likely to be repeated across all WMF jobs it might make sense to move them into the flink-app chart
[14:47:27] but perhaps not via a template tho (can't remember how flink conf values are set there)
[14:48:36] we're using one for the session cluster, but it only has a few hard-coded values https://github.com/wikimedia/operations-deployment-charts/blob/master/charts/flink-session-cluster/templates/_flink-conf.tpl
[14:49:01] \o
[14:49:07] most of it appears to be loaded from the values files anyway
[14:49:09] o/
[14:49:29] yes they should
[15:05:56] dcausse: any useful ideas on the multiple BasicTypeInfo problem above? I think the class loader is ruled out, since `typeInfo == BasicTypeInfo.STRING_TYPE_INFO` is false, but `typeInfo.getClass() == BasicTypeInfo.class` is true. I suppose if it went through a serde cycle that could create a new instance, but that shouldn't happen
[15:07:02] ebernhardson: you said that it's only reproducible from prod and integration tests?
[15:07:26] dcausse: it happens reliably in prod and the consumer IT test, i got it to happen once in a regular test case but not since
[15:08:57] I'm not sure I understand why you're sure that classloading is not the issue?
[15:09:24] dcausse: my understanding is typeInfo.getClass() == BasicTypeInfo.class would return false if the classes came from different class loaders
[15:09:33] because it has to be the exact same object
[15:10:38] hm... depends on what classloader typeInfo.getClass() and BasicTypeInfo.class are coming from
[15:10:56] what matters I think is when serialization happens
[15:11:16] hmm, i suppose yea
[15:11:38] I suppose they could be the same depending on where you do this test
[15:11:39] hmm, maybe there is some jar hell addon i could stuff in there?
[15:11:58] * ebernhardson remembers something from elastic that protected against multiple classes
[15:12:14] flink does not have that sadly :(
[15:13:27] * dcausse building the jar locally to inspect it
[15:13:33] i was doing the test from a breakpoint where we decide the serializers to use, in JsonRowSerializationSchema.createConverter. We have a slightly modified version of Flink's that lives in event-utilities
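A minimal Java sketch of the identity puzzle being chased here, using a hypothetical Serializable "singleton" rather than Flink's actual BasicTypeInfo: a single plain-Java serde cycle produces a fresh instance (unless the class implements readResolve), so == against the constant fails while .equals() and getClass() comparisons keep passing — exactly the combination observed above.

    import java.io.*;

    public class SerdeIdentity {
        static class Singleton implements Serializable {
            static final Singleton INSTANCE = new Singleton();
            @Override public boolean equals(Object o) { return o instanceof Singleton; }
            @Override public int hashCode() { return 42; }
            // Adding this would restore reference identity after deserialization:
            // private Object readResolve() { return INSTANCE; }
        }

        public static void main(String[] args) throws Exception {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                out.writeObject(Singleton.INSTANCE);
            }
            Object copy;
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray()))) {
                copy = in.readObject();
            }
            // Mirrors the observations above: same class (same classloader),
            // logically equal, but not the same object.
            System.out.println(copy.getClass() == Singleton.class);  // true
            System.out.println(copy.equals(Singleton.INSTANCE));     // true
            System.out.println(copy == Singleton.INSTANCE);          // false
        }
    }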
[15:13:54] that's the area that does the == checks. I suppose i could change those == checks to .equals(), but feels like giving up :P
[15:14:08] also because the == checks come from upstream, we didn't write most of this, just imported it and made some small changes
[15:16:46] hm.. indeed the IT resources have create_timestamp\":{\"nano\":0,\"epochSecond\":1669130930 :/
[15:17:03] yea, that's how i finally found something, noticed that in grep
[15:17:35] I wonder what serializes to this...
[15:17:45] dcausse: that's the fallback serializer
[15:17:47] could it be jackson?
[15:17:51] ah
[15:18:26] dcausse: from JsonRowSerializationSchema.createFallbackConverter, it uses ObjectMapper.valueToTree
[15:19:51] yea we have org/apache/flink/api/common/typeinfo/BasicTypeInfo.class in the fat jar
[15:20:13] * ebernhardson found this all a bit confusing until realizing i should be debugging schema creation instead of actual serialization. Serialization just goes off on wildly different code paths because the schema is different :P
[15:21:16] oh yes that can be disturbing...
[15:23:11] dcausse: so i only see a single instance of BasicTypeInfo in the fat jar, is perhaps the flink operator or something else also providing flink jars?
[15:23:24] hmm, it wouldn't be the operator because the consumer IT fails
[15:23:36] it's provided by flink itself
[15:23:53] in the lib folder
[15:24:10] ahh, so we need to exclude all of flink from the fat jar?
[15:24:16] or just targeted?
[15:24:21] yes that's what I would try
[15:24:35] ok, it's something i can try
[15:25:01] looking at the deps I'm not sure which to exclude/tune, flink-core seems to be coming from different deps we import
[15:25:17] flink-table is one of them
[15:27:03] * ebernhardson starts playing with the exclude hammer
[15:27:09] :)
[15:33:09] hm, 78M for the fat-jar should have rung a bell, the wdqs job is ~36M
[15:43:36] hmm, fat jar is now 29M (consumer), BasicTypeInfo is no longer found (via `jar tf fat-jar.jar | grep BasicTypeInfo`) in the fat jar, but still has the same issue with fallback converters :S
[15:44:27] although it's not clear this test uses the fat jar directly, not sure yet how to verify what exactly happens there
[15:44:35] :/
[15:50:11] you can access and print the classloader from the class itself, it can also show from what path it has loaded the class
[15:54:27] no fun :S typeInfo.getClass().getProtectionDomain().getCodeSource().getLocation() gets the same result as calling it on BasicTypeInfo.class
[15:55:57] private transient JsonRowSerializationSchema schema vs private final TypeInformation fieldsType
[15:56:11] both are serializable
[15:56:27] this is from UpdateElasticsearchEmitter
[15:56:32] I wonder why one is transient
[15:58:06] Looking into it
[15:59:32] I wonder if building JsonRowSerializationSchema eagerly and passing it through serialization could help
[16:00:07] BTW: K8S interest group meets in 30s - today: local k8s testing
[16:00:38] seems interesting :)
[16:46:56] Hm, I was hoping for something more like a silver bullet I guess.
[16:48:05] dcausse: I can refactor the emitter to have no transient properties anymore. Both should be serialisable, so no need to keep them as transient fields.
[16:49:01] ebernhardson: I'm still trying to understand how you made those serialisation issues visible locally.
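For reference, a small sketch of why the fallback converter produces that {"nano":0,"epochSecond":...} shape. The exact bare-mapper behavior is version-dependent: older jackson-databind versions bean-serialize java.time.Instant through its getters, while recent versions refuse outright and ask for the jsr310 module — this sketch hedges for both.

    import java.time.Instant;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.SerializationFeature;
    import com.fasterxml.jackson.datatype.jsr310.JavaTimeModule;

    public class InstantShape {
        public static void main(String[] args) {
            Instant ts = Instant.ofEpochSecond(1669130930);

            // Bare mapper: older jackson-databind emits the bean shape
            // {"nano":0,"epochSecond":1669130930}; newer versions throw and
            // ask for the jackson-datatype-jsr310 module instead.
            try {
                System.out.println(new ObjectMapper().valueToTree(ts));
            } catch (RuntimeException e) {
                System.out.println("bare mapper refused: " + e.getMessage());
            }

            // With the java.time module and timestamp output disabled we get
            // ISO-8601, the format the schema's convertInstant path produces.
            ObjectMapper mapper = new ObjectMapper()
                .registerModule(new JavaTimeModule())
                .disable(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS);
            System.out.println(mapper.valueToTree(ts)); // "2022-11-22T15:28:50Z"
        }
    }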
[16:51:04] pfischer: the Consumer IT always triggers it, the bad values are even written to the fixtures
[16:51:20] pfischer: grep for 'epochSecond' in the repo to find it
[16:51:48] 👀
[16:54:53] but indeed, i've tried quite a bit in a normal test case and i'm not able to reliably trigger this outside the consumer IT
[16:56:32] ebernhardson the mediawiki testing tools you mentioned in the k8s meeting, would that be mediawiki-vagrant or something else?
[16:57:13] inflatador: mw-vagrant and mw-cli, both have the same general purpose. They set up a bunch of infrastructure so that you can test things
[17:17:27] ebernhardson: okay, we have to instruct jackson to serialize timestamps in a different format. I'll fix that.
[17:20:13] lunch/workout, back in time for pairing
[17:20:14] pfischer: doesn't the code already do that though?
[17:20:40] pfischer: in JsonRowSerializationSchema.createConverter it does `if (simpleTypeInfo == Types.INSTANT) { return Optional.of(this::convertInstant) }`
[17:21:20] pfischer: the problem is simpleTypeInfo is BasicTypeInfo, and Types.INSTANT is BasicTypeInfo, and simpleTypeInfo.equals(Types.INSTANT) is true, but simpleTypeInfo == Types.INSTANT is false
[17:22:43] pfischer: note that there are two JsonRowSerializationSchema implementations, one in flink and a second in eventutilities-flink, we are invoking the second
[17:23:33] err, those should both say BasicTypeInfo :)
[17:24:12] * ebernhardson has been looking at the string case too because it fails the same checks and hits the fallbacks
[17:24:31] it's just that the fallback works fine in the String case
[17:24:45] ebernhardson: the payload we create for ES is serialised by an ObjectMapper (jackson), no flink serialisation is involved at that time
[17:25:29] pfischer: the fields do
[17:25:50] We use the ObjectMapper to map a POJO into a Map, that is then passed as the source property to the ES Script DTO
[17:26:32] The Map already has the wrong representation of the timestamp
[17:26:33] pfischer: in UpdateElasticsearchEmitter.toMap we use the JsonRowSerializationSchema
[17:31:21] when i set a breakpoint on getSchema().serialize(fields) i see fields contains an instant, and new String(getSchema().serialize(fields)) contains the bad serialization. getSchema() is returning the JsonRowSerializationSchema, and setting some breakpoints i can see that it hits the code that checks == Types.INSTANT when invoking the JsonRowSerializationSchema builder
[17:32:08] 👀
[17:51:38] hm, I'm able to serialize create_timestamp":"2022-11-22T15:28:50Z" from the IT test but it still passes
[17:53:27] yes, seems like if we make JsonRowSerializationSchema serializable it works
[17:54:49] currently blocked because it carries its own builder that makes it non-serializable, but building it like:
[17:54:55] .withNormalizationFunction((Function<Consumer<ObjectNode>, ObjectNode> & Serializable) (o) -> {
[17:54:57] ObjectNode node = mapper.createObjectNode();
[17:54:59] o.accept(node);
[17:55:01] return node;
[17:55:03] })
[17:55:20] seems to work around the serialization issue
[17:55:33] eventutilities should rather be fixed I think
[17:57:51] huh, so that basically bypasses the custom serializers and uses jackson directly?
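The `(Function<...> & Serializable)` cast pasted above is the standard intersection-type trick for getting a serializable lambda in Java; a self-contained sketch of the mechanism (illustrative names, not the event-utilities API):

    import java.io.*;
    import java.util.function.Function;

    public class SerializableLambdaDemo {
        public static void main(String[] args) throws Exception {
            // A plain lambda's synthetic class does not implement Serializable.
            Function<Integer, Integer> plain = x -> x + 1;

            // Casting to an intersection type makes the compiler emit a
            // serializable lambda (backed by a writeReplace to SerializedLambda).
            Function<Integer, Integer> serializable =
                (Function<Integer, Integer> & Serializable) x -> x + 1;

            try (ObjectOutputStream out =
                     new ObjectOutputStream(new ByteArrayOutputStream())) {
                out.writeObject(serializable); // works
                out.writeObject(plain);        // throws NotSerializableException
            } catch (NotSerializableException e) {
                System.out.println("plain lambda cannot be serialized: " + e);
            }
        }
    }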
[17:58:09] * ebernhardson might not be following :P
[18:09:49] so if i'm following, what happens is the fieldsType in UpdateElasticsearchEmitter goes through a serde cycle before running, so when it tries to build the JsonRowSerializationSchema at runtime the fieldsType no longer has the "real" BasicTypeInfo, but instead probably a new instance created by the deserializer
[18:12:56] Yes, looks like that's the issue.
[18:13:07] dinner, looking into it afterwards
[18:32:13] Hi, FYI i'm upgrading eventgate-analytics to debian bookworm + NodeJS 18. Letting yall know cuz you are the highest volume users of that instance. https://phabricator.wikimedia.org/T347477#9277370
[18:55:26] I think the UpdateElasticsearchEmitter is deserialized using the flink core classloader but then the JsonRowSerializationSchema is constructed in userspace using the user classloader, still not sure I understand why the singletons could be different tho...
[18:59:21] flink is child-first but if the class is already loaded in the parent classloader it should use it...
[18:59:26] dinner
[19:38:21] Fix for event-utilities is on its way. Nifty lambda scope…
[19:43:29] https://gerrit.wikimedia.org/r/c/wikimedia-event-utilities/+/968348
[19:47:53] ebernhardson: if you +2 I can release a new version of the event utilities
[19:50:27] pfischer: sure, looking
[19:51:05] pfischer: lgtm, i was thinking the same about that mapping reference to the builder
[20:00:48] ebernhardson: release is on its way (https://integration.wikimedia.org/ci/job/wikimedia-event-utilities-maven-release-docker/28/), SUP changes are here: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/tree/unnecessary-transient (no PR yet, feel free)
[20:00:53] Calling it a day
[20:00:59] pfischer: thanks!
[20:16:52] afk an hr or so, parent-teacher conference day
[21:14:37] back
[21:15:55] back
[21:17:20] ebernhardson didn't see you on the email list, but I created https://phabricator.wikimedia.org/T349666 for the k8s stuff we talked about earlier today... feel free to add/change/etc
[22:06:40] getting closer, now we are failing with java.lang.IllegalArgumentException: Event already populated with a cirrus document
[22:06:59] not sure yet what that's about
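Recapping the day's serialization diagnosis in one hedged sketch, with hypothetical stand-in types rather than the actual SUP/event-utilities classes: the buggy shape ships the type info through Flink's serde cycle and rebuilds the schema lazily on the task manager, where singleton identity checks like `simpleTypeInfo == Types.INSTANT` no longer hold; the fix builds a now-Serializable schema eagerly, before serialization.

    import java.io.Serializable;

    // Hypothetical stand-ins for TypeInformation and JsonRowSerializationSchema.
    class FieldsType implements Serializable {}
    class JsonSchema implements Serializable {
        JsonSchema(FieldsType t) {} // the == Types.INSTANT checks happen in here
    }

    class HypotheticalEmitter implements Serializable {
        private final FieldsType fieldsType; // deserializes to a fresh instance

        // Buggy shape: rebuilt lazily on the task manager from the deserialized
        // fieldsType, so the schema builder sees copies instead of singletons
        // and falls back to plain Jackson for Instant fields.
        private transient JsonSchema lazySchema;

        // Fixed shape: built eagerly on the client, before the serde cycle; this
        // requires the schema itself to be Serializable (the event-utilities patch).
        private final JsonSchema eagerSchema;

        HypotheticalEmitter(FieldsType fieldsType) {
            this.fieldsType = fieldsType;
            this.eagerSchema = new JsonSchema(fieldsType);
        }
    }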