[06:34:21] seems like it passed with 16g + 8g overhead
[06:34:50] seeing 12g overhead in the code, guessing that another partition failed with 8g
[06:36:00] should we change the code and call repartition explicitly on the input to force a shuffle?
[06:59:01] not sure I understand what's happening on the wdqs machines...
[07:09:09] Strange that it is only machines in eqiad, which should have less load...
[07:15:19] 27k reconciliation events in the last 50k events
[07:17:30] this is too much, a reconciliation triggers the old full reconciliation code which is slow
[07:18:01] seeing mirrormaker issues, wondering if it has affected us
[07:21:54] does not seem like wdqs@eqiad was depooled
[07:23:49] Oh, interesting! We should now have enough capacity in codfw, so we should probably have depooled eqiad. Let's check with inflatador and ryankemper if they still want to depool...
[08:44:12] 27k late events in eqiad that were reconciled
[09:01:47] sigh, I think I've figured it out... the wdqs updater experiment running in dse-k8s is sending its problems to the error topics directly on kafka-jumbo, which end up being the same ones that we mirror from kafka-main
[09:03:03] pfischer: 1:1? https://meet.google.com/vkf-mkgd-ywo
[09:05:02] mirrormaker issues last night (main -> jumbo) caused many events to be considered late by the wdqs updater running in dse-eqiad, but these errors were reconciled by the updater running in production for eqiad...
[09:14:18] there's another pile of late events that are going to enter hdfs, which are likely to cause the same issue again
[09:21:59] and of course there's no way to distinguish prod ones vs dse-k8s ones, the datacenter topic prefix is the only criterion used for the two prod jobs and of course the dse-k8s job is using "eqiad"
[09:22:50] this is why the --pipeline flag of the SUP flink jobs is important, something that we did not do in the wdqs updater...
[09:38:32] anyways... I manually marked as "Success" (to skip it) the reconcile@eqiad task for the 2023-09-27T07:00:00 partition, which has 40k late events
[09:38:46] will have to file a few tasks now...
[10:27:29] lunch
[12:21:24] lunch
[13:07:19] dcausse: speaking of the --pipeline flag: I currently map this to emitter_id in the fetch_error schema (which is inherited from the common error schema and described as ‘identifies the entity where a fatal failure occured’). Should I a) rename it and b) add it to the rdf updater?
[13:11:13] pfischer: not sure yet but probably something along those lines, I'm filing a task for the rdf streaming updater, it does not use the common error schema for these 3 streams (fetch_failure, lapsed_action, state_inconsistency) https://schema.wikimedia.org/#!//secondary/jsonschema/rdf_streaming_updater)
[13:13:21] for the SUP emitter_id, seems correct to me unless you think it's going against the intention of the schema
[13:17:46] Okay, thanks! Right now we pass ‘producer’ or ‘consumer’ as the argument but we’d have to make sure that we get more information in there (DC, k8s cluster, etc.) [via helm file]
[13:20:57] I think they should be more "personalized" imo, something like: when you read it you immediately know which flink job is going to have to ingest an event to attempt to fix this issue
[13:21:55] so yes, having the k8s cluster and the DC might be good, and if we end up running tests like what we do for the rdf updater in dse-k8s we can tag them with "test" somewhere in this name
[13:21:59] o/
[13:23:39] o/
[13:24:53] sounds like we had some 'fun' with kafka/streaming-updater
[13:30:51] yes... a failure on the mirrormaker between kafka-main -> jumbo uncovered a nasty scenario in the streaming-updater
[13:31:33] where failures identified by the dse-k8s experiment were fixed by the production job running in wikikube
[13:32:00] and while a few fixups are expected from time to time, here we got way too many
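As a rough illustration of the "more personalized" emitter_id / --pipeline naming discussed above (encoding the job role, DC and k8s cluster, plus a "test" tag for experiments like the dse-k8s one), here is a minimal Python sketch. The helper name and the identifier format are made up for illustration only; they are not the actual SUP or rdf updater convention.

```python
# Hypothetical helper: compose a descriptive emitter_id so that an error event
# immediately tells you which flink job has to re-ingest it. The format
# "<role>@<dc>.<k8s-cluster>[.test]" is an assumption, not an existing convention.
def build_emitter_id(role: str, datacenter: str, k8s_cluster: str, test: bool = False) -> str:
    emitter_id = f"{role}@{datacenter}.{k8s_cluster}"
    return f"{emitter_id}.test" if test else emitter_id

# With something like this, the production job and the dse-k8s experiment
# would no longer both show up as plain "eqiad":
print(build_emitter_id("consumer", "eqiad", "wikikube"))            # consumer@eqiad.wikikube
print(build_emitter_id("consumer", "eqiad", "dse-k8s", test=True))  # consumer@eqiad.dse-k8s.test
```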
[13:53:05] dcausse: hmm, I hate to bug you with something else but I completely forgot we have the DC switchover
[13:54:53] I'm up at https://meet.google.com/aod-fbxz-joy if you want to discuss
[15:52:16] workout, back in ~45
[16:43:31] back
[17:16:14] dinner
[18:04:08] lunch, back in ~45
[18:47:34] back
[19:23:01] does anyone know what the java errors here might point to? https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Mjrk1YoBYI7nmYca-j1I
[19:23:14] based on a quick web search, it sounds like a network connectivity issue
[19:23:56] more context in ticket https://phabricator.wikimedia.org/T347521
[19:24:06] inflatador: RejectedExecutionException is a super generic java error, it comes out of most any concurrent executor
[19:25:30] is this the first error that comes out? looking
[19:26:22] https://logstash.wikimedia.org/goto/ae78bc772ed36f3f47e481d95a69a4d6 looking here but I could be missing something
[19:27:23] and this is from flink-kubernetes-operator, meaning the admin_ng part of things?
[19:28:30] Y... I'm checking it out for g-modena. This is for flink-operator in the prod eqiad env
[19:29:52] looks like the operator failed over at `2023-09-27T04:45:05.377639Z`
[19:30:43] we're not sure the operator is broken BTW... the main thing is that helmfile does nothing when we try to deploy
[19:32:03] might try to force a failover of the operator a la https://phabricator.wikimedia.org/T340059
[19:32:53] yea, I can see in the logs that it seems to have started over there; the closest thing to a suspicious error is 'Unable to update LeaseLock' just before the failure, but who knows what that does :)
[19:34:27] yeah, sounds cluster-y whatever it is ;P
[19:35:37] This also seems suspicious: unhealthy event sources: {flinkdeploymentcontroller={io.javaoperatorsdk.operator.processing.event.source.informer.informereventsource#703644914=informereventsource{resourceclass: deployment}}, flinksessionjobcontroller={}}
[19:37:22] We had issues with kafka and mirrormaker early this EU morning too... maybe related?
[19:38:20] actually it seems 4:31 is when things started failing, and 4:45 is about when things seem to have started working again
[19:41:17] I forced a failover of the operator, but still can't deploy mw-page-content-change-enrich to prod, helmfile just doesn't appear to do anything.
[19:41:25] :S
[19:41:35] checking logs for the mw-page-content-change-enrich namespace now
[19:46:10] ah, here's the original log msg g-modena linked in Slack https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.39?id=Xqro1IoBYI7nmYcaEY_P
[19:46:46] sounds similar. In a quick look around it appears that it's backed by etcd, any chance there were etcd problems?
[19:47:16] not seeing anything relevant in SAL
[19:49:43] not that I know of. I'm web searching around for that error msg. Seems like something like that might happen w/ control plane issues and/or upgrades
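Side note on the 'Unable to update LeaseLock' error: the operator's leader election goes through a Kubernetes Lease object, so one quick, read-only way to see who currently holds leadership and whether renewals stalled around 4:31 is to read that Lease. A rough sketch with the official kubernetes Python client follows; the lease name and namespace used here are assumptions, check the flink-kubernetes-operator helm values for the real ones.

```python
# Read-only sketch: inspect the operator's leader-election Lease.
# Lease name and namespace below are assumptions; adjust to the deployment.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster
coordination = client.CoordinationV1Api()

lease = coordination.read_namespaced_lease(
    name="flink-operator-lease",  # assumed lease name
    namespace="flink-operator",   # assumed namespace
)
print("holder:      ", lease.spec.holder_identity)
print("renew time:  ", lease.spec.renew_time)
print("transitions: ", lease.spec.lease_transitions)
```

A stale renew time or an unexpected holder would line up with the failover seen at `2023-09-27T04:45:05.377639Z`. Forcing a failover (a la T340059) presumably comes down to making the current leader give up this lease, e.g. by restarting the leader pod; the sketch above only reads state and changes nothing.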
[19:50:16] I feel like j-ayme would've mentioned something like that though, he's seen the issue too (convo sprawl between slack and at least 3 IRC channels)
[19:55:48] quick break, back in ~20
[20:23:48] back
[21:36:05] got another partman patch if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/961478
[21:44:15] i still don't know much about it... but i can verify it doesn't contain the mistake we found yesterday :)
[21:46:09] yeah, I was paranoid yesterday because 2 other people had imaging failures right at the same time as my merge
[21:46:37] I pulled out all the config, but it was just a coincidence