[08:31:26] re "serializer in the producer window blew up" is concerning, might mean we made a non-compat change to some POJOs
[08:32:11] or a bug in wmf eventutilities
[09:04:08] dcausse: thanks, I'll look into it
[10:16:19] lunch
[13:20:05] o/
[13:27:45] ebernhardson: I just ran a deserialisation test against the last ~2 million messages on mediawiki.page_change.v1 and none of them failed: {REV_BASED_UPDATE=1989940, PAGE_DELETE=14357, REDIRECT_UPDATE=4617}
[14:19:53] ebernhardson: fix for the emitter NPE should be this: https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/49
[14:23:30] I assumed that the serialization issue might have happened because we read an old checkpoint/savepoint after making a schema and/or POJO change
[14:29:14] So we would be able to work around that by instructing the flink producer explicitly not to pick up a snapshot but rather start from scratch.
[14:35:48] yes, we can always discard the state to unblock things, but it'd be nice to understand what caused it I think
[14:52:56] not sure I can find the problem looking through logs, seeing java.lang.NoClassDefFoundError: org/apache/hc/core5/http/impl/nio/AbstractHttp1StreamDuplexer on the consumer tho
[14:54:27] errors in the last 24h: https://logstash.wikimedia.org/goto/f70db2db3b8b27d7c21487c0094bcf9c
[14:55:42] ah this one: https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.43?id=IXYHbosB2F9ZGV9iacq-
[14:56:17] \o
[14:56:52] o/
[14:56:59] o/
[14:57:17] that's actually not the same place, that's yet another place that had a serialization issue :)
[14:57:28] ah :)
[14:57:33] could be the same issue though? seeing if i can find the error i was looking at
[15:01:42] dcausse: how do you get those logs? the k8s applications dashboard doesn't want to show them to me
[15:04:43] or i mean, i see a couple logs in the k8s app dashboard, but only ~100 for the last 24 hours, so not all of them
[15:06:46] huh, apparently the k8s app logs dashboard queries `logstash-k8s-....` and yours is `ecs-k8s-...`
[15:07:01] ebernhardson: I go to https://logstash.wikimedia.org/app/dashboards#/list?_g=h@c823129 and search "app logs - ECS"
[15:07:11] yes
[15:07:36] it's still mixed up, hopefully all k8s apps will write to ECS someday
[15:07:38] somehow a small bit of our logs do end up there though, curious
[15:07:47] of our own logs?
[15:07:52] well, flink stuff
[15:07:56] hm...
[15:08:05] maybe it's k8s about flink, sec
[15:08:08] might be because it does not respect the ECS format?
[15:08:22] well, things like "WARNING: An illegal reflective access operation has occurred" from flink-app-producer-taskmanager-1-1
[15:08:27] ah
[15:08:58] yea, others look similar, so these are perhaps logs that failed json parsing?
[15:09:06] these ones are from the jvm and do not go through log4j2, and they're plain strings, not json
[15:11:34] bah, error.stack_trace is unindexed so we can't get a list of all exceptions?
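(Editor's note: the deserialisation test mentioned at 13:27:45 isn't shown in the log. Below is a minimal sketch of that kind of smoke test, assuming a plain String-valued Kafka consumer and JSON payloads; the broker address, the change-kind field name, and the single-poll loop are illustrative assumptions, not the actual cirrus-streaming-updater test, which binds onto the real event POJOs via wmf eventutilities.)

    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    /**
     * Reads recent messages from the page_change topic and checks that each
     * one still parses, tallying counts per change kind as in the log above.
     */
    public class PageChangeDeserializationSmokeTest {

        // Assumed field name; the real test binds onto the event POJOs instead.
        static final String CHANGE_KIND_FIELD = "page_change_kind";

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumption: broker address
            props.put("group.id", "page-change-smoke-test");
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            ObjectMapper mapper = new ObjectMapper();
            Map<String, Long> perKind = new HashMap<>();
            long failed = 0;

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(List.of("mediawiki.page_change.v1"));
                // Single poll for brevity; a real run would loop until caught up.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(10));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        JsonNode event = mapper.readTree(record.value());
                        String kind = event.path(CHANGE_KIND_FIELD).asText("UNKNOWN");
                        perKind.merge(kind, 1L, Long::sum);
                    } catch (Exception e) {
                        failed++; // any message that no longer parses counts here
                    }
                }
            }
            System.out.println("per kind: " + perKind + ", failed: " + failed);
        }
    }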
[15:13:50] the dashboard is not great for this IIRC, looking via "explore"
[15:15:38] filtering for the error.type field existing seems a reasonable proxy
[15:16:15] in "App Logs - ECS (Kubernetes)" you can open one event and click "Toggle column in table" on error.stack_trace, that should make it visible
[15:16:33] and yes, adding error.stack_trace:* in the filter should narrow the search space
[15:16:55] when i tried error.stack_trace i got errors from shards saying error.stack_trace is unindexed
[15:19:03] hmm, i'm not seeing the specific bits i remember seeing in the flink ui, in particular it showed the window as receiving, but the exceptions i see here suggest it's the prior operator sending to the window, instead of the window receiving
[15:19:13] but perhaps i was mistaken, shrug :)
[15:21:56] ah, this one: https://logstash.wikimedia.org/app/dashboards#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-k8s-1-1.11.0-6-2023.43?id=Mon2bYsBC5yW6bzk-B_9 ?
[15:23:00] hm no, does not mention the window
[15:23:00] you're much better at this than i am :) yea, that looks like it
[15:23:29] it does in the message: TumblingEventTimeWindows (1/1)#6 (e3977b2565cc746f512e35521000b9d9_8dc1b516f5d75ef95a8ed072781051bf_0_6) switched from RUNNING to FAILED with failure cause:
[15:23:46] oh indeed
[15:24:03] the stack has PojoSerializer so might be something from us
[15:24:50] so, we actually have 3 serialization schemes for UpdateEvent? :)
[15:25:13] the internal serialization, the between-job serialization, and the elastic serialization
[15:26:27] yes
[15:26:53] if this stack is about UpdateEvent then it's about the internal serialization
[15:27:11] done via flink's Pojo support
[15:27:27] failing in a list of strings here
[15:27:28] yea, seems like it. tbh i have no clue where to go with that kind of thing, generate serialized versions in the prior code base and try to deserialize them with the new one?
[15:28:00] do you remember if it happened right after an upgrade?
[15:28:04] but it wasn't clear we are providing our own serializers here, or at least i didn't see where
[15:28:49] dcausse: it would have been after shipping a container update, lemme double-check the repo log
[15:29:03] also, things might be different if running from a checkpoint or from a savepoint
[15:29:16] the deploy would have been v20231025192434-a632a99..v20231026212721-18aa693
[15:29:59] if from a checkpoint, I'd have to check, but it's possible that flink does not go through a state migration
[15:30:25] from the git log, i'm not seeing model field changes
[15:31:07] it means that https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/44 might not work if applied over a checkpoint
[15:32:07] but I might be wrong, perhaps flink does do state migration over a checkpoint
[15:32:17] hmm, perhaps. we can certainly just clear the state at this point, but I wonder how we recognize these changes (from a test?) so we know before deploy
[15:34:04] when UpdateEvent.current.snapshot.bytes changes, that's an indication that the state might have to go through state migration
[15:34:40] also, once running in production we should no longer allow making changes to UpdateEvent.v1.bytes
[15:35:07] ok
[15:35:08] we should add a new UpdateEvent.v2.bytes blob
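(Editor's note: a sketch of the fixture test being discussed, assuming Flink's POJO-derived serializer: serialize a sample UpdateEvent and compare the bytes against a checked-in UpdateEvent.v1.bytes blob, so a wire-format change fails a test before deploying over existing checkpoint state. The UpdateEvent fields and fixture path below are hypothetical stand-ins for the real cirrus-streaming-updater model.)

    import org.apache.flink.api.common.ExecutionConfig;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.common.typeutils.TypeSerializer;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.core.memory.DataInputDeserializer;
    import org.apache.flink.core.memory.DataOutputSerializer;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.Arrays;
    import java.util.List;

    public class UpdateEventSerializerFixtureTest {

        // Hypothetical minimal POJO; the real UpdateEvent has many more fields.
        public static class UpdateEvent {
            public String pageTitle;
            public List<String> redirectTitles; // "failing in a list of strings here"
            public UpdateEvent() {} // flink's POJO support needs a no-arg ctor
        }

        public static void main(String[] args) throws IOException {
            // Same serializer flink derives for internal state, via POJO support.
            TypeInformation<UpdateEvent> typeInfo = TypeExtractor.createTypeInfo(UpdateEvent.class);
            TypeSerializer<UpdateEvent> serializer = typeInfo.createSerializer(new ExecutionConfig());

            UpdateEvent event = new UpdateEvent();
            event.pageTitle = "Main_Page";
            event.redirectTitles = Arrays.asList("Main", "Home");

            DataOutputSerializer out = new DataOutputSerializer(256);
            serializer.serialize(event, out);
            byte[] current = out.getCopyOfBuffer();

            // Assumed fixture location; regenerate deliberately only when
            // introducing a new versioned blob (v2), never by editing v1.
            Path fixture = Path.of("src/test/resources/UpdateEvent.v1.bytes");
            byte[] expected = Files.readAllBytes(fixture);

            // Differing bytes mean the POJO (or its derived serializer) changed,
            // so checkpoint/savepoint state may need a state migration.
            if (!Arrays.equals(expected, current)) {
                throw new AssertionError("UpdateEvent wire format changed; "
                    + "add a new UpdateEvent.v2.bytes instead of editing v1");
            }

            // Also verify the checked-in blob still round-trips with the
            // current serializer.
            UpdateEvent decoded = serializer.deserialize(new DataInputDeserializer(expected));
            System.out.println("fixture decodes to page: " + decoded.pageTitle);
        }
    }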
[15:39:40] dcausse: there's a config key 'mediawiki_max_concurrent_requests' that appears in the dse-k8s experiment, but I don't see it anywhere else. Do we need that for the prod app?
[15:40:19] inflatador: yes, this one is pretty new and related to T346456
[15:40:20] T346456: Improve concurrency limits configuration of the wdqs updater - https://phabricator.wikimedia.org/T346456
[15:40:43] I only tested ^ in dse-k8s
[15:42:25] mediawiki_max_concurrent_requests should replace wikibase_repo_thread_pool_size
[15:43:33] hm indeed, not seeing the pojo change between v20231025192434-a632a99..v20231026212721-18aa693 :/
[15:48:08] ACK, will add
[15:59:14] dcausse: when I decode the v1.bytes with the current type info (containing the fetched_at flag) it deserialises w/o complaints
[16:04:34] OK, I think https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967229 is ready for review again. Feel free to wait 'till Monday, though ;)
[16:06:02] workout, back in ~40
[16:53:02] back
[17:20:51] ebernhardson: just merged the NPE fix for the emitter. Wasn't able to reproduce the flink-type-serialization issue though.
[17:22:14] pfischer: thanks! I was just looking at the incorrect fixture event you pointed out, indeed i have the wrong one there. Pondering what to do with the actual event; creating a page that points at a redlink (should be rare, but possible) comes with is_redirect: false
[17:25:48] i guess not terrible, but it means we will start indexing redirects to redlinks as normal pages. In a way that seems plausible, since the target page doesn't exist and this will ensure it shows up in autocomplete
[17:26:25] lunch, back in ~40
[17:27:27] oh :S is_redirect is also false when creating a new redirect
[17:28:06] to a valid page; both valid and invalid targets get that, and have no info about the target page
[19:32:29] ebernhardson: was looking at https://phabricator.wikimedia.org/T346373 , will we need to install the conda-analytics pkg on the new search-loader VMs?
[19:32:44] inflatador: no, it should all be baked in
[19:33:34] ebernhardson: OK, so scap should set up virtualenvs and whatnot?
[19:33:48] inflatador: basically it unpacks the zip and it's all prepared
[19:34:49] ACK, I'm working on the puppet code for the new loaders, so far I've just changed ensure-packages to use 'python3' instead of 'python3.7'...if that's wrong or I need to do anything else LMK
[19:50:19] PCC is failing on some WMCS bgp stuff...hmm
[19:51:10] rechecked and it works /shrug
[19:54:12] nothing can ever be easy...looking into this redirect problem, in my local dev env is_redirect is set appropriately. Editing pages on testwiki via the api it is not set. Performing the same api edit via shell.php on mwdebug2002 with a debugger attached...works :P
[19:58:46] (╯°□°)╯︵ ┻━┻
[19:59:01] ryankemper: got a PR for the search-loader stuff if you have time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/969386
[19:59:20] looking
[19:59:20] I'm out for probably the rest of the day, gotta take my son to music lessons
[20:04:13] Interestingly, based on https://gerrit.wikimedia.org/r/c/operations/puppet/+/969386/1/modules/profile/manifests/mjolnir/kafka_msearch_daemon.pp we've been using the wrong endpoint for the search loader for some time (the default of `localhost:9200` instead of `https://search.svc.%{::site}.wmnet:9243`, which is in hiera)
[20:04:35] A little confused how the search loader hosts have been working at all, then, tbh
[20:04:42] huh, that should still work as long as it's http i suppose
[20:04:46] http 9200 and https 9243
[20:05:04] oh, localhost
[20:05:06] hm
[20:05:06] that's what i thought at first, but these daemons don't run on the elasticsearch hosts
[20:05:09] yeah
[20:05:24] doesn't seem possible it's broken, at a minimum other teams would notice
[20:05:37] for weighted tags and such
[20:07:31] ryankemper: looking at `ps` on search-loader1001, the right value makes it into the command line somehow: --es-clusters https://search.svc.eqiad.wmnet:9243
[20:07:56] Yup, seeing the same in /lib/systemd/system/mjolnir-kafka-msearch-daemon@.service
[20:10:43] The same variable (`es_cluster_endpoint`) is also looked up (properly) in `modules/profile/manifests/mjolnir/kafka_bulk_daemon.pp`, could that somehow have overwritten the value for the `msearch-daemon`? Seems hard to believe, but something must explain the wrong value not being present
[20:13:22] Or maybe that's just how erb templating works? The templates reference `<%= @es_cluster_endpoint %>` and maybe that value is directly read from hiera rather than coming from the puppet class variable
[20:15:12] Anyway, none of this is all too important, but just kind of interesting
[20:15:28] school run, back in a bit
[21:18:03] back for a whopping 45m