[09:12:17] hi, wdqs2009 seems unhappy (a bunch of firing alerts), known?
[09:19:04] godog: thanks for the heads-up, will take a look
[09:19:31] cheers dcausse
[09:25:16] will file a task, blazegraph seems stuck since 2023-10-12
[09:27:07] well... I can restart blazegraph, I think we have enough retention in kafka to process the updates
[09:39:00] sigh... kafka consumer offsets are gone so it's re-processing the whole topic :/
[09:48:58] errand+lunch
[10:40:27] pfischer: going to schedule https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/969064 for this afternoon's deploy window, if you have time could you take a look before 3pm CEST?
[12:35:39] Sure. 👀
[12:43:45] thanks!
[12:55:41] o/
[12:57:00] dcausse: BTW: I looked into the non-failing ITs for the ES emitter. Turns out a) ES' bulk API does not expect the whole HTTP body to be valid JSON, but rather expects every line in it to be, and b) wiremock's bodyPattern.equalsJson does not seem to care that the actual body is not valid JSON and ends up considering it a match for whatever reason
[12:57:58] https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/46
[13:00:07] I can get a data-transfer started for wdqs2009
[13:05:08] pfischer: thanks, will take a look
[13:06:01] inflatador: a data-transfer might allow wdqs2009 to get back online quicker indeed
[13:12:01] pfischer: was this file originally created manually or automatically by wiremock?
[13:13:06] This file was created by wiremock (if you run the IT with the env PROXY_ELASTICSEARCH=true)
[13:14:49] and the new version of this file is generated automatically as well? not sure I understand why it's different without a code change
[13:23:25] Ah, my bad. I adapted it manually. I can either automate that or at least document it.
[13:26:50] np!
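The bulk-body behavior described at 12:57 can be illustrated with a minimal sketch: an Elasticsearch `_bulk` request body is NDJSON, so every line is a standalone JSON document while the body as a whole is not valid JSON. The index name and document here are made up for illustration.

```python
import json

# Illustrative bulk actions: (action metadata, document source) pairs.
actions = [
    ({"index": {"_index": "enwiki_content", "_id": "42"}},
     {"title": "Example", "text": "..."}),
]

def to_bulk_body(actions):
    """Serialize actions as NDJSON, one JSON document per line."""
    lines = []
    for action, doc in actions:
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # bulk bodies end with a trailing newline

body = to_bulk_body(actions)

# Each individual line parses as JSON...
for line in body.strip().split("\n"):
    json.loads(line)

# ...but the full body does not parse as a single JSON document,
# which is why a matcher that expects one JSON value should reject it.
try:
    json.loads(body)
    whole_body_is_json = True
except json.JSONDecodeError:
    whole_body_is_json = False
```

This is also why a test fixture for such a request is best matched as a plain string (or line by line) rather than with a JSON-equality matcher.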
just wanted to know, a comment might be sufficient if it's hard to convince wiremock to treat this as a plain string
[14:03:55] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/967229 is updated for the rdf-streaming-updater migration. Looks like it has a merge conflict tho
[14:10:54] OK, fixed
[14:12:37] dcausse: added a comment in the test
[14:12:49] thanks!
[14:12:58] I set the yaml files as "values-${env}-${release}" and did not use any "values-${env}.yaml" files, as it didn't seem like there were too many configs that were appropriate for both commons and wikidata in a specific env
[14:13:52] ryankemper: I can't seem to find the grafana panel with the ES lag metric that started to go bonkers around mid-August. Do you have it somewhere? Thanks!
[14:15:04] I'm trying to get a ranking of wikis based on edit frequency to get a list of wikis to enable for SUP, but I'm somewhat stuck: quarry only lets me query one wiki at a time. Do we have such data available elsewhere?
[14:16:41] off, back for retro
[14:16:50] pfischer: edits meaning "all kinds of updates we cover" or only revision-based ones?
[14:57:25] pfischer: one option might be the hadoop versions of the db, could probably coax some numbers out of the wmf.mediawiki_page_history table
[14:58:28] it will be a historical number, I think that table is populated once a month, but close enough. There is probably some top-level edit number somewhere but not sure where. As a super simple option a bash script could select max(rev_id) from a bunch of wiki databases :)
[15:01:39] perhaps the metric behind https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&viewPanel=44 might cover what you need? (this is if you don't need explicitly revision-based updates but all updates we might ship to elastic)
[15:59:26] workout, back in ~40
[16:04:55] ebernhardson: might you be able to spare some time to talk split-the-graph spark stuff? i added you to a call in case you can make it.
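The "select max(rev_id) from a bunch of wiki databases" idea at 14:58 can be sketched roughly as follows. This is a hypothetical illustration, not the actual script: in-memory sqlite databases stand in for the per-wiki MediaWiki replicas, and only the `revision.rev_id` column name comes from MediaWiki's schema; the sample data is invented.

```python
import sqlite3

# Invented sample rev_ids per wiki, standing in for real replica databases.
sample_data = {
    "enwiki": [1, 2, 3, 1000],
    "frwiki": [1, 2, 500],
    "testwiki": [1, 2],
}

def max_rev_id(conn):
    """Highest revision id in a wiki's revision table: a rough edit-volume proxy."""
    return conn.execute("SELECT MAX(rev_id) FROM revision").fetchone()[0]

def rank_wikis(data):
    """Rank wikis by max rev_id, descending."""
    ranking = []
    for wiki, rev_ids in data.items():
        # In the real setup this would connect to each wiki's db replica;
        # here we build a throwaway in-memory table instead.
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE revision (rev_id INTEGER)")
        conn.executemany("INSERT INTO revision VALUES (?)",
                         [(r,) for r in rev_ids])
        ranking.append((wiki, max_rev_id(conn)))
        conn.close()
    return sorted(ranking, key=lambda kv: kv[1], reverse=True)

ranking = rank_wikis(sample_data)
```

As noted in the chat, max rev_id is cumulative rather than a rate, so it only approximates relative edit frequency across wikis of similar age.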
also interested to learn IDE tips from you and dcausse
[16:19:08] Is there a repository anywhere documenting how Wikidata Query Service puts Redis in the middle of Blazegraph and the frontend?
[16:35:38] dcausse: + ebernhardson: thanks for the sources, graphite looks promising, I'll have to look into extracting the data in a tabular format tomorrow
[16:55:45] sorry, I'm back
[17:13:32] https://phabricator.wikimedia.org/T349848 is up for the SUP throttling discussion
[17:19:33] ryankemper: I started data reloads for wdqs1023 and 1024. 1022 is still in progress. Since the process is so error-prone, I thought it would be good to have multiple attempts
[17:28:24] inflatador: ack
[17:32:18] brouberol: https://grafana.wikimedia.org/d/8xDerelVz/search-update-lag-slo?orgId=1
[17:37:47] dinner
[17:39:11] lunch, back in time for pairing
[17:59:24] hmm, the producer managed to NPE extracting the grouping key from an event. Somehow the Update object has a padId, but the target document has a null page id
[17:59:34] for a redirect update
[17:59:52] s/padId/pageId/
[18:08:57] ack
[18:23:37] If I'm understanding this right, this was deleting a redirect to a deleted page, and that's why there is no page id in the event
[18:24:13] i wonder if we generally handle redlink redirects correctly
[19:03:59] back in ~90m
[19:17:19] (the answer was no :P but it was easy to write a patch)
[20:36:50] back
[21:32:52] Does anyone know what "group" means in the promQL query here? https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&viewPanel=8&forceLogin=&from=1698355034117&to=1698355934117&editPanel=8
[21:33:23] inflatador: that's the kafka consumer group id
[21:33:26] I'm guessing it's referring to the kafka topics that are on the dashboard?
[21:33:56] Ah, do we have one for cirrussearch?
[21:34:22] inflatador: when we create a consumer in kafka we give it a name, kafka uses that name to track offsets and remember the last messages read.
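The key-selector NPE discussed around 17:59–18:23 (a deleted redirect pointing at an already-deleted page, so no page id in the event) suggests a null-safe grouping key. This is a hypothetical Python sketch, not the actual `UpdateEvent.KEY_SELECTOR` code; field names like `targetDocument` and `pageId` are illustrative.

```python
# Hypothetical null-safe grouping-key selector for update events.
# An event for a redirect-delete targeting a deleted page may have
# no target document (or a target document with no page id).
def key_selector(update):
    target = update.get("targetDocument")
    if target is None or target.get("pageId") is None:
        # Redlink redirect case: fall back to keying by page title so the
        # event can still be partitioned (or routed to a side output),
        # instead of blowing up with a null dereference.
        return (update["wikiId"], update.get("pageTitle", "<unknown>"))
    return (update["wikiId"], target["pageId"])

redirect_delete = {"wikiId": "enwiki", "pageTitle": "Old_redirect",
                   "targetDocument": None}
normal_update = {"wikiId": "enwiki", "targetDocument": {"pageId": 42}}
```

The real fix could equally drop or dead-letter such events; the point is that the selector must tolerate a missing target document.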
It's possible to read kafka without a group id (for example, kafkacat) but it's rare in prod
[21:34:24] inflatador: yea
[21:34:40] inflatador: grep the cirrus helmfile stuff for group.id
[21:35:14] ebernhardson: ACK, will do
[21:37:23] Awesome, we have our first working dashboard pane
[21:37:44] still needs a ton of work, but ... https://grafana-rw.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater?forceLogin=&orgId=1&var-datasource=eqiad+prometheus%2Fk8s&var-site=eqiad&var-service=cirrus-streaming-updater&var-prometheus=&from=1698355354615&to=1698356254616&var-k8sds=eqiad+prometheus%2Fk8s-staging&var-opsds=eqiad+prometheus%2Fops
[21:41:04] cool. Those aren't moving because it's currently broken, but hopefully I'll have a fix soon :)
[21:46:54] no rush, I'm just filling stuff out at this pt ;)
[21:49:03] there are more page change events than I had expected, the backlog is 1.8M events and it's only been a day or so
[21:51:53] !!!
[21:52:33] That's for test, French and Italian wp?
[21:52:55] no, that's the whole production stream. All wikis are in one stream, we have to filter from there
[21:53:09] or really, that's how far behind the producer was since it ran into the redirect issue
[21:53:24] deployed a fix, so now it's complaining about a new thing of course :P
[21:53:25] oh yeah, sorry. we talked about that earlier
[22:15:21] hmm, so three fun problems (at least) :P 1) serializer in the producer window blew up, read beyond the end of the record. 2) NPE from UpdateEvent.KEY_SELECTOR, suggesting a null TargetDocument. And 3) NPE from the elasticsearch emitter on null noop hints
[22:16:01] null... the billion dollar mistake :P
[22:18:12] mullin' nulls
[22:48:59] Gym, back in 1hr
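For reference on the "group" question above: in a consumer-lag panel the `group` label is the Kafka consumer group id, so a per-group query looks roughly like the following. The metric and group names here are assumptions for illustration (the actual metric and the cirrus group id would come from the dashboard's panel JSON and the helmfile's `group.id` setting respectively), so treat this as a sketch.

```
# Sketch: consumer lag summed per consumer group and topic.
sum by (group, topic) (
  kafka_burrow_partition_lag{group="cirrus_streaming_updater_consumer"}
)
```

Filtering on the `group` label is what scopes the panel to one application's consumer rather than every reader of the topic.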