[07:15:29] the wdqs updater is misbehaving on codfw and causing wikidata maxlag (eqiad is fine), still unclear why...
[07:29:47] will ask to depool wdqs@codfw while we investigate
[07:33:58] asked in #wikimedia-sre
[07:42:06] Was that the alert warning over the weekend? I noticed but was not sure how to react.
[07:44:51] pfischer: yes, the alert should be RdfStreamingUpdaterHighConsumerUpdateLag
[07:59:43] dcausse: sorry I'm late to the party. Did you get wdqs@codfw depooled? Do you need my help in any way?
[08:02:11] I see the log in #wikimedia-operations, so I assume it's good now.
[08:02:26] gehel: yes, Emperor ran 'confctl --object-type discovery select 'dnsdisc=wdqs,name=codfw' set/pooled=false'
[08:03:48] it seems to be processing all updates as "reconciliations"... this triggers the old updater
[08:08:29] seems to be because of late events...
[08:09:13] mirrormaker perhaps?
[09:06:22] ebernhardson: got a question from Joseph regarding a yarn queue named fifo, it's apparently something we requested some time ago. told him we should no longer be using it since we use airflow for limiting concurrency; please let me know if I'm wrong and he'll keep it
[09:37:06] unsure about the root cause, but I think reconcile events are bubbling up. Many have to be collected from a spark job, and their meta.dt might be assigned too early in the spark job, causing them to be considered late again by the flink job
[09:38:12] processing smaller chunks might help perhaps?
[09:40:22] or set meta.dt right before sending to eventgate...
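[editor's note] A minimal sketch of that second option, assuming hypothetical ReconcileEvent and EventGateClient types that stand in for the pipeline's real classes: stamp meta.dt immediately before posting to eventgate, so however long the spark collection stage takes, the event cannot look late to the flink job.

```java
import java.time.Clock;
import java.time.Instant;
import java.util.List;

// hypothetical stand-ins for the real pipeline's types
interface EventGateClient {
    void post(ReconcileEvent event);
}

class ReconcileEvent {
    Instant metaDt;                              // corresponds to the event's meta.dt field
    void setMetaDt(Instant dt) { this.metaDt = dt; }
}

class ReconcileSender {
    private final Clock clock;
    private final EventGateClient eventGate;

    ReconcileSender(Clock clock, EventGateClient eventGate) {
        this.clock = clock;
        this.eventGate = eventGate;
    }

    void send(List<ReconcileEvent> batch) {
        for (ReconcileEvent event : batch) {
            // assign meta.dt at emission time, not when the event was first
            // collected, so a slow collection stage cannot push the event
            // past the downstream flink job's lateness threshold
            event.setMetaDt(Instant.now(clock));
            eventGate.post(event);
        }
    }
}
```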
[09:57:26] lunch
[13:16:13] inflatador, ryankemper: for when you're online: I've released spicerack with the changes, including the elasticsearch-curator removal. I'm testing all the other changes on cumin2002. When you're able we can test your changes and ensure the existing cookbooks work as expected
[13:19:24] volans: excellent! Will take a look
[13:19:27] o/
[13:21:42] gehel, dcausse: how's everything going w/ wdqs? Can I do anything to help?
[13:25:04] inflatador: I'm still not sure what caused the issue; something weird happened yesterday between 6am and 12:00 UTC that seems to have caused a bunch of reconcile events to be created
[13:27:29] working on a fix to improve how these reconcile events are collected and shipped, but the root cause is still a mystery
[13:28:37] if you hear of anything that misbehaved in kafka land (e.g. mirrormaker) during that time, that might help narrow the troubleshooting
[13:29:47] I vaguely remember a past issue with mirrormaker publishing stuff from kafka jumbo to kafka main
[13:29:57] let me see if I can dig up the incident
[13:31:45] if mirrormaker for kafka-main (eqiad -> codfw direction) broke, that could possibly explain it, but I quickly looked at those dashboards and could not spot anything weird
[13:31:52] https://wikitech.wikimedia.org/wiki/Incidents/2023-09-27_Kafka-jumbo_mirror-makers
[13:32:09] no idea if relevant, that's just what I remember
[13:32:34] yes, could be a similar issue
[13:58:27] pfischer: if you have a moment: https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/1019813
[13:58:53] Trey314159: I'll be 2' late
[13:59:40] gehel: no problem
[14:16:56] * ebernhardson realizes the reason the hourly transfer backfill stopped over the weekend is because the bit that regularly cleans up old data deleted the backfills :P
[14:17:04] pausing that for the moment
[14:22:30] other random things... cirrus-sanity-check is now live, so i polled it for enwiki starting at page 0; somehow pages 3, 5, 6, 7, 8 and 9 are all ghost pages. How does that happen?
[14:23:11] we might start from select min(page_id)?
[14:23:19] but the new updater is starting from 0
[14:23:31] oh, hmm, maybe. looking
[14:24:58] yea that's exactly it, it starts at min(page_id)
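[editor's note] A throwaway sketch of the mismatch being discussed: start the polling loop at the wiki's real first page id rather than 0, since ids below min(page_id) simply don't exist and show up as "ghost" pages. The JDBC wiring and connection URL are placeholders; only the min(page_id) query reflects the conversation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SanityCheckStart {
    public static void main(String[] args) throws Exception {
        // args[0] is a placeholder JDBC URL for the wiki's database
        try (Connection conn = DriverManager.getConnection(args[0]);
             Statement stmt = conn.createStatement();
             // ids below min(page_id) don't exist, which is why polling
             // from page 0 reports the first few pages as ghosts
             ResultSet rs = stmt.executeQuery("SELECT MIN(page_id) FROM page")) {
            rs.next();
            long startPageId = rs.getLong(1);
            System.out.println("start sanity check from page_id " + startPageId);
        }
    }
}
```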
[14:28:11] the flink job for wikidata in codfw started to throw many java.util.concurrent.CompletionException since 12 apr 6am...
[14:28:23] org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException: Could not send message [LocalRpcInvocation(ResourceManagerGateway.registerJobMaster(JobMasterId, ResourceID, String, JobID, Time))] from sender [unknown] to recipient [akka.tcp://flink@10.194.131.45:6123/user/rpc/resourcemanager_3], because the recipient is unreachable. This can either mean that the recipient has been
[14:28:25] terminated or that the remote RpcService is currently not reachable.
[14:29:11] :S
[14:31:21] 10.194.131.45 is the jobmanager...
[14:32:13] so some sort of networking issue. i know they've been doing the calico migration in k8s for some charts, but i'm not seeing any of the flink charts touched yet
[14:33:06] * ebernhardson is always lost when it comes to networks :(
[14:33:19] how weird: this log line comes from the pod flink-app-wikidata-fc7f4588-k9glr
[14:33:31] dcausse@deploy1002:~$ kubectl get pod -o wide | grep flink-app-wikidata-fc7f4588-k9glr
[14:33:33] flink-app-wikidata-fc7f4588-k9glr 2/2 Running 0 40d 10.194.131.45 kubernetes2048.codfw.wmnet
[14:33:54] it's failing to contact itself...
[14:33:58] heh
[14:34:17] but the job's running, I'm puzzled
[14:35:02] well, it misbehaved yesterday, but that could be totally unrelated...
[14:45:11] other random failures... airflow is backfilling hourlies, and every ~20 minutes something is stuck on org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby.
[14:45:25] I remember seeing a patch to get rid of some of the rdf-streaming-updater's permissions now that we use flink-operator?
[14:45:50] doubt it's related, but maybe worth a look
[14:47:18] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1015343 looks like it was reverted anyway
[14:48:51] it was re-applied after: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1018214
[14:49:26] could be?
[14:54:51] Dunno, I guess it would probably affect eqiad too if so
[14:56:58] rolling operation with the new cookbook seems to be working well so far
[14:59:54] ebernhardson: regarding the "fifo" yarn queue, Joseph was remembering that it might have been something we asked for and was wondering if we still use it
[15:01:40] dcausse: hmm, curiously i don't even see fifo mentioned in the history of the old wikimedia-discovery-analytics repo. Pretty sure we don't use it and depend on airflow now
[15:02:01] ok, thanks!
[15:28:59] used trey's idea, 17 wikis to go apparently
[15:29:07] will kick those off in a little bit
[15:58:05] ebernhardson: for the wprov piece, would it be possible to have a follow-on patch to remove the wprov with pushState for users who are on mdot/minerva after their click? i saw it got merged (which is fine), just hoping we can tidy up the url bar and get less copy-paste of those urls. maybe i should get a side chat with you and jon in case he wants to do it?
[15:58:08] inflatador: I am not permitted to edit the document you linked. It looks good to me. Is there any information you need? Currently only the test environment section seems empty
[15:58:26] dr0ptp4kt: hmm, i remember something from timo about the same thing, looking
[15:59:47] pfischer: sorry, I screwed that one up. I just need the test env
[16:00:12] oh, like if that adds a mild perf penalty on mobile web browsers or something? or rather, taking it out? i see he's heading into a meeting right now, otherwise i'd bug him right now here :)
[16:00:15] https://codesearch.wmcloud.org/search/?q=replaceState.*wprov&files=&excludeFiles=&repos=
[16:00:21] hahaha
[16:00:28] usually replaceState, unless you want "Back" to go to the current URL instead of the logical previous page the user was on
[16:00:30] hi Krinkle
[16:00:35] :)
[16:01:18] this one is for related pages clickthroughs
[16:01:20] * Krinkle jumps back in the bushes
[16:02:06] dr0ptp4kt: ohh, i remember what's going on here. search satisfaction cleans it up, but that doesn't load on mobile.
[16:03:04] i guess we could move that bit into some higher-level code that loads everywhere. That code would probably need to expose the wprov to further code like searchsatisfaction
[16:03:24] then we'd have to ensure execution order between search satisfaction init and the wprov bits
[16:12:57] or we can continue the hackiness and have mobile frontend strip it
[16:13:48] here's what i was about to say: "right, i see what you mean, because we can't clobber that value before initFromWprov has had a chance to read the thing. or we could do it with minerva if you're unopposed :P"
[16:15:30] i dunno if there's a thought to make minerva start using regular cirrussearch for its expression of the frontend. i imagine we'll keep special-casing these search forms. preference on how to proceed?
[16:16:32] (point being that if it were hooked up to minerva, yes, it could someday clobber the thing; but if not, then not, theoretically)
[16:18:13] i suspect we are the only ones that actually tried to do anything with wprov, and really the only thing we are doing with it is attaching the info to other events so we don't have to do a backend join. Plausibly mobile frontend can simply strip it and that's good enough
[16:18:20] anything with it in js, i mean
[16:21:54] inflatador: let me know how the tests went and if I can deploy the latest spicerack onto cumin1002 too, or if we need some last-minute fixes
[16:24:58] volans: I'm running a restart of prod codfw now. Assuming that finishes (in ~90m) you should be good to apply. Will hit you up when it's done, or you can spy on my curator tmux window on cumin2002
[16:25:23] I'm getting the phab updates of the cookbook :)
[16:25:33] but didn't know how many tests you had to run
[16:25:40] that's great, fingers crossed
[16:50:20] dinner
[17:01:51] pfischer: I went ahead and trashed the Google doc for async-profiler, seems like it was more confusing that way. The phab task is here and you should be able to edit it; hit me up if not: https://phabricator.wikimedia.org/T362563
[17:37:14] lunch, back in ~40
[18:20:36] back
[18:21:00] volans: we're all good w/ the spicerack fixes, feel free to merge
[18:21:35] ebernhardson: I'm trying to come up with a description of what search previews are for the search slo documentation. I found the following, but when I make that search myself I don't see a UI element that allows me to expand a preview snippet as shown: https://www.mediawiki.org/wiki/Structured_Data_Across_Wikimedia/Search_Improvements#Search_Preview_panel
[18:21:58] ryankemper: hmm, looking
[18:24:46] ryankemper: i think that might only be found now on media search results, where clicking a page brings up more details to the right. That looks to be called quickview (as opposed to preview) in the mediasearch code
[18:24:52] so not 100% sure... hmm
[18:28:39] oh, no, i'm looking at the wrong thing... sec
[18:32:04] ryankemper: it's only turned on for a select set of wikis, see wmgUseSearchView in mediawiki-config. Looks like pt ru id ca no hu nl uk
[18:32:30] ryankemper: visit https://pt.wikipedia.org/wiki/Special:Search?search=~example and it shows itself
[18:32:58] add uselang=en for the english ui
[19:18:41] sorry, been back
[20:06:56] * ebernhardson turns on saneitizer in sup... let's see how it goes
[20:20:11] hmm, nope :P sanity check failed with a missing path in the json response. turning it back off for the moment
[20:25:39] * ebernhardson apparently renamed wikiid to wikiId after making the fixtures
[20:30:08] but you can't just turn it off... because then there is state with no operator. sigh. Sure there is a way to load it, but not sure how yet
[20:53:22] ebernhardson: You mean that flag to allow unmapped state?
[20:53:43] --set app.job.allowNonRestoredState=true
[21:01:39] pfischer: yea i found it after a bit, it's now running normally and i'll ship an updated consumer soon
[21:21:47] fixed the parameter camelCasing, and it looks like it's working. Will work up some related dashboard bits
[21:30:59] sigh... it tried to write a fetch failure, and that failed writing because the stream timestamp is null. will need to figure that out
[21:48:57] inflatador: great, thanks, will do cumin1002 tomorrow
[21:57:55] ebernhardson: What's the stream timestamp? Do you have related logs?
[22:07:51] pfischer: flink keeps a timestamp, separate from the event itself, in its metadata. It turns out there is a second output collect method in the RecordEmitter that takes a timestamp
[22:07:58] i have a patch, just running tests
[22:08:26] the other solution would be to add a timestamp assigner that extracts it from the event. but seems like we might as well set it at the start
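[editor's note] Roughly what those two options look like against Flink's source API; UpdateEvent, RawRecord and SplitState below are placeholder types, not the updater's real classes. SourceOutput's two-argument collect attaches the stream timestamp that the single-argument form leaves unset, and the timestamp assigner is the after-the-fact alternative mentioned above.

```java
import java.time.Instant;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.connector.source.SourceOutput;
import org.apache.flink.connector.base.source.reader.RecordEmitter;

// placeholder types standing in for the updater's real classes
class RawRecord {}
class SplitState {}
class UpdateEvent {
    static UpdateEvent parse(RawRecord raw) { return new UpdateEvent(); }
    Instant eventTime() { return Instant.now(); }
}

class TimestampedEmitter implements RecordEmitter<RawRecord, UpdateEvent, SplitState> {
    @Override
    public void emitRecord(RawRecord raw, SourceOutput<UpdateEvent> out, SplitState state) {
        UpdateEvent event = UpdateEvent.parse(raw);
        // the two-argument collect sets the record's stream timestamp in
        // flink's metadata; the one-argument overload leaves it unset, which
        // is how a downstream write can end up seeing a null timestamp
        out.collect(event, event.eventTime().toEpochMilli());
    }
}

class Watermarks {
    // the alternative: extract the timestamp from the event after the source
    static WatermarkStrategy<UpdateEvent> strategy() {
        return WatermarkStrategy
                .<UpdateEvent>noWatermarks()
                .withTimestampAssigner((event, previousTimestamp) -> event.eventTime().toEpochMilli());
    }
}
```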
[22:09:55] ryankemper et al., I have downtimed all wdqs hosts in codfw for the next 2 days due to T362508
[22:09:55] T362508: WDQS updater misbehaving in codfw - https://phabricator.wikimedia.org/T362508
[22:10:55] https://gitlab.wikimedia.org/repos/search-platform/cirrus-streaming-updater/-/merge_requests/new?merge_request%5Bsource_branch%5D=work%2Febernhardson%2Fsaneitizer-fixup
[22:11:05] err, wrong window :P
[22:14:42] Argh, record is a reserved keyword (since java 17 introduced a struct-like class declared as "record MyRecordType")
[22:15:00] ebernhardson: ran into this before
[22:16:48] oh, indeed. I should have called it event anyway. And i realized i don't need most of the patch, just need to provide the instant. it's up again
[22:46:51] hmm, the problem is going to be that the bad events are already in the state... i suspect the only option is to clear state and set a kafka timestamp?
[23:03:36] looks to be running now, will keep an eye on it.
[23:34:51] Sorry, I missed your question. You can launch it with spec.restoreStrategy=NONE, which will fall back to getting the offsets from kafka
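[editor's note] For context on that fallback, a sketch of how a Flink KafkaSource behaves when there is no savepoint or checkpoint state to restore: it starts from the consumer group's committed offsets. All connection values here are placeholders, not the updater's real configuration.

```java
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.kafka.clients.consumer.OffsetResetStrategy;

class OffsetsFromKafka {
    static KafkaSource<String> source() {
        return KafkaSource.<String>builder()
                .setBootstrapServers("kafka.example.invalid:9092")   // placeholder
                .setTopics("example.update.topic")                   // placeholder
                .setGroupId("example_consumer_group")                // placeholder
                // with no flink state to restore, start from the group's
                // committed offsets (earliest if none have been committed)
                .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();
    }
}
```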