[07:36:54] ebernhardson: I'll ask Joseph but iirc there are two layers of access in the datalake, one is allowed for private data but not the other
[07:37:27] I think everything that's not pii should def be world readable imo
[07:40:11] re rdf subgraph metrics, I got one question from someone external to the team a couple months ago "Does anyone know how one would be able to query discovery.general_subgraph_metrics ?" was very surprised... so I guess yes we should probably include them :)
[07:40:52] we also made those world-readable
[09:32:31] errand+lunch
[13:14:27] o/
[14:11:51] \o
[14:12:49] dcausse: thanks!
[14:12:59] o/
[14:13:13] ebernhardson: so Joseph asked me to file a task for this
[14:14:22] i was guessing everything in datahub is private-allowed, since it has the SSO sign-in in front of it
[14:15:54] oh, but i guess there is the distinction of whether they have the analytics-privatedata-users group
[14:17:04] basically saying world readable is obviously OK for non PII, for PII datasets we need more discussion but possibly we can leverage hdfs acl to add more fine-grained group perms to our parent folders (https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html)
[14:17:50] or make everyone part of disco-privatedata but that's less ideal
[14:18:09] yea makes sense, i'll read up on that today and see what makes sense
[14:18:54] he said that it's possible that subfolders will inherit those so ideally we can set them only once?
[14:20:52] i was curious about that too, otto said before subdirs should inherit parent dir permissions, in a quick test using `hdfs dfs -mkdir ...` that didn't quite seem to be generally true, a parent dir that's 777 didn't cause subdirs to be 777, but certainly the other way around (more restrictive parent dir) would work
[14:21:06] maybe spark is different, i should test that directly i suppose
[14:23:01] but if we set the privatedata limits on the parent dir of each dataset, we should then be able to apply the config that makes newly created dirs be 777 and be ok
[14:27:28] oh haven't thought about that
[14:27:43] will try to get that ticket created
[14:56:34] Won't be at search retro today
[14:59:35] gehel: i'll be a couple minutes late to the start of retro, brief break following the meeting that just concluded
[15:49:59] random thoughts from a thread in slack, it seems it would be useful for image suggestions if they could have some sort of search engine keyword that filtered for pages without images
[15:50:17] maybe a simple image count as an indexed field, although i don't know if the count has other use cases
[15:50:23] 💡
[15:50:24] or indexing the image names themselves
[15:51:51] i imagine adding an image count is reasonably easy, although i haven't checked.
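A rough sketch of the HDFS ACL idea discussed above, assuming a hypothetical disco-privatedata group and an example dataset path; the "default:" entry is what makes newly created subdirectories inherit the extra group permission, per the HdfsPermissionsGuide linked above:

    # grant an extra group read access to an existing tree (group name and path are illustrative assumptions)
    hdfs dfs -setfacl -R -m group:disco-privatedata:r-x /wmf/data/discovery/some_private_dataset
    # add a matching default ACL on the parent dir so subdirs created later inherit the same entry
    hdfs dfs -setfacl -m default:group:disco-privatedata:r-x /wmf/data/discovery/some_private_dataset
    # check what ended up applied
    hdfs dfs -getfacl /wmf/data/discovery/some_private_dataset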
[15:53:14] i could also imagine though that the needs are a little more strict than "image or not"...the size or location of the image might be relevant, unknown
[16:01:05] workout, back in ~40
[16:01:41] sure I don't see why not, the devil might be in the details I suppose, say there's a popular template that includes a small image (like a small icon) you might perhaps not be interested in those
[16:03:24] or perhaps simply indexing the "page image" of https://en.wikipedia.org/wiki/Image?action=info
[16:04:09] yea, perhaps indexing the page image
[16:10:02] re hdfs perms, apparently perms under /wmf/data/discovery/ are not very consistent
[16:10:27] some are world-readable some not
[16:10:50] yea it's quite inconsistent and i'm not entirely sure why, part of why i was thinking to clean it up and make it all the same
[16:11:01] some are group writable some not, not sure how we ended up in that situation
[16:11:10] yes...
[16:14:45] looking into T331127, seems like the page move was not marked as page_change_kind: move but "edit" for the "Citrivescence moved page [[User:Citrivescence/Asher Perlman]] to [[Asher Perlman]]: Move to mainspace"
[16:14:46] T331127: phantom redirects lingering in incategory searches after page moves - https://phabricator.wikimedia.org/T331127
[16:15:13] hmm, not good :S
[16:15:13] so the cleanup we do in the sup is not going to work...
[16:19:46] i suppose we need to understand how those events are made, and fix event generation?
[16:20:25] maybe some sort of test suite that edits via api and verifies the events we need come out the other end, but maybe that's a bit much
[16:24:43] yes... I'm sure I've seen page_moves at some point tho...
[16:25:14] the wdqs had a test suite making edits to test.wikidata but we removed this, it was not super reliable :)
[16:28:16] i don't know if it's definitive, but checking 100k events from page_change.v1 not 1 matches jq 'select(.changelog_kind == "move")'
[16:28:39] i guess i can let it check the topic from the beginning
[16:29:37] :/
[16:29:42] up to 1.3M, no matches
[16:30:49] oh wait, we use page_change_kind, rechecking
[16:31:02] that has matches
[16:31:03] something broke recently or I hallucinated...
[16:31:08] ah
[16:32:05] yea, 25 moves in the last 10k events, so they come through
[16:33:43] maybe there is something special about cross-ns moves? Shouldn't be though
[16:38:16] and it should have an event for the redirect that it's leaving behind, if I'm not mistaken, not finding this event either filtering on request_id...
[16:38:42] dcausse: they might not leave a redirect, that's a move option. Moving from user namespace to main i could see the user unselecting it
[16:39:08] here it's there https://en.wikipedia.org/w/index.php?title=User:Citrivescence/Asher_Perlman&redirect=no
[16:39:28] i just did a test edit on test.wikipedia.org, move from User:EBernhardson (WMF)/Test to the main namespace and it emitted the move and create events
[16:39:44] meh
[16:40:42] seems like it's going to be fun to debug :/
[16:41:22] indeed
[16:41:51] back
[16:45:33] are there other ways to move pages? I would hope not...
[16:47:30] checking logstash for the request id of the move, only one log. "Inconsistent revision ID" from ParserCache
[16:47:45] weird...
[16:47:57] perhaps Draft -> NS_MAIN has some special tools?
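A rough sketch of the page_change_kind filtering described above (counting move events in the stream); the broker and the exact topic name are assumptions, not taken from the conversation:

    # KAFKA_BROKER and the topic name are placeholders / assumptions
    kafkacat -C -b "$KAFKA_BROKER" -t eqiad.mediawiki.page_change.v1 \
        -o beginning -e -q \
      | jq -c 'select(.page_change_kind == "move")' \
      | wc -l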
[16:48:10] dcausse: curiously, the referer on that was https://en.wikipedia.org/w/index.php?title=Asher_Perlman&action=submit and the url was /w/index.php?action=submit&title=Asher_Perlman
[16:49:38] but the comment is absolutely "Citrivescence moved page [[User:Citrivescence/Asher Perlman]] to [[Asher Perlman]]: Move to mainspace"
[16:49:39] how is this possible?
[16:50:07] I suppose the comment could be manually written...
[16:50:36] but the page_id is the same for events prior to the move
[16:50:56] i'm not sure how that's possible
[16:51:16] so this must be a page_move not a "manual" move copy/pasting the draft content into a new page
[16:53:03] oh, actually i'm looking at the wrong one. That comment was in the prior_state field...
[16:53:44] oh
[16:54:11] so that was the event immediately after the move, i don't actually see an edit that has the move comment
[16:55:25] checking for events matching revid 1244287609
[16:55:47] yes so that's "simply" a missed event
[16:56:57] looking again with a direct grep for that number out of kafkacat...but i don't have it in my set of events that contained Asher_Perlman
[16:57:44] yea, i only see it in rev_parent_id, not as rev_id
[16:57:57] looking at other streams
[16:58:08] heads up ebernhardson dcausse pfischer i'll plan to file some tasks on data spelunking for unsatisfied search and low-ctr related articles stuff
[16:58:22] thx!
[16:59:36] i'm not sure what we do about missing events...but they will be a significant issue for us
[17:00:03] ottomata: if you have ideas? ^^
[17:00:27] i suppose nicer might be to summarize in the ticket
[17:05:12] looking at "Unable to deliver all events" in logstash is a bit worrisome
[17:05:34] OK, the calico patch is ready...following Balthazar's advice of hard-coding the networkpolicy selector https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1072236/3/charts/flink-app/templates/networkpolicy.yaml#91
[17:07:38] dcausse: curiously though, there isn't one of the logs at 2024-09-06T04:56. But there are >5k of them that day :S
[17:07:46] true...
[17:07:49] oops, patchset 4 has the comment v
[17:07:51] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1072236/4/charts/flink-app/templates/networkpolicy.yaml
[17:08:45] no event in revision-create
[17:14:05] nothing in page-move either, I guess we only have webrequest to attempt to find the request_id?
[17:15:18] in logstash i found reqId bbebe293-bf71-43a9-ab67-1c8e12e2453b for Special:PageMove referred from the source page, but the only log is an empty cache key
[17:15:42] err, Special:MovePage
[17:15:47] ok
[17:20:49] ebernhardson: searching this req_id I find things like "Cannot execute Wikimedia\Rdbms\Database::commit critical section while session state is out of sync. A critical section"
[17:21:57] but at 2024-09-06T07:00:04+00:00..., referer https://en.wikipedia.org/wiki/Special:MovePage/User:Citrivescence/Asher_Perlman
[17:22:36] dcausse: hmm, maybe i'm not looking in the right spots, i'm on the general `mediawiki` log dashboard
[17:23:11] ebernhardson: perhaps the time selector is too narrow?
[17:23:20] oh maybe, checking
[17:23:31] that error is like 2 hours later
[17:24:01] ahh, so yes now seeing things. 2 hours later would suggest a job to me, since we propagate the reqId into those
[17:24:34] but these are from mw-web
[17:24:38] yes...
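A sketch of the "direct grep for that number out of kafkacat" check mentioned above, looking for revid 1244287609 across a few candidate streams; the broker and topic names are assumptions:

    # KAFKA_BROKER and the topic names are placeholders / assumptions
    for topic in eqiad.mediawiki.page_change.v1 eqiad.mediawiki.revision-create eqiad.mediawiki.page-move; do
        echo "== $topic =="
        kafkacat -C -b "$KAFKA_BROKER" -t "$topic" -o beginning -e -q | grep -c 1244287609
    done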
[17:24:43] same pod
[17:25:12] I wonder what is setting the timestamp in logstash
[17:25:19] that is really weird
[17:25:41] The edit was marked in mw at 4:51 UTC
[17:25:49] I doubt the php request has lived for 2 hours
[17:25:53] indeed
[17:26:06] so I wonder if it's logstash being late ingesting?
[17:26:26] maybe, we might need to ask o11y
[17:26:58] i was hoping i could find that in jaeger (although i think it's only sampling?) but no luck
[17:27:26] not sure how to search jaeger for a reqId
[17:38:24] logs landing late seems unlikely, the server was emitting logs the whole time
[17:38:45] could deferreds have got stuck for 2+ hours? also seems highly unlikely
[17:42:29] * ebernhardson ponders how to tell if this is something that happens in other places, without blowing up the logstash cluster with crazy aggregations
[17:44:49] :)
[17:45:32] posted a question in the service-ops channel
[17:45:45] i also asked in #mediawiki-core
[17:46:07] I guess I should join this channel :)
[17:46:32] i should apparently join service-ops :) I'm not even sure where that is
[17:46:47] should be #wikimedia-serviceops
[17:47:02] ahh, thanks
[17:50:45] just sampling reqIds from EmergencyTimeoutException, found reqId "a8ea1282-dc5c-4265-a35d-38374f2731a5" which logged at 5:49, then 7:45 from same server and reqId
[17:51:17] looks like an api edit
[17:52:08] I wonder what's responsible for killing long php requests, perhaps it has some safe-guard if it's some "critical" section
[17:52:31] i wonder if deferreds are also related, since those run after the request is closed
[17:53:59] reading https://www.mediawiki.org/wiki/Excimer
[17:54:45] "If an Excimer timer, or profile sample period, elapses while the engine is busy running a C function (such as mysqli_connect), then your callback or profile sample will take place immediately after that function completes"
[17:55:21] sounds possible
[17:55:39] so was perhaps waiting in a mysql call
[17:58:32] dinner
[17:58:33] they aren't super common, but found another at reqId "bb7130ad-59a0-4075-939b-0c75c35cb2ce". log at 1:55, further logs at 4:25. again same host, same reqId
[18:05:45] lunch, back in ~40
[18:55:28] back
[18:56:29] not having great luck tracking down anything more interesting. Could maybe adjust EmergencyTimeoutException to report the actual time, instead of only the limit, which might make it easier to track down requests that time out long after they should have, but that's not really solving anything
[19:15:58] ebernhardson: jaeger is 0.1% sampled on most mediawikis but if you have a set of reqIds you know I can help you track them down in traces if they exist
[19:23:43] cdanis: i have three, but i could probably find more if i wrote some script to pull them out of logstash. It sounds like finding 3 with 0.1% sampling would have to be very lucky :)
[19:24:47] 🍀
[19:24:49] the reqIds i have are bb7130ad-59a0-4075-939b-0c75c35cb2ce a8ea1282-dc5c-4265-a35d-38374f2731a5 and bbebe293-bf71-43a9-ab67-1c8e12e2453b . I suppose what i'm trying to figure out is if these requests really got stuck in some mysql connection for 2 hours
[19:25:00] (also not sure if sql is in jaeger yet)
[19:25:05] probably?
[19:25:27] sql is not, sadly
[19:25:42] aww, then it probably wouldn't apply here anyways. Appreciate noticing though
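One low-effort way to check a single suspicious reqId without any heavy aggregation, as pondered above: pull every log entry for that reqId and look at the spread of @timestamp values. This assumes direct query access to the OpenSearch cluster behind logstash; the endpoint and index pattern below are placeholders, not real names:

    # LOGSTASH_ES and the index pattern are placeholders / assumptions
    curl -s "$LOGSTASH_ES/logstash-*/_search" -H 'Content-Type: application/json' -d '{
      "size": 500,
      "sort": [{"@timestamp": "asc"}],
      "query": {"match_phrase": {"reqId": "bb7130ad-59a0-4075-939b-0c75c35cb2ce"}}
    }' | jq -r '.hits.hits[]._source["@timestamp"]'

A multi-hour spread of timestamps for a single web request is the smell being chased here.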
[19:26:02] as for hours later, no, that's not possible -- but I am pretty sure that jobqueue jobs keep the reqId of the user-facing request that enqueued them
[19:26:20] cdanis: these are web requests that logged messages, and then 2+ hours later logged more messages about timing out
[19:26:28] shouldn't be possible, but yet that's what the logs say
[19:27:01] https://logstash.wikimedia.org/goto/f5dff72f669ce0fecad15aa8e8b46022 this is the link, right?
[19:27:43] yea that's the first one, it's an edit request that created an edit at 4:51, logged one message, then logged timeouts at 7:00
[19:28:23] i looked for the other two just as proof that this wasn't one-off
[19:28:24] uhhhh hmm
[19:28:37] mw-web.eqiad.main-f7d55ccd6-cnxhg
[19:28:39]
[19:28:41] yeah okay 😬
[19:29:02] yea i was suspicious the reqId randomly got reused, but seems unlikely
[19:29:11] at least, not multiple times in different requests
[19:29:24] and not with the same url, referrer, etc
[19:29:42] so I actually *don't* know if there's any timeouts on the post-send processing in mediawiki
[19:30:05] which is what I'm imagining this must be
[19:30:12] yea, that makes the most sense to me as well
[19:31:26] for us, the bug is that the post-send actions didn't happen, which causes search index inconsistency, but it seems the root cause of having requests hang open for hours is also troubling
[19:32:41] yeah
[19:34:42] any idea where to even route this kind of problem? I probably need to create a dedicated ticket but not sure who is responsible
[19:34:54] > The library provides a critical section concept. If a critical section is active, timeouts will be queued, rather than immediately thrown. The timeout exception will be thrown once no critical section is open.
[19:35:21] (https://gerrit.wikimedia.org/g/mediawiki/libs/RequestTimeout)
[19:35:46] so I'm gonna guess that, if a DeferredUpdate can hold a long-running critical section, maybe this is what's happening
[19:36:43] ebernhardson: I would guess probably MW Platform Team (+serviceops for fyi) are the best contacts for this
[19:37:33] yea we were guessing similar re request-timeout, that also uses excimer which in its docs says it can't issue a timeout if php is in a C function, only when running php code directly, so it is plausibly in a mysql connection (afaik the only thing using CriticalSection)
[19:37:46] ah yes
[19:37:52] that is also true
[19:38:58] well, i'll try and collect better information, also trying to write up something to get a better idea of how often this happens. I sampled ~30 EmergencyTimeoutExceptions and 3 showed this behaviour, but that's a small sample
[19:49:44] btw to check traces you can do a query like this: https://trace.wikimedia.org/search?end=1726170577882000&limit=20&lookback=2d&maxDuration&minDuration&service=mw-web&start=1725997777882000&tags=%7B%22guid%3Ax-request-id%22%3A%22bb7130ad-59a0-4075-939b-0c75c35cb2ce%22%7D
[19:50:10] I didn't see any hits for the 3 you cited, not surprising
[19:51:01] if you want to do something automated, the backing data is just opensearch, and https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070920 should give you some ideas of how to work with it
[19:51:44] ahh, x-request-id, that makes sense. I had tried requestId with no luck, but didn't know if it was because i was doing it wrong or it was the wrong query
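The same trace search as the UI link above, expressed against what is presumably the stock Jaeger query HTTP API; the /api/traces path, its parameter handling, and reachability/auth at trace.wikimedia.org are assumptions:

    # the /api/traces endpoint is an assumption (standard Jaeger query API);
    # start/end are microsecond epoch timestamps and tags is a JSON object, matching the UI URL above
    end=$(( $(date +%s) * 1000000 ))
    start=$(( end - 2 * 24 * 3600 * 1000000 ))
    curl -s -G 'https://trace.wikimedia.org/api/traces' \
      --data-urlencode 'service=mw-web' \
      --data-urlencode 'limit=20' \
      --data-urlencode "start=$start" \
      --data-urlencode "end=$end" \
      --data-urlencode 'tags={"guid:x-request-id":"bb7130ad-59a0-4075-939b-0c75c35cb2ce"}' \
      | jq '.data | length'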
[19:51:55] i suppose makes sense since that's the header we pass
[20:06:30] yeah, and it's parsed out by Envoy, not by our own code
[20:09:30] ebernhardson: it might be possible to query opensearch for traces that contain spans that are hours apart?
[20:09:41] I think you'd do better at that than I would though 😅
[20:10:29] cdanis: i did look into that, but wasn't having a bunch of luck. What i'm doing now is first issuing a query to source reqIds for logs that have EmergencyTimeoutException in them, then running a second aggregation query that sources the min and max @timestamp value
[20:10:44] got it
[20:11:49] cdanis: found at least 10, but this was with the aggregation stopping at 10: https://phabricator.wikimedia.org/P69107
[20:12:20] but also i haven't yet limited those properly, they could come from jobs. Still tweaking :)
[20:12:52] ebernhardson: can you put your query in a paste?
[20:15:02] cdanis: probably this one (the aggregation), it's not too complex. There is some jq in the middle that transforms source reqIds into this and transforms the output into what i pasted: https://phabricator.wikimedia.org/P69108
[20:16:25] it's mildly annoying that i can't paste a list of reqIds and instead need the bool query, might be another way but not thinking of it at the moment. I guess a giant query_string query could do it as a OR b OR c OR ...
[20:16:33] but it's easy to make jq output the array
[20:16:54] shouldn't you be able to do a `terms` on `reqId.keyword` ?
[20:17:31] cdanis: i thought so too, but opensearch says i can't do that: [terms] query does not support [reqId.keyword]
[20:17:36] ah
[20:57:14] cdanis: the full script that ended up working looks like this, runs in an empty directory: https://phabricator.wikimedia.org/P69109
[21:03:27] ah I see, yeah
[21:03:44] do you know of a way in elasticsearch to like ... tell it to please actually scan the full table to compute some aggregates?
[21:04:57] cdanis: not sure, we basically don't use any aggregations in search. I guess i'm being hopeful that the 100000 is enough to source all the input messages since it's generally not too many requests per day that time out this way
[21:05:03] yeah
[21:06:17] I was trying that to look for long-running traces, but it's harder than I know how to do
[21:06:49] I'm out of time for today though -- if there's a tracking task or something for your investigations ebernhardson can you cc me on it?
[21:07:14] i'll cc you, i'm still in the middle of writing the task. our task is T331127
[21:07:15] T331127: phantom redirects lingering in incategory searches after page moves - https://phabricator.wikimedia.org/T331127
[21:07:32] it'll be a subtask
[21:07:36] thanks!
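A rough reconstruction of the two-step approach described above; the actual working script is the one in P69109, so the OpenSearch endpoint, index pattern, and field names here are placeholders/assumptions. Step one collects reqIds from EmergencyTimeoutException log entries, step two asks for the min/max @timestamp per reqId, building a bool/should of match_phrase clauses since the terms query on reqId.keyword was rejected:

    # LOGSTASH_ES and the index pattern are placeholders / assumptions
    # step 1: pull reqIds for logs mentioning EmergencyTimeoutException
    curl -s "$LOGSTASH_ES/logstash-*/_search" -H 'Content-Type: application/json' -d '{
      "size": 1000,
      "_source": ["reqId"],
      "query": {"query_string": {"query": "EmergencyTimeoutException"}}
    }' | jq -r '.hits.hits[]._source.reqId // empty' | sort -u > reqids.txt

    # step 2: build the bool/should query from the reqId list with jq, then aggregate
    # min/max @timestamp per reqId (the aggregation field name is an assumption)
    jq -Rn '{
      size: 0,
      query: {bool: {should: [inputs | {match_phrase: {reqId: .}}], minimum_should_match: 1}},
      aggs: {by_req: {terms: {field: "reqId.keyword", size: 100},
                      aggs: {first: {min: {field: "@timestamp"}},
                             last: {max: {field: "@timestamp"}}}}}
    }' < reqids.txt \
      | curl -s "$LOGSTASH_ES/logstash-*/_search" -H 'Content-Type: application/json' -d @- \
      | jq '.aggregations.by_req.buckets[]
            | {reqId: .key, first: .first.value_as_string, last: .last.value_as_string}'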