[08:04:53] dcausse: good write-up on https://www.wikidata.org/wiki/Wikidata:Request_a_query#Help_with_WDGS ! It seems to be super useful to our users!
[09:47:19] dcausse: Jenkins runs into test errors in Wikibase-related code when building CirrusSearch: https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php74-noselenium/39696/console and they seem unrelated to my changes but show up in the latest Wikibase CRs https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1072510 - can we ignore them?
[09:57:28] pfischer: looking
[09:59:24] pfischer: should be fixed soon T374690
[09:59:25] T374690: Wikibase CI failure in OutputPageEditabilityTest "RuntimeException: Database backend disabled" - https://phabricator.wikimedia.org/T374690
[09:59:55] it's unrelated to cirrus indeed, these global CI failures are generally fixed pretty quickly
[10:00:25] yes there's already a patch up https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1072715
[10:00:48] so once that's merged you can just comment "recheck" on your CR; that should tell jenkins to retry the build
[10:29:43] lunch
[11:19:59] dcausse: Thanks, all three CRs should now only be blocked by the pending fix you mentioned. I’ll be out for today. Have a nice weekend!
[13:03:15] thanks, take care!
[13:15:10] o/
[13:39:31] \o
[13:41:40] o/
[13:42:12] bad idea of the day: gergo was a little surprised we do 600 req/s to mw. We could do 400 if cloudelastic ran behind eqiad and did doc fetches from eqiad instead of mediawiki
[13:42:41] probably not worthwhile, just something that occurred to me :)
[13:43:07] .o/
[13:43:41] you mean fetch docs from production-search@eqiad ?
[13:43:44] yea
[13:44:34] that means "synchronizing" the two jobs somehow
[13:45:07] re: https://phabricator.wikimedia.org/T373935#10142745 , graph split hosts aren't in the network path for categories, are they?
[13:45:09] i was thinking maybe just a delay, have cloudelastic wait an extra 5 minutes, but indeed no guarantee there
[13:46:22] inflatador: sadly they are (was working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1070958 to drop them from split hosts)
[13:48:03] ebernhardson: I have a feeling that it's going to be a bit messy but why not :)
[13:48:43] dcausse: as i said, bad idea :) Gergo was just surprised we do 600 req/s, but if we get more complaints it might be something to think more about
[13:49:43] sure, I think memcached is helping a bit there to avoid the costly part
[13:50:21] yea hopefully that covers most of it
[13:50:43] other random idea would be an http proxy service that sits in the middle and caches responses
[13:50:47] also we'll be running single-dc soon, meaning eqiad & cloudelastic will both fetch from mw@codfw, so all 3 consumers will be hitting the same cluster
[13:51:36] probably the proxy is a little less crazy i suppose
[13:51:37] yes I think a cache in between seems cleaner but that's obviously more pieces in the mix
[13:55:24] dcausse interesting, would they have to use `query-"(main|scholarly)".wikidata.org/bigdata/namespace/categories` to get to a graph split host? It seems like they couldn't get there just by using `query.wikidata.org` but it's possible I'm missing something
[13:59:52] inflatador: absolutely, that's why it's pointless to have them there, nothing is pointing at them
[14:00:10] what does action=submit usually do on a mediawiki page? edits are usually action=edit
[14:00:38] action=edit enters the edit page, no?
[14:01:36] hmm, yea that makes sense. i updated the aggregated long-timeouts to include url/referrer and they are mostly either edit, submit, or api.php: https://phabricator.wikimedia.org/P69110
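On the action=edit / action=submit question: in MediaWiki the edit form served at index.php?action=edit posts back to index.php?action=submit, which is why saves show up as action=submit in request logs. Below is a minimal sketch of how url/referrer pairs like the ones aggregated in P69110 could be bucketed by action; the sample pairs and the defaulting behaviour are hypothetical illustrations, not the paste's actual contents.

```python
from collections import Counter
from urllib.parse import parse_qs, urlparse


def action_of(url: str) -> str:
    """Best-effort MediaWiki action for a logged URL.

    api.php requests get their own bucket since they don't use the same
    action parameter; pretty /wiki/ URLs default to "view".
    """
    parsed = urlparse(url)
    if parsed.path.endswith("api.php"):
        return "api.php"
    return parse_qs(parsed.query).get("action", ["view"])[0]


# Hypothetical (url, referrer) pairs standing in for the aggregated
# long-timeout entries in P69110.
pairs = [
    ("https://en.wikipedia.org/w/index.php?title=Foo&action=submit",
     "https://en.wikipedia.org/w/index.php?title=Foo&action=edit"),
    ("https://en.wikipedia.org/w/index.php?title=Foo&action=submit",
     "https://en.wikipedia.org/w/index.php?title=Foo&action=submit"),
    ("https://en.wikipedia.org/w/api.php", ""),
]

counts = Counter(
    (action_of(url), action_of(ref) if ref else "-") for url, ref in pairs
)
for (url_action, ref_action), n in counts.most_common():
    print(f"url={url_action:10} referrer={ref_action:10} count={n}")
```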
[14:01:58] but some are referrer: submit, url: submit
[14:02:17] i suppose that could happen if mw refuses the edit for some reason and they submit again
[14:03:20] ah perhaps?
[14:05:47] i'm still not sure what we do about this though, do we hope platform/serviceops prioritize the problem? It's on average <10 reqs per day that time out this way, most seem to be edits so we probably lose events
[14:06:43] i worry that with the low frequency it's a problem, but perhaps not a big enough one
[14:08:56] yes... I guess they'll do a ratio with the number of successful edits and might consider this acceptable, or not?
[14:09:37] but it's kind of annoying that this missed event just happened on a bug we supposedly fixed :)
[14:09:48] per the headline dashboard on grafana we have ~10 successful wiki edits/sec over the last 24 hours, so it's erroring on 1s worth of edits per day
[14:10:48] lol, yes it is a bit tedious. The bug fixes do mean that saneitizer should now fix these moves i hope
[14:11:25] but i can understand that for editors 2 weeks isn't exactly a quick fix
[14:12:27] 1s of edits seems like a .00011% miss rate if my math is correct
[14:13:22] also our SLO doesn't capture these kinds of "document not updated on time" errors, but not sure how it possibly could
[14:14:06] true...
[14:21:57] re: categories, if we get rid of it on graph split hosts, and we completely migrate off of full graph hosts, that implies we'll need to run categories somewhere else
[14:22:38] inflatador: absolutely, mentioned that to Guillaume a couple days ago
[14:23:14] we can add them back... or we could prioritize moving them somewhere else
[14:23:57] but yes, all that to say they're going to annoy us for quite some time still :)
[14:29:42] dcausse ACK, looks like you've already done a lot to decouple categories (puppet/rdf patches). So I guess that means we will eventually move categories to its own infra (which is totally fine w/me).
[14:30:32] gehel do we need to clear the decision to move categories to its own infra w/any other stakeholders? If not I'd be fine w/closing T374016 with an update that we're definitely moving it
[14:30:33] T374016: Consider separating wdqs-categories from the rest of the wdqs stack - https://phabricator.wikimedia.org/T374016
[14:31:20] inflatador: I think that's ultimately what we want, but it's up to you; I feel it's mostly painful for SREs, with annoying alerts, noise, and the like
[14:34:54] dcausse it makes sense to move it, although rewriting the puppet code is its own pain (that luckily you have suffered more than me ;P). I'm OK w/moving it unless anyone else has objections
[14:35:14] I'm all for it
[14:36:05] ryankemper ^^ any objections to moving categories to its own infra?
[14:53:47] workout, back in ~40
[15:03:42] I'd like to test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1072759
[15:04:22] Janis has been told by Balthazar that we ought to be able to get rid of this list of broker IPs if we had a recent enough kafka client
[15:05:29] these NS entries seem to work on the pods and properly list the kafka nodes
[15:45:41] ahh, that makes sense. I kinda expected we should be able to use some discovery address, since kafka bootstraps from the node it first connects to in order to find the rest of the cluster
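A minimal sketch of the client side of that idea, assuming the "recent enough kafka client" refers to Kafka's client.dns.lookup setting. The production consumers are JVM clients configured via deployment-charts; confluent-kafka is used here only for illustration, and the port, CA path, group id and topic are assumptions. use_all_dns_ips expands the bootstrap name into all of its A records, while resolve_canonical_bootstrap_servers_only reverse-resolves those IPs to canonical broker hostnames first; with TLS, that second mode only works if the reverse lookup returns names covered by the brokers' certificates.

```python
# Sketch only, not the production consumer config; the setting names are
# the same in the JVM clients. Port, CA path, group id and topic below
# are assumptions for illustration.
from confluent_kafka import Consumer

conf = {
    # One discovery address instead of an enumerated list of broker IPs.
    "bootstrap.servers": "kafka-main-eqiad.external-services.svc.cluster.local:9093",
    # use_all_dns_ips: try every A record behind the bootstrap name.
    # resolve_canonical_bootstrap_servers_only: reverse-resolve those IPs to
    # canonical hostnames before connecting; with TLS this requires the
    # reverse lookup to return names present in the broker certificate SANs.
    "client.dns.lookup": "use_all_dns_ips",
    "security.protocol": "SSL",
    "ssl.ca.location": "/etc/ssl/certs/ca-certificates.crt",  # assumed path
    "group.id": "cirrus-streaming-updater-test",  # hypothetical group
}

consumer = Consumer(conf)
consumer.subscribe(["eqiad.mediawiki.page_change.v1"])  # hypothetical topic
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.topic(), msg.partition(), msg.offset())
consumer.close()
```

The staging failure discussed below ("No subject alternative DNS name matching 10-64-48-31...") is consistent with the reverse-lookup path: inside k8s the per-IP service name comes back instead of the broker's canonical kafka-main1002.eqiad.wmnet name, so hostname verification finds nothing matching in the certificate.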
[15:50:13] back
[15:50:29] dcausse if I can help test the above patch LMK
[15:52:45] inflatador: thanks for the offer, I'll deploy this now and see how this works in staging, won't be deploying anything to prod today tho
[15:54:01] ACK
[15:59:33] does not work, it's failing the ssl handshake
[16:03:45] that's odd
[16:04:19] I guess the kafka side doesn't have a SAN for the discovery record?
[16:06:35] unsure when it happens...
[16:06:38] No subject alternative DNS name matching 10-64-48-31.kafka-main-eqiad.external-services.svc.cluster.local
[16:07:04] my understanding is that first it should get IPs by doing an nslookup on kafka-main-eqiad.external-services.svc.cluster.local
[16:07:13] and then resolve hostnames from those IPs
[16:07:23] :S
[16:07:36] but perhaps those IPs are resolved to something else within k8s
[16:08:16] host kafka-main-eqiad.external-services.svc.cluster.local resolves to 10.64.16.37, 10.64.48.30, ... in the pod
[16:08:52] but the reverse lookup (host 10.64.16.37) resolves to 10-64-16-37.kafka-main-eqiad.external-services.svc.cluster.local.
[16:09:20] I think it should resolve to kafka-main1002.eqiad.wmnet
[16:31:49] going offline, have a nice weekend
[16:56:22] lunch
[17:11:16] hmm, was curious about the aggregate-over-everything question... found a way to do it with pagination but it's way too slow. ~35ms per reqId, and there are 57M reqIds for that day. Works out to 23 days to query them all
[17:11:40] there might be a better way to source reqIds with a scroll instead of repeated search_afters though
[17:22:52] ebernhardson: so, cwhite helped me figure this out for traces -- and I've failed to find any traceIds with spans that are more than 20 seconds apart
[17:23:12] https://phabricator.wikimedia.org/P69117
[17:24:14] cdanis: interesting, i'm surprised the bucket selector can be fast enough for that
[17:24:26] there are probably a lot fewer spans than logs
[17:25:22] cdanis: ahh, yea perhaps. The index i'm testing against has ~70M logs (surprisingly fewer than i would have expected)
[17:26:03] yeah there's only about 10M spans
[17:26:05] every day
[17:26:06] we have search indexes larger than that :P
[17:26:27] commonswiki_file is ~150M docs and 1.4tb (4.2 w/replication) :S
[17:26:29] mhm
[17:30:26] one interesting option would be to dump the logstash servers into hadoop, i have a script that we use to do the same for wikis. Would have to be able to query logstash hosts from hadoop though, which i suspect is currently firewalled
[17:31:17] but then you can throw a few thousand cores in parallel at the problem of aggregation
[17:33:54] https://phabricator.wikimedia.org/T291645 <- Integrate Event Platform and ECS logs
[17:34:45] this would be a valuable thing to do so that when this kind of thing happens the data is just there
[17:34:52] ottomata: nice! although with a date of 2021 i don't have a lot of hope for movement :P
[17:35:10] well, the thing is, this is the kind of thing that I need engineers like you and cdanis to be squeaky about
[17:35:27] managers prioritizing this stuff don't see the cross-team value
[17:35:40] I've been squeaky about that one a few times ottomata :)
[17:35:43] we had meetings with o11y in 2021 about this
[17:35:50] how do we get these squeaks to add up tho?!
[17:35:52] and be loud?!
[17:36:52] well, i posted a comment about the current thing i'm looking into. Not sure how much that helps :)
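Going back to the bucket-selector aggregation above: the query in P69117 isn't shown here, but one way to express "find ids whose documents are spread more than 20 seconds apart" in the Elasticsearch DSL is a terms aggregation with min/max timestamps plus a bucket_selector. This is a sketch of that shape, not the actual paste; the index name and field names (reqId, @timestamp) are assumptions, and the same shape works for trace ids in the span index.

```python
# A sketch of the "ids whose documents span more than 20s" check, not the
# actual query in P69117. Index name and field names are assumptions.
import requests

query = {
    "size": 0,
    "aggs": {
        "by_id": {
            # terms only returns the top N ids by doc count; see the note
            # below about composite aggregations for exhaustive paging.
            "terms": {"field": "reqId", "size": 10000},
            "aggs": {
                "first": {"min": {"field": "@timestamp"}},
                "last": {"max": {"field": "@timestamp"}},
                "span_over_20s": {
                    "bucket_selector": {
                        "buckets_path": {"first": "first", "last": "last"},
                        # min/max on a date field resolve to epoch millis
                        "script": "params.last - params.first > 20000",
                    }
                },
            },
        }
    },
}

resp = requests.post(
    "http://localhost:9200/logstash-2024.09.13/_search",  # hypothetical index
    json=query,
    timeout=120,
)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_id"]["buckets"]:
    spread_ms = bucket["last"]["value"] - bucket["first"]["value"]
    print(bucket["key"], f"{spread_ms / 1000:.1f}s")
```

For walking all 57M reqIds exhaustively, a composite aggregation paged with after_key (or sourcing the ids via a scroll, as suggested above) avoids the terms-agg bucket cap, at the cost of many round trips.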
[17:37:17] this is similar to https://phabricator.wikimedia.org/T355837#9813506
[17:37:36] i'd love to make an openmetrics / prometheus-compatible event schema, and use that as the bridge between operational metrics and the data lake
[17:38:26] thanks ebernhardson, that does help
[17:59:29] inflatador: yeah we def want categories on its own infra
[18:41:51] back
[20:23:15] gotta love "universal serial bus - type c" where just because it plugs in doesn't mean it will work
[20:23:24] really killed the universal :P