[00:45:56] some curious stats...first draft of ab test report. Still missing some important bits, but getting better: https://people.wikimedia.org/~ebernhardson/T377128-AB-Test-Metrics.html
[00:47:16] i suppose a proper analyst would include words that tell you about the graphs, instead of a pile of graphs :P
[08:34:26] o/
[08:46:26] o/
[08:52:07] ebernhardson: very nice! I wonder if we should exclude the interleaved bucket
[10:36:10] errand+lunch
[13:24:26] ebernhardson: Good graphs! A bit more text would help me understand what I'm looking at.
[14:24:50] o/
[14:42:12] \o
[14:52:42] o/
[14:55:14] What kind of prose should be in the AB test report? I took another look over this historical one, it mostly only has writing up front that describes what/why, but not really analysis: https://analytics.wikimedia.org/published/datasets/discovery/reports/CirrusSearch_MLR_AB_test_on_18_wikis.html#data_clean-up
[15:04:37] indeed there's only a title & subtitle on these graphs... perhaps just a small placeholder to describe the A/B test (at least what A & B are?)
[15:05:27] sure, i was intending to fill out the opening a bit more, just hadn't got around to it. Can do that.
[15:05:53] I've also gone back and forth on the mlr-2024i bucket in main graphs ... on the one hand it's pretty useless, but it's also curious. But then the metrics conflict, sometimes it looks better and sometimes not :P
[15:06:35] errand, back in ~30
[15:09:06] yes... I agree that it's interesting to see but can be confusing because it's not really something we can pick as winner
[15:09:19] i mean, we could make that the default config :P
[15:09:27] but yea probably not
[15:10:10] interleaved as the default, that could be fun :)
[15:10:16] quite costly tho :)
[15:10:24] i'm also left curious how clickthrough rates are better on interleaved than either of the input rankers
[15:10:43] they get the best of both worlds?
[15:11:11] hmm, i guess maybe
[15:25:23] back
[16:07:04] ebernhardson: more text would make more sense once we have a real A/B and we need to interpret the results. At this point, a bit of description of what each graph is might be useful. For example, on the ZRR section, we expect lower numbers to be better, but only within reason. Some explanation of the query vs session could be useful (it's in the title, but easy to miss)
[17:23:32] the python k8s client tries to parse any data it receives as json (raw data is returned if it fails parsing) and then casts it to the return type expected by the client
[17:24:13] meaning that if your script outputs a json doc you'll receive a string like "{'foo': 'bar'}" (basically the string representation of a dictionary)...
[17:26:46] heh, wonderful
[17:28:31] ah seems like there's a hidden param '_preload_content': False that might help to skip this bold response type strategy
[17:56:40] hm and I think that mwscript-k8s does not set network rules properly, cirrus maint script fails with "Http error communicating with Elasticsearch: Couldn't connect to host, Elasticsearch down?."
[17:57:10] will debug this more tomorrow...
[17:57:12] dinner
[17:58:46] workout, back in ~40
[18:17:03] hmm, left wondering what the right representation of clicked position is. Right now i have first-click and last-click. But i'm not convinced having both is useful
[18:17:07] will just leave them for now :P
[18:42:27] back
[19:03:06] hmm, i wonder if this is a sign of a bad graph. My description: "We don't have a particular theory about what changes to this metric mean with respect to the ranker under test. This metric is retained partially as a curiosity"
[19:03:17] it's the session length metric :P
[19:04:42] doing some verification work on T364233. I ran the sample query linked in the ticket against https://commons-query.wikimedia.org and I got results. Does that confirm that the allow-listing worked, or do I need a better test?
[19:04:42] T364233: add https://imagehash-sparql.wmcloud.org/sparql endpoint to wikidata federated query whitelists - https://phabricator.wikimedia.org/T364233
[19:05:43] inflatador: yea the example from Zache for commons-query seems fine to me
[19:07:01] ebernhardson ACK, thanks. I'll update the user
[19:24:00] lunch, back in ~40
[19:29:14] inflatador: yes, it looks like that endpoint is now allowed. I think we use the same allow-list configuration for wdqs and wcqs, so it's probably just that we did not reload wcqs.
[19:29:34] Actually, maybe not, I think the allow-list is automatically reloaded every 5 minutes...
[19:32:28] And no again. We automatically reload the throttling configuration, but not the allow list. The allow list goes through a Blazegraph service that is only initialized at startup.
[20:01:30] back
[21:21:18] ebernhardson they're pinging ya in #operations. I assume it won't do much good if you aren't answering there, but FYI
[22:22:26] turns out having more screen real estate than i can see at once isn't always great :P I have irc up on the top monitor and i've been deep into the jupyter notebook that i have on the laptop screen
[22:34:49] yeah, it's a constant struggle to maintain focus and allow the right kind of interruptions. Pretty much every modern application is working against you on that one ;(
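
Note on the 17:23-17:28 exchange about the python k8s client: the string "{'foo': 'bar'}" described there is the Python repr of a dict, which is no longer valid JSON (single quotes, True/False/None). The snippet below is only a sketch of that effect and of one possible after-the-fact workaround using the standard library; it does not call the k8s client at all, and the helper name `recover_json_output` is made up for illustration. Whether `_preload_content=False` avoids the problem at the source, as suggested in the chat, is not verified here.

```python
import ast
import json


def recover_json_output(received: str) -> str:
    """Return a JSON document from a string that may be a Python dict repr.

    Hypothetical helper: it assumes the mangled string is a plain literal
    (dicts, lists, strings, numbers, True/False/None), so that
    ast.literal_eval can safely rebuild the original structure.
    """
    try:
        json.loads(received)
        return received  # already valid JSON, nothing to repair
    except json.JSONDecodeError:
        pass
    # literal_eval only evaluates Python literals, never arbitrary code.
    value = ast.literal_eval(received)
    return json.dumps(value)


# Simulate what the chat describes: JSON is parsed, then cast to str.
original = json.dumps({"foo": "bar"})
mangled = str(json.loads(original))

print(mangled)                       # the dict repr, with single quotes
print(recover_json_output(mangled))  # valid JSON again
```

This only helps when the script output is simple JSON; values such as NaN would not survive the round trip, so fixing the client call itself is likely the better path.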