[11:07:43] lunch
[14:55:06] I'd like to upgrade the search airflow instance tomorrow, if possible. This is for T335261. It will require about 30 minutes of downtime for the airflow scheduler, during which time DAGs will not run.
[14:55:07] T335261: Upgrade Airflow instances to Bullseye - https://phabricator.wikimedia.org/T335261
[14:55:57] dcausse, ebernhardson ^
[14:56:04] Is there a particular time of day when this work would be most convenient, or would you like to carry out any work yourselves to prepare for the operations work?
[14:56:53] btullis: not really, I think any time would work
[14:57:51] btullis: as for the prep, if you let me know once it's back I can double-check that everything is back up
[15:00:47] dcausse: Great, thanks. I'll pencil it in for 10:15 UTC tomorrow morning, if that's OK with you.
[15:01:07] btullis: all good, I'll be around
[15:01:15] 👍
[15:49:43] dr0ptp4kt: started to collect query results for the new 100k sample from Andrew (collecting full and main in parallel at 5 concurrent queries)
[15:50:34] thx dcausse
[15:50:51] Are we skipping the triage meeting for the DPE staff meeting?
[15:55:17] \o
[15:55:19] good question
[15:56:49] o/
[16:01:42] last minute, but yes, I've canceled the Search triage
[16:02:44] dcausse: the slides have moved to keep the Q2 vs Q3 format, and I'm not sure that entirely makes sense. I think it's best if you talk about the graph split as a whole and then I take over with the rest
[16:02:58] or if it is too confusing, I'll take ove!
[16:03:04] s/ove/over/
[16:03:22] gehel: just let me know once I have to talk about it :)
[16:03:30] :)
[17:43:59] randomly curious: our pool counter dashboard for completion shows daily request rates of 650/s-1200/s, while the main percentiles dashboard shows 350/s-900/s. multiple pool counter invocations per request?
[18:03:13] strange... there might be multiple searches because of variants, but I believe we made it so that it ships a single multi-request
[18:03:28] language variants, I mean
[18:03:47] oh! yes, now I remember the variants. I tried to refactor it in core one time... and gave up :P
[18:03:54] :)
[18:07:31] curiously, today there is a SearchGetNearMatchBefore hook, in addition to SearchGetNearMatch. We only implement the second, which is called once per variant
[18:19:51] more curiously, that hook was added in 2010 (and git log -S took forever to find it :P)
[18:20:45] ouch, way before CirrusSearch...
[18:23:25] indeed, a curious choice
[18:30:54] * ebernhardson separately wonders how we should exclude private wikis from the consumers
[18:32:51] indeed... providing the explicit list of public wikis is not a solution...
[18:34:33] perhaps another "small" filter calling the config api? but even that might fail: cirrus-dump-config might simply fail on private wikis without the new NetworkSession extension you wrote
[18:35:36] yeah, we will simply get a readapidenied error message. We could toggle on that, but it seems iffy
[18:36:42] will have to ponder... cloudelastic writes need some sort of reasonable guarantee about which wikis they should write to
[18:40:01] well, for now there can't be any events from private wikis (https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/517e938de19d7bd6f3d39e5cf6fb03a254167d72/wmf-config/InitialiseSettings.php#11217), but we might need a stronger guarantee, especially for cloudelastic, indeed
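(Editor's note: the "readapidenied" signal mentioned at 18:35:36 can be probed against the public action API. A minimal sketch, assuming the standard /w/api.php endpoint and the Python requests library; the function name and URL are illustrative only, and the chat itself flags this approach as iffy.)

```python
# Sketch only, not the actual consumer code: anonymously hit a wiki's action API
# and treat a "readapidenied" error as "this wiki is private".
import requests

def looks_private(api_url: str) -> bool:
    """Return True if an anonymous read of the action API is denied.

    api_url is assumed to look like "https://example.wikipedia.org/w/api.php".
    """
    resp = requests.get(
        api_url,
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        timeout=10,
    )
    data = resp.json()
    # Private wikis reject anonymous API reads with the error code "readapidenied".
    return data.get("error", {}).get("code") == "readapidenied"

if __name__ == "__main__":
    print(looks_private("https://en.wikipedia.org/w/api.php"))  # expected: False
```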
[18:40:59] depending on how events from private wikis will be emitted, we'll have to keep this information around
[18:42:30] yeah, I suppose we should flag it from the source events
[18:46:20] btw, I'm removing consumer-devnull, expanding prod producers to all events, and expanding cloudelastic to the prior test list, so ~25% of cloudelastic writes will come from SUP
[18:46:35] +1
[18:57:39] o/
[19:13:02] dinner
[19:25:21] I have a CR up for adding the private-IP'd cloudelastic host back to the CE cluster... please let me know if you see any potential dangers with this plan. I was thinking that if the host fails to join the cluster, it wouldn't be much of a problem. It's also banned for the time being, so we can test connectivity before adding it back
[19:25:28] CR here: https://gerrit.wikimedia.org/r/c/operations/puppet/+/993764
[19:56:49] medical appointment, back in ~1h
[21:08:48] back
[21:43:53] huh... unexpected. If you mean to query Prometheus with `sum by (group, topic) (metric{exported_cluster="main-eqiad"})` but somehow leave out the metric name and query `sum by (group, topic) ({exported_cluster="main-eqiad"})`, it will still run and apparently check all known metrics?
[23:47:08] found a problem with saneitizer, easy patch up to fix. Curious problem though: somehow the prod clusters have a `default` cluster defined in $wgCirrusSearchClusters that points at localhost, even though we never configure it in mw-config.
[23:47:39] once saneitizer is fixed, each job will attempt to check that and then log a failure... will probably want to track it down
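(Editor's note: the PromQL behaviour observed at 21:43:53 is that a selector with no metric name, e.g. `{exported_cluster="main-eqiad"}`, is still valid and matches series from every metric carrying that label, so the grouped sums can be much larger than intended. A minimal sketch against the Prometheus HTTP API; the base URL and the bare `metric` name are placeholders, not real WMF endpoints.)

```python
# Sketch comparing the intended query with the metric-name-less variant.
import requests

PROM = "http://prometheus.example.org:9090"  # hypothetical Prometheus instance

def instant_query(promql: str):
    """Run an instant query against the Prometheus HTTP API and return the result vector."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=30)
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Intended query: one metric, grouped by group/topic.
with_metric = instant_query('sum by (group, topic) (metric{exported_cluster="main-eqiad"})')

# Metric name accidentally dropped: the selector now spans all metrics with that
# label, so the sums aggregate unrelated series.
without_metric = instant_query('sum by (group, topic) ({exported_cluster="main-eqiad"})')

print(len(with_metric), len(without_metric))
```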