[09:20:39] ryankemper: I have a few comments on T338009. I've pushed it back to in progress
[09:20:39] T338009: Create dashboards for Search SLOs - https://phabricator.wikimedia.org/T338009
[10:43:25] errand+lunch
[14:00:29] inflatador: is T355617 something to pull into the current milestone as part of T351354 ?
[14:00:30] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[14:00:30] T351354: Service implementation for cloudelastic1007-1010 - https://phabricator.wikimedia.org/T351354
[14:01:28] ryankemper: is T355593 something to pull into the current milestone as part of T351650 ?
[14:01:29] T355593: Re-generate webserver-misc-apps.discovery.wmnet cergen certificate - https://phabricator.wikimedia.org/T355593
[14:01:29] T351650: Expose 3 new dedicated WDQS endpoints - https://phabricator.wikimedia.org/T351650
[14:17:08] gehel ACK, T355617 should be pulled in
[14:17:24] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[14:26:55] o/
[14:28:25] .o/
[14:49:42] til spark dropDuplicates(['col1', 'col2']) is way faster than a "clean" dedup with Window.partitionBy("col1", "col2").orderBy("id") and rank().over(window) == 1
[14:50:04] "faster" for me here is more like does not blow up with OOM
[16:01:52] \o
[16:02:14] o/
[16:02:26] drop duplicates makes sense in a way, it's about how much work is done. dropDuplicates([col1, col2]) is hashing those two values together and on aggregation doesn't need to keep any extra metadata, just one row per hash. The rank version collects and sorts the full set of rows then selects the first (i think, it could in theory be optimized with a PQ matching the highest rank but it
[16:44:31] doesn't look to do that)
[16:55:04] workout, back in ~1h
[16:56:10] tho dropDuplicates might have been faster, I still get OOM when I assemble all the pieces together... running the heavy bits individually it all passes... :/
[16:59:14] ouch, those are always fun. More partitions, more shuffles, maybe it will be happy ;)
[17:02:24] the sad part is that I don't have much data, ~50k rows, but some rows might be big and I run "explode" on some column so spark might not like that :(
[17:07:43] can maybe reshuffle after explode to even everything out? I suppose that might not help if they need to be re-grouped and the groups are skewed
[17:09:42] Worked with Daniel and Eoghan from collaboration-services to get the new `webserver-misc-apps` cergen cert in place. Also re-enabled the microsites (had disabled to reduce noise from `ProbeDown` alerts)
[17:09:56] Everything went swimmingly and we now have the services externally reachable! ex: https://query-full-experimental.wikidata.org/
[17:10:19] if I run count() instead of writing the output to disk it passes... wondering if spark is not attempting some optimizations
[17:10:41] ryankemper: woo! \o/
[17:10:47] curious :S
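A minimal PySpark sketch of the two dedup approaches being compared above; `df`, `col1`, `col2`, and `id` are placeholder names taken from the chat example, not the actual job:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, "a", 11), (1, "a", 10), (2, "b", 12)],
    ["col1", "col2", "id"],
)

# Cheap dedup: hash-aggregates on (col1, col2) and keeps one arbitrary row per
# key, so no per-group sort and little extra state.
deduped = df.dropDuplicates(["col1", "col2"])

# "Clean" dedup: sorts every (col1, col2) group by id and keeps the rank-1
# rows, which requires materializing whole groups and is where the OOM risk is.
w = Window.partitionBy("col1", "col2").orderBy("id")
deduped_ranked = (
    df.withColumn("rank", F.rank().over(w))
      .filter(F.col("rank") == 1)
      .drop("rank")
)
```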
[18:18:31] ryankemper YUS!
[18:27:11] Is there anything production-critical relying on cross-cluster search or cloudelastic in general? Working on a cloudelastic migration plan for T355617
[18:27:11] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617
[18:37:12] inflatador: cloudelastic is only written to from prod. Those writes will back up into the job queue, but only for 15 minutes (iirc, could check in mw-config if important). So nothing in prod should be at risk, but we might lose some writes if it goes on too long
[18:38:09] ebernhardson if we're down longer than 15m, will the saneitizer eventually fix it, or will we just lose those writes?
[18:38:42] oh interesting, it seems i removed $wgCirrusSearchDropDelayedJobsAfter when removing the frozen indices support, so jobs backlog is unbounded
[18:39:54] :/
[18:40:24] i guess that was implemented a bit oddly. It wasn't a generic limit, it was only checked when re-queueing a job when the cluster was frozen
[18:40:50] could trivially add it back (it's still configured in mw-config). I imagine it would drop the job by checking before doing the action, turning them into noops
[18:41:22] we still drop based on the number of retries?
[18:41:31] i suppose that was using a specific meaning of "Delayed" and not just generally old. Naming is hard :P
[18:41:47] do we need to freeze the cluster before shutting down so we can keep those writes in the job queue?
[18:41:47] dcausse: yes errors are hard limited at 4
[18:42:17] inflatador: not anymore, we removed freezing functionality mid 2022 since it was no longer necessary to regularly freeze
[18:42:26] Y, I vaguely remember that
[18:42:31] inflatador: the summary seems to be, at the moment, jobs will simply backlog with no limits
[18:42:46] inflatador: well, i guess it will hit the 4 failures
[18:42:54] Good for us, probably not great for job runners ;P
[18:43:39] it's probably fine :)
[18:44:17] oops, lunch time! Back for pairing in ~30
[18:47:06] i guess if anything it means writes over that time period can't be trusted and we would want to use ForceSearchIndex.php to repeat the writes for that time period
[18:48:41] although depending on order-of-operations, switching to streaming updater soon. In theory that should backlog the sink, but might be worth checking what it does if the cluster is offline or otherwise not accepting writes
[18:51:50] dinner
[19:30:40] ♫ You're a...nut lover/lovin' some nuts too-neigh-t ♫
[19:52:27] sorry, unrelated Vince Offer flashback
[21:43:50] top-ranks just gave me some great news re: cloudelastic migration. sounds like public IP/private IP communication will work. Should make things way easier
[21:44:42] nice
[21:47:08] y, about to take a big red pen to that maintenance plan ;)
[22:01:02] small CR to change the cloudelastic masters if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/992538
[22:36:31] hm...cloudelastic hosts will change hostnames from ce.wikimedia.org -> ce.eqiad.wmnet. That means we'll have to do something different with TLS, as I doubt Letsencrypt (current CA) will allow the wmnet TLD
[22:46:23] probably have to add some ATS config too
[22:55:48] This config probably isn't used anymore ;) https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/2f947774567a99d8e2f51adf3c8cbc7404e48c19/hieradata/role/common/elasticsearch/cloudelastic.yaml#101
[23:02:34] hopefully not :)
[23:16:40] now would be a great time to move to envoy, but we probably don't want to cross the streams too much ;)