[00:16:56] ryankemper interestingly enough, there was an alert for unexpected MSS value on cirrussearch1081 https://alerts.wikimedia.org/?q=alertname%3DFermMSS
[00:17:11] you just restarted it, right?
[00:18:00] inflatador: yup, 1081 was the one I restarted. or rather stopped, but puppet wasn't disabled so it's back up now
[00:18:55] ryankemper did it join the cluster OK? after split brain it's safer to wipe out its datadir and let it join as a new node
[00:20:16] inflatador: nevermind, it's not back up now. must have mixed up my tabs
[00:21:22] ryankemper ACK, I'd just leave it out and leave puppet off until we figure out the MSS alert thing, probably till tomorrow
[00:22:11] inflatador: disabled puppet on 1081
[00:22:18] it's possible I screwed up the hieradata for that host's IPIP
[00:22:32] but we are good to repool eqiad now, will do that unless you have any concerns
[00:23:09] inflatador: we're still yellow status, repooling is fine but might be wise to wait for green while stuff reshuffles
[00:23:25] ryankemper ACK, will wait. I'm also gonna wipe the datadir for 1081, better to be safe with that
[00:23:32] sounds good
[00:24:55] If you wanna work on an incident report, https://wikitech.wikimedia.org/w/index.php?create=Create+report&title=Incidents%2F2025-07-07+Cirrussearch+outage&redirect=no is a good place to start.. otherwise it can wait until tomorrow
[00:26:39] just downtimed 1081 for the next 12 hrs
[00:33:53] ryankemper sorry, I misread your update. I actually am repooling now
[00:37:11] BTW, this is the best panel for seeing the changes from pooling/depooling https://grafana.wikimedia.org/goto/7gTI1_UHR?orgId=1
[10:03:16] I got pinged by traffic because cirrussearch1081 is generating alerts due to this MSS inconsistency. Conversation happening here: https://wikimedia.slack.com/archives/C055QGPTC69/p1753177774854319
[10:12:48] I have depooled cirrussearch1081, as pybal probes were failing anyway, so it wasn't really pooled.
[10:13:09] Pybal is also expecting a service to be listening on port 9200 on this host and nothing is currently listening on that port.
[13:12:02] \o
[13:12:04] hmm, we disable dnsdisc for maintenance operations (as it's read-only and maint stuff usually writes/creates), but now the weekly dumps are failing because they use the same stuff...hmm
[13:25:28] See: T400158
[13:25:31] T400158: cirrussearch dumps have failed - 2025-07-21 - https://phabricator.wikimedia.org/T400158
[13:26:54] Hi inflatador - I didn't know how best to go about fixing the MSS issue.
[13:27:27] Have you any idea?
[13:28:15] that is indeed a very weird thing...i don't know much about it but it seems kernel/tcp stack level?
[13:28:36] if it's negotiated during tcp, and declares the size of a tcp packet...feels like kernel config
[13:32:25] Yes, it's related to the work to update the load balancer from LVS to IPIP - I've not been very close to it, though.
[13:41:05] * ebernhardson realizes after giving up on a patch yesterday....that the failure was because i spelled "Services" as "Sevices"
[13:45:04] :-) We all do it.
[13:56:36] Re: MSS, based on the Slack thread ( https://wikimedia.slack.com/archives/C055QGPTC69/p1753177774854319 ) it seems this fires when the node is failing its health checks. So not really an MSS problem, just a slightly confusing alert
[14:07:04] ryankemper I created a ticket ( T400160 ) and incident report https://w.wiki/Ep9U for last night's outage. I copy/pasted the incident report from July 7th as it's gonna be very similar. Feel free to work on it as you have time, we can go over it together at pairing today
[14:07:05] T400160: Write incident report for cirrussearch outage 2025-07-21 23:52 2025-07-22 00:17:45 - https://phabricator.wikimedia.org/T400160
[14:16:24] I'm adding cirrussearch1081 back to the cluster. Time to do some log diving ;)
[14:17:17] ooh, also need to re-enable replica allocations
[16:01:34] ^^ that's done
[16:01:51] ~1700 unassigned shards and dropping
[16:05:33] quick break, back in ~20
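For reference, a rough sketch of the two recovery steps mentioned above (re-enabling replica allocation, then waiting for the unassigned shards to drain and the cluster to go green). This is not the team's actual tooling; it assumes direct access to the standard Elasticsearch REST API, and the endpoint below is a placeholder.

```python
# Hypothetical sketch: re-enable replica allocation, then poll cluster health.
import time
import requests

ES = "http://localhost:9200"  # placeholder; substitute the real cluster endpoint

# Re-enable shard allocation (it is often restricted to "primaries" during maintenance).
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.enable": "all"}},
).raise_for_status()

# Watch recovery progress until the cluster reports green.
while True:
    health = requests.get(f"{ES}/_cluster/health").json()
    print(health["status"], health["unassigned_shards"])
    if health["status"] == "green":
        break
    time.sleep(30)
```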
[16:31:02] back
[16:46:15] randomly curious, apparently clipping your dataset to percentiles, like maybe the 5th and 95th, before taking a mean is called a winsorized mean. And since this isn't a symmetric distribution, it's "unlikely to produce an unbiased estimator"...but i'm going to guess it's good enough :P
[16:46:49] although i guess i should take its advice to clip both sides, i was only going to clip the top
[16:47:07] * inflatador wishes he knew more about statistics
[16:47:41] me too :P I have to search for this stuff and hope i understand what i read.
[17:16:58] ebernhardson: looks like winsorizing isn't the same as trimming; rather, you replace the outliers with the 95%ile value (or 5%ile value on the low end).. so you still have the same number of "high" and "low" values, but they aren't wildly excessive. If your long tail on the low end is long enough, winsorizing does nothing.. you replace a bunch of 1s with.. 1s.
[17:18:35] Trey314159: hmm, that's basically what Pandas clip claims to do? "Assigns values outside boundary to boundary values."
[17:18:50] i guess maybe pandas is being fast-and-loose with the term "clip"
[17:19:27] but if i take [1,2,3,4,5] and clip to 3 i get [1,2,3,3,3]
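A small sketch of the distinction being discussed, on made-up data: clipping to the 5th/95th percentiles and then averaging gives a winsorized mean, while a trimmed mean drops the tail values entirely. The thresholds and the sample distribution here are illustrative only.

```python
# Synthetic data; 5th/95th percentile bounds as mentioned above.
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats.mstats import winsorize

s = pd.Series(np.random.default_rng(42).lognormal(sigma=1.5, size=10_000))

# Winsorized mean: cap values at the 5th/95th percentiles, then average.
# pandas' clip does the capping, exactly as in the [1,2,3,4,5] -> [1,2,3,3,3] example.
lo, hi = s.quantile(0.05), s.quantile(0.95)
winsorized_mean = s.clip(lower=lo, upper=hi).mean()

# Roughly the same thing via scipy (limits are the tail fractions to winsorize).
winsorized_mean_scipy = winsorize(s.to_numpy(), limits=(0.05, 0.05)).mean()

# Trimmed mean: drop the tails entirely instead of capping them.
trimmed_mean = stats.trim_mean(s, proportiontocut=0.05)

print(s.mean(), winsorized_mean, winsorized_mean_scipy, trimmed_mean)
```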
[17:20:43] Is anyone interested in working through an online stats course/book? I've been thinking of doing it for a long time... I've had my eye on https://www.openintro.org/book/os/ ...
[17:21:16] ebernhardson: the wiki page did use "trim" rather than "clip".. maybe "clip" is ambiguous....
[17:21:58] i've never been good at course work :P but could try
[17:23:05] afk a couple
[17:45:15] Trey314159 yes
[17:49:34] Cool! Maybe in August (when David and Peter are back) we can look into setting up a weekly Stats Chat to work through that course or another one. We should look around a little for material that is relevant to the kind of stuff we do... though maybe a good foundation is a good first step.
[17:49:35] I really want something that tells you *how* to design experiments and analyze results... maybe something more in the realm of data analysis would be better. Anyway, I'll add it to a Wednesday meeting in August and we can chat then.
[17:50:17] Thanks, I definitely need the foundation (and the structure of a recurring mtg)
[18:03:10] hmm, i wonder if it would help with deciphering these errors: The data appears to lie in a lower-dimensional subspace of the space in which it is expressed. This has resulted in a singular data covariance matrix, which cannot be treated using the algorithms implemented in `gaussian_kde`. Consider performing principal component analysis / dimensionality reduction and using `gaussian_kde` with
[18:03:13] the transformed data.
[18:32:08] I do kinda understand that error. Sounds like you have dependent variables in your space (imagine you didn't realize that you had x+3, 2y+x, and x+y as variables.. they are not independent of each other) which makes some of the matrix math impossible (like taking an inverse).
[18:32:09] PCA finds new variables (like maybe x and y) and lets you rewrite your data in terms of those variables so that everything is independent. Also, if you have a variable that is mostly or entirely 0, that could cause problems. Not sure what the best real-world approach is with scipy, though...
[18:32:38] TL;DR: yeah, a stats course might help with deciphering errors!
[18:34:10] i cheated and just used the mean :P Essentially that error came out of me attempting to run the existing bootstrap/plotting function against the median value instead of the mean. I suspect the problem is that even with 1000 rounds of sampling, the medians are all 9
[19:56:07] taking a break, back in 20-30
[20:01:40] perhaps some time learning better stats would convince me to not use such large samples...ks similarity, a test for "are these two distributions the same", gets a p-value of 6.99e-160. So pretty sure something is different between the two...but the effect size is so tiny it's meaningless
[20:11:37] back
[20:36:26] 7E-160 is something else! So, it's probably more likely that a cosmic ray flipped a bit during the calculation than that the hypothesis is incorrect. (For scale, there are ~10^80 protons in the observable universe IIRC.) Anyway, I know that there are heuristics for calculating the sample size needed to detect a particular effect size... but I don't know any off the top of my head.
[20:36:36] Of course the assumption that your A & B samples come from the *exact* same distribution is suspect, so knock off a hundred orders of magnitude.. and it doesn't matter even a little! LOL
[20:37:56] i do wonder, maybe it would be a reasonable test to verify that the test actually did something. I noticed without running any numbers that in the first version of the test the treatment wasn't applied
[20:38:14] also we ran a test with wikidata autocomplete once where, in retrospect, i think the test treatment wasn't applied either
[20:38:50] As a double check i ran the same ks test against the control bucket arbitrarily split into two groups, p-value of 0.945. So it can at least distinguish "treatment" from "no treatment"
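For reference, a minimal sketch of that A/A-versus-A/B check using scipy's two-sample KS test. The data below is synthetic (not the real buckets); the point is just that with very large samples even a small shift gets a vanishingly small p-value, while an arbitrary split of the control bucket does not.

```python
# Synthetic illustration of the KS-test discussion above.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
control = rng.exponential(scale=1.00, size=200_000)
treatment = rng.exponential(scale=1.05, size=200_000)  # small, arguably meaningless shift

# A/A check: split control arbitrarily in two; expect a comfortably large p-value.
print(ks_2samp(control[::2], control[1::2]))

# A/B: with samples this large, a ~5% scale shift gets an extremely small p-value,
# while the KS statistic (the max CDF gap, a rough effect size) stays tiny (~0.02).
print(ks_2samp(control, treatment))
```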
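And circling back to the `gaussian_kde` error quoted at 18:03: a small sketch of how collinear variables produce a singular covariance matrix, plus the PCA workaround the message itself suggests (a bootstrap column whose medians are all 9 is the same problem in degenerate form: zero variance). The data is made up, scikit-learn is used for the PCA step, and a reasonably recent scipy is assumed.

```python
# Reproduce the singular-covariance failure and the suggested PCA workaround.
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
x = rng.normal(size=1_000)
data = np.vstack([x, 2 * x + 3])  # second variable is a linear function of the first

try:
    gaussian_kde(data)  # covariance matrix is singular, so this raises
except np.linalg.LinAlgError as err:
    print("KDE failed:", err)

# Keep only the directions that actually carry variance, then fit the KDE there.
reduced = PCA(n_components=1).fit_transform(data.T).T  # shape (1, n_samples)
kde = gaussian_kde(reduced)
print(kde(reduced[:, :5]))  # density estimates at the first five points
```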