[09:56:23] lunch
[14:10:51] \o
[14:12:57] o/
[14:23:22] \o
[14:37:44] o/
[14:47:06] meh..realized on review I set public dirs to 770 / files to 660. That's still group writable :P one more time...
[14:47:24] should be 750 / 640
[14:54:30] err, still wrong :S "private" should be 750/640, public should be 755/644
[15:02:38] ebernhardson: meeting?
[15:02:42] doh, sec
[16:32:13] SRE meeting wrapped up, going to take the puppy out for an adventure and hopefully tire him out enough to let me get some deep work done :P
[16:43:40] dinner
[16:46:30] hmm, cindy setup fails with `Bot password creation failed. Does this appid already exist for the user perhaps?`
[17:15:41] oddly, the database isn't being destroyed when deleting and recreating the mysql container...still not sure why yet
[17:24:27] ohh, disks are almost full, and running some commands made them fill the rest of the way. Maybe just some oddity after something failed due to full disks
[17:35:34] nope, not it :P but `docker volume prune -f` got the job done. Not sure what changed :S Seems a bit strong to put in the setup scripts though.
[17:51:14] and the answer was ... the error removing the network was hiding the real problem: volumes are cleaned up after the network, so when network removal failed the volume cleanup never ran. Possibly related to the disk-full; needed to restart the docker service before it would allow the network to be deleted.
[19:03:47] ebernhardson low priority, but I set T373895 to "needs review" pending your confirmation
[19:07:49] T373895: Reduce frequency of garbage collection alerts on cloudelastic - https://phabricator.wikimedia.org/T373895
[19:08:11] can glance at the dashboards
[19:09:40] * inflatador should probably have done that too
[19:09:55] they all look plenty reasonable
[19:10:45] cool...just looking at "GC runs stats" and friends from https://grafana.wikimedia.org/goto/mTWnrRgNR?orgId=1 ? LMK if there's another place to check
[19:11:06] might be easier if we had a dashboard that showed a graph per host, instead of stepping through the hosts, but it still works
[19:11:16] (at least, on cloudelastic where there aren't many)
[19:12:30] yeah...I can look into making a combined panel
[19:43:16] https://grafana.wikimedia.org/goto/-6C3CggHg?orgId=1 WIP panel for Elastic GC by cluster, feel free to edit...if it looks good I'll apply this to other GC runs panels
[19:45:30] inflatador: hmm, from here it looks the same?
[19:45:41] oh, it's summed
[19:46:14] but yea, that should still show when old gc goes crazy on us
[19:46:16] ebernhardson yeah, I'm flexible on that...played around with sum and avg, feel free to edit to a more useful visualization
[19:46:54] inflatador: the scale seems wrong though, the GC run stats next to it is 0.05-0.1/s, and this one is also 0.05-0.1/s
[19:47:24] it clearly says sum though, hmm
[19:48:14] the visualization type is "time series" as opposed to "graph (old)" for the old panel, not sure if that affects anything
[19:50:53] still not quite sure.. if I `sum by (gc) (rate(...))` the numbers look more reasonable, but I can't explain why that's different from `sum(rate(..., gc="old"))`
[19:53:16] I vaguely remember some warnings about doing it the way I'm doing it...something about the results of instant vectors maybe?
[19:53:25] anyway, I'll change to the `sum by (gc)` approach
[20:01:57] inflatador: oh, I see the difference. The bigger number is young, and on young you have quantile(0.5, rate(...)), so young was getting averaged rather than summed
[20:02:16] inflatador: with the sum by (gc) (...) method you shouldn't need separate definitions, it will generate young and old in one go
[20:02:44] I was only looking at the top line, which was old and had the sum on it
[20:03:09] (but that is so small that the per/sec rates are mostly 0, so the sums don't line up on the same time period)
[20:03:54] that's also why we added the per-hour graphs, more meaningful when looking at the old gc which shouldn't trigger more than a few times an hour
[20:05:38] cool, I removed the "B" query completely per your advice
[22:40:01] ebernhardson: Thank you for reviewing https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1067928 - I just checked when the next backport window is (one where I'm around) and noticed that Wed-Thu are planned to roll out 1.43-wmf.23. Shall we get it on that train, or should that change wait for the next one? Do we have to wait for a train at all?
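
A minimal shell sketch of the permission scheme settled on in the 14:54 message (755/644 for public, 750/640 for private). The paths here are hypothetical stand-ins, not the actual directories being fixed:

  # Hypothetical paths; directories and files get separate modes.
  find /srv/public  -type d -exec chmod 755 {} +   # public dirs: world-traversable
  find /srv/public  -type f -exec chmod 644 {} +   # public files: world-readable, not group writable
  find /srv/private -type d -exec chmod 750 {} +   # private dirs: no world access, no group write
  find /srv/private -type f -exec chmod 640 {} +   # private files: group read-only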
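
A sketch of the teardown-ordering bug diagnosed at 17:51, with hypothetical network/volume names; the `|| echo` fallback is one possible fix, not necessarily what the setup script does:

  # Hypothetical names; illustrates the failure mode described above.
  set -e
  docker network rm cindy_default       # failed (docker needed a restart first)...
  docker volume rm cindy_mysql_data     # ...so this cleanup was never reached

  # One way to let the teardown continue past a failed network removal:
  docker network rm cindy_default || echo "network removal failed, continuing" >&2
  docker volume rm cindy_mysql_data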
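
And a sketch of the single-query approach adopted at 20:02 for the GC panel. The metric and label names are assumptions (elasticsearch_exporter-style); the real dashboard queries may differ. The scale mismatch follows from the aggregation: quantile(0.5, ...) takes the median across hosts, which stays at per-host scale, while sum(...) adds the hosts together:

  # Instead of two separate queries (quantile for young, sum for old)...
  quantile(0.5, rate(elasticsearch_jvm_gc_collection_seconds_count{gc="young"}[5m]))
  sum(rate(elasticsearch_jvm_gc_collection_seconds_count{gc="old"}[5m]))

  # ...one query that yields both the young and old series, summed consistently:
  sum by (gc) (rate(elasticsearch_jvm_gc_collection_seconds_count[5m]))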