[09:45:28] oh nice. nrpe::monitor_systemd_unit_state accepts a `contact_group` param... and then completely ignores it
[09:52:47] it's just there to make you feel better :)
[09:53:55] it didn't _quite_ manage that ;)
[10:09:45] pcc (puppet catalog compiler): does it work on stacked patches?
[10:09:50] ^ anyone know?
[10:23:34] <_joe_> yes
[10:23:46] <_joe_> it does a git checkout of the whole patch tree
[10:24:20] <_joe_> which also means you should always rebase your patch tree on top of master before running the compiler
[10:24:34] <_joe_> I mean rebase and update, ofc
[10:38:02] ack, thanks
[11:50:05] apergos: thanks for merging the patch <3
[11:50:19] thanks for making it!
[12:37:53] _joe_: how does the puppet style guide feel about roles including other roles?
[12:40:25] <_joe_> kormat: generally frowned upon, but IIRC it is allowed
[12:41:27] _joe_: how about roles that include a single profile, which then includes a bunch of other profiles?
[12:42:41] <_joe_> kormat: view it like this: a profile should provide some functionality, and you should be able to compose multiple profiles into what goes onto a series of servers as a role
[12:43:23] i'm thinking of >1 role that only includes (the same) single profile
[12:44:09] <_joe_> that might work in some cases, think of the appservers
[12:44:23] <_joe_> we have different roles for apis and appservers, but they include the same profiles
[12:44:55] ok cool. i'll mention in the CR that you signed off on this. thanks!
[12:45:32] <_joe_> *cough*
[12:46:14] "It's not a Stevie puppet CR unless it causes a constitutional puppet-style crisis" (™)
[12:47:20] <_joe_> nah, it's just that you ask abstract questions whose general answer is "it depends"
[12:47:54] _joe_: https://phabricator.wikimedia.org/T285390 is the concrete situation
[12:48:43] <_joe_> kormat: yes, that's literally the same reason why we have different roles for apis and appservers
[12:49:03] <_joe_> they are configured differently, so it's easy to add automation/distinguish them in hiera
[12:52:09] oh nice, role::mariadb::core already includes role::mariadb::ferm.
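The pattern _joe_ describes above — several roles that differ only in hiera configuration but compose the same profile(s) — looks roughly like this in Puppet. A minimal sketch; every class name below is illustrative, not an actual class from the repo:

    # Two hypothetical roles, configured differently in hiera but
    # including the same profiles, as with the api/appserver roles.
    class role::myapp::appserver {
        include profile::base
        include profile::myapp
    }

    class role::myapp::api {
        include profile::base
        include profile::myapp
    }

    # The profile is where the actual functionality is composed.
    class profile::myapp {
        include profile::myapp::webserver
        include profile::myapp::monitoring
    }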
[13:10:45] this is awesome! https://timerelengteam.toolforge.org/ is it a general thing?
[13:16:19] it does not seem so xd
[13:17:46] <_joe_> dcaro: we got https://people.wikimedia.org/~cdanis/sremap/ (nda login required)
[13:18:03] <_joe_> but it looks like it needs to be updated :P
[13:19:03] <_joe_> I think chris will do it eventually :)
[13:19:30] nice
[13:21:03] it's missing the name<->nick mapping though xd. per-team filtering or similar would be awesome too
[13:21:43] even the skill matrix thingie could be added :)
[13:34:16] that's maybe out of scope for that tool :)
[13:36:33] yeah, I guess it depends on what problem you are trying to solve xd
[13:41:17] <_joe_> imagine the skill matrix for all things SRE
[13:41:23] <_joe_> 579 columns
[13:42:31] I'd be happy aggregating whatever is already in the wikis and making it searchable
[14:21:55] kormat: sobanski: https://user.fm/files/v2-3503bf04a68d937a51cc995e848f2b34/calendar-running.png
[14:22:10] I like the image there, very hopeful
[14:22:18] at least google didn't render a snail
[14:22:35] a more realistic image would have the shoelaces of each shoe tied to the other
[14:22:35] although the laces being untied is very... telling
[14:22:39] yes!
[14:24:04] The s/paranoid/realistic/ side of me immediately thinks of "Gmail doing an image search for the meeting title in the background"
[14:24:14] And then I get sad...
[14:28:20] see also "wheel of cheese" - https://xkcd.com/2140/
[14:29:01] fwiw, as I understand it Google Calendar has a list of perhaps a dozen purpose-made artworks that it can pick from, based on a list of (localised) words that it may match in the title.
[14:29:31] in this case running -> https://ssl.gstatic.com/calendar/images/eventillustrations/v1/img_running_1x.jpg
[14:31:08] kormat: previously, we saw a fairly consistent delete rate of ~150/second on the various servers during the various purge runs. pc1008 is now doing 250/second
[14:31:42] it's not 300 or higher, so that suggests deletes do take non-zero time, as otherwise reducing the sleep by >50% should have at least doubled that rate afaik
[14:32:00] good to know we're still in the known underverse where actions take non-zero time
[14:33:30] 😅
[14:34:11] looking forward to week three of the percona training where we learn to disable replication locality
[14:36:25] but.. I'm also thinking this is quite a good concrete stat. At the risk of getting ahead of myself, I think that means we previously did 1.5 iterations per second (which might as well be 1.5 iterations per 0.25s of work + 0.75s of sleep, where 0.75 = 0.5*1.5), and now do 2.5 batches per second (2.5*0.2 = 0.5s of sleep, and 2.5 batches in the remaining 0.5s).
[14:37:06] If true, we went from sleeping 75% of the time to sleeping 50% of the time.
[14:39:59] Krinkle: nice, and once we switch to raid10 hosts it should also go faster
[14:40:40] Krinkle: 46 mins to get to 0.4%. that extrapolates to ... ~8 days. 😭
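Working through Krinkle's numbers — assuming the purge script deletes in batches of ~100 rows with a fixed per-batch sleep that went from 0.5s to 0.2s (figures inferred from the rates above, not confirmed against the script):

    before: 150 rows/s / 100 rows/batch = 1.5 batches/s
            sleep per wall-clock second: 1.5 * 0.5s = 0.75s -> sleeping 75%, working 25%
    after:  250 rows/s / 100 rows/batch = 2.5 batches/s
            sleep per wall-clock second: 2.5 * 0.2s = 0.50s -> sleeping 50%, working 50%
    implied work per batch: 0.25s / 1.5 ~= 0.17s before vs 0.50s / 2.5 = 0.20s after,
    i.e. roughly constant, consistent with deletes taking non-zero time.

The extrapolation also checks out: 46 minutes for 0.4% means 46 / 0.004 = 11,500 minutes for 100%, i.e. about 8 days.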
[14:46:19] kormat: I think we may need to go for a --shard parameter rather than a --server one. The SqlBag abstraction doesn't lend itself to arbitrary hostname connections, but it does have a list of server shards we can pick from. The catch being we can't run it against the spare...
[14:46:55] and hence can't do the sql_log_bin=0 approach. damn.
[14:47:06] I guess that's fine for the immediate "run concurrently" case, but we'll need to think about it some more in terms of not doing replication etc.
[14:47:47] unless... you ignore the mw layer altogether 😈
[14:47:47] maybe by then we can just turn replication off entirely if we're active-active. and hope the other mitigations keep us in check until then.
[14:48:19] * Krinkle posts job opening for a DBA engineer to manually wipe parts of the harddrive with a toothbrush.
[14:49:09] I am not suggesting anything crazy, just that, given the need, we set up a cron from root to do pc's cleanup per server
[14:49:22] as a last option
[14:49:46] https://phabricator.wikimedia.org/P16060
[14:49:56] yeah, we could evolve this a bit further and make it production-ready
[14:49:59] not a bad idea jynus :)
[14:50:09] I wouldn't suggest it as the first option
[14:50:24] it is bad design already, on top of the not-great current design
[14:50:42] but I am sure binlog logging takes a considerable chunk of the deletes' workload
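A sketch of what such a per-server root cron could look like in Puppet. This illustrates the idea only — it is not the contents of P16060, and the table name, schedule, and batch size are all hypothetical; a production version would loop over the parsercache tables in batches with a sleep between them:

    # Hypothetical last-resort purge cron on each pc host. Running the
    # DELETE locally with sql_log_bin=0 keeps it out of the binlog (and
    # hence out of replication), which is the point of bypassing the MW
    # layer. Assumes root's .my.cnf provides credentials.
    cron { 'parsercache-local-purge':
        ensure  => present,
        user    => 'root',
        hour    => 1,
        minute  => 0,
        command => 'mysql -e "SET SESSION sql_log_bin=0; DELETE FROM parsercache.pc_example WHERE exptime < NOW() LIMIT 10000;"',
    }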
[14:53:46] does it write the former contents of the row to the binlog, or "just" the sql query?
[14:53:58] yes
[14:54:42] as in both, if using ROW, but I think it uses STATEMENT to avoid conflicts
[14:54:51] we do use statement, yes
[14:55:08] so it's just the query
[14:55:46] uf, innodb_change_buffering=none
[14:56:01] jynus: does your mysql training include improving meme creation skills? I think I need to improve mine, as I'm clearly not fast enough to keep up with the conversation
[14:56:28] kormat, did the slowness coincide with a change of that variable?
[14:56:57] jynus: the "slowness" started > 1 year ago
[14:57:04] ok
[14:57:08] so if that's a recent change, then no :)
[14:57:20] in any case, not sure if we care much about consistency of the pc
[14:57:46] s/consistency of//
[14:58:40] nah, probably not an issue, they only have one non-unique key
[15:01:11] 552.61 hash searches/s, 4999.50 non-hash searches/s
[15:01:35] consider, however, re-enabling it for pc only as a test
[15:02:07] jynus: https://user.fm/files/v2-8bb1ed524dd454edf219b37e384589cd/capture-pc_oil-critical_acclaim.jpg
[15:02:21] only 3 stars?
[15:05:44] _joe_: dcaro: yes sorry, I've been meaning to update that map for a long time :)
[15:05:51] so here is my train of thought - the change buffer was disabled because it apparently added instability with newer mariadb versions
[15:06:09] and it probably didn't have a huge impact on regular metadata or content dbs
[15:06:32] but it may increase the performance of write-heavy dbs like pc, and we may not care about the stability of those so much
[15:09:15] compare that to, e.g., an s3 master: 1705.19 hash searches/s, 2406.91 non-hash searches/s
[15:10:41] ignore me, those are the adaptive index stats. I cannot provide the ibuffer stats because it is disabled everywhere
[15:10:46] but it is something to test, maybe?
[15:12:34] 16:56:57 jynus: the "slowness" started > 1 year ago
[15:13:06] yes, but if this can make deletes faster, what's to lose?
[15:13:58] we have no reason to believe it's relevant?
[15:14:12] iirc that setting was changed a month or two ago
[15:14:31] if it was sufficient to 'fix' deletes, we wouldn't have been in trouble for >1 year
[15:16:12] it was changed in september/october 2020: https://phabricator.wikimedia.org/T263443#6497328
[15:19:02] it was set to none in March. https://phabricator.wikimedia.org/T263443#6890970
[15:21:33] jynus: sept/oct was just a test on a single host, it wasn't set everywhere till a few months ago
[15:21:49] yes, but couldn't it help in this case?
[15:23:30] the change buffer buffers both deletes and purges
[15:26:13] I am just giving ideas; it seems it got worse recently and you don't seem to know exactly why yet
[15:26:44] it could be that, or it could be that the SELECT IN query planning got worse
[15:26:47] it can be many things
[15:27:41] the history summary in the T282761 task description suggests it has been "slower" since at least April 2020.
[15:27:42] T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[15:28:11] it's possible that just the organic increase in traffic tipped us over a point that month
[15:28:41] but yeah, there are definitely other possibilities worth looking into, and even if they didn't cause this, they might still offer value as additional optimisations even if they're not regressions
[15:29:03] I am not fishing for causes, I am trying to give ideas for mitigations
[15:29:22] Aaron was referring to reducing the page splits/merges cost of delete operations by tuning the thresholds for that, but this sounds like it could offer a similar benefit in terms of reducing disk I/O per delete
[15:29:34] and then you can decide what is worth testing first, etc.
[16:03:46] jynus: disabling the change buffer was recommended by mariadb to avoid crashes. we can try re-enabling it for pc only. however, the script got back to decent values (hours) a few weeks ago and grew a lot again a few days ago. I think it is a combination of many things, including the hw using raid 5
[16:04:34] I am not arguing against that decision
[16:04:58] but it could have a penalty on write latency, especially on pc hosts
[16:06:15] is it the reason for the problem? probably not!
[16:06:54] it definitely has a penalty on writes, that's for sure. but that's better than the crashes :-)
[16:06:55] is it easier to test than rebuilding the pc workflow? probably - which is one of the several suggestions I gave you
[16:07:28] several suggestions have been given by many people, including myself
[16:07:45] I also said other things: like doing infra purges without the mw layer, etc
[16:08:14] ok
[16:08:32] but those are just suggestions in case you hadn't thought about them
[16:09:08] if you thought about them and think they are bad or not constructive, no problem! I was just trying to help 0:-)
[16:38:59] FYI, we're going to re-enable the codfw-eqsin Telia link, the issue should have been fixed yesterday
[16:39:19] because now "it's fixed"?
[16:39:20] :D
[16:39:24] but it might not be, so please ping topranks and me if you see signs of problems
[16:39:29] https://gerrit.wikimedia.org/r/c/operations/dns/+/700027 is still there in case it's needed ;)
[16:39:40] but easier to disable the link ofc
[16:42:05] volans: yeah, we got in touch with some staff past their NOC
[16:42:18] great
[17:05:33] ah, good to hear
[17:22:25] If I merge an envoy config change, will puppet automatically deploy it / reload envoy for me, or do I need to do that manually?
[17:22:38] https://gerrit.wikimedia.org/r/c/operations/puppet/+/699425 is the change in question
[17:26:11] legoktm: puppet will automatically hot-restart envoy for you
[17:26:57] thanks
[17:27:41] (happy to give more details on "hot restart" if you haven't run into it and you're interested, but tldr: the right thing will happen for you)
[18:40:37] it's a little surprising to me that if I `systemctl mask` a unit, puppet doesn't undo that
[18:52:25] yeah, the "API" for this stuff isn't great all the way down, unless you apply a consistent way of managing all related fs paths for a unit name in puppet.
[18:53:59] we, I think, tend to have our puppet code and packages do things in /lib/systemd/system and then possibly put fragment overrides in /etc/systemd/...
[18:54:15] but the "mask" command creates a symlink to /dev/null at /etc/systemd/system/foo.service
[18:55:20] (which doesn't conflict with the fragments in /etc/systemd/system/foo.service.d/)
[18:55:32] (doesn't conflict at the FS level, I mean)
[18:57:28] yeah, it just ignores the mask file
[18:57:55] I submitted https://gerrit.wikimedia.org/r/c/operations/puppet/+/701171/ / filed https://phabricator.wikimedia.org/T285425, I think it'll need some more discussion though
[18:59:02] my idea was that before the switchover, we could mask all the mediawiki_job_* units, and then when puppet is re-enabled post-switchover, it would unmask them
[19:00:02] * legoktm checks if puppet will re-enable disabled timers
[19:03:53] it does
[19:06:54] bummer, `systemctl disable` doesn't like wildcards
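For illustration, the masking idea can be expressed in Puppet by managing the /dev/null symlink that `systemctl mask` creates, as described above. A minimal sketch only — the defined type and the unit name are hypothetical, not the actual change under review:

    # Hypothetical helper to mask/unmask a unit at the FS level.
    define mediawiki::unit_mask (Boolean $masked = true) {
        $ensure = $masked ? {
            true    => 'link',
            default => 'absent',
        }
        file { "/etc/systemd/system/${title}":
            ensure => $ensure,
            target => '/dev/null',
        }
        # systemd still needs a `systemctl daemon-reload` to notice
        # the change; omitted here for brevity.
    }

    # Usage, e.g. for one of the mediawiki_job_* timers:
    mediawiki::unit_mask { 'mediawiki_job_example.timer': }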
[19:09:23] something something, systemctl list-units, xargs? 😬
[19:14:25] I feel like I killed a bash script, so I'm allowed to introduce a hacky xargs pipeline
[19:15:56] with cdanis going on leave, it is incumbent on all of us to carry the torch
[19:19:58] lololol
[19:32:37] https://i.pinimg.com/originals/ea/36/e4/ea36e48e1e7f2f89c974d1ce9bd5179f.gif
[19:35:36] cdanis.gif
[19:53:29] tortured bash one-liners are the first step towards working sysop software!
[20:04:59] step two is reimplementing config management via curl|bash, right? :)