[09:45:28] <kormat>	 oh nice. nrpe::monitor_systemd_unit_state accepts a `contact_group` param... and then completely ignores it
[09:52:47] <dcaro>	 it's just to make you feel better :)
[09:53:55] <kormat>	 it didn't _quite_ manage that ;)
[10:09:45] <dcaro>	 pcc (puppet catalog compiler) does it work on stacked patches?
[10:09:50] <dcaro>	 ^ anyone knows?
[10:23:34] <_joe_>	 yes
[10:23:46] <_joe_>	 it does git checkout of the whole patch tree
[10:24:20] <_joe_>	 which also means you should always rebase your patch tree on top of master before running the compiler
[10:24:34] <_joe_>	 I mean rebase and update, ofc
[10:38:02] <dcaro>	 ack thanks
[11:50:05] <Amir1>	 apergos: thanks for merging the patch <3
[11:50:19] <apergos>	 thanks for making it!
[12:37:53] <kormat>	 _joe_: how does the puppet style guide feel about roles including other roles?
[12:40:25] <_joe_>	 kormat: generally frowned upon, but IIRC is allowed
[12:41:27] <kormat>	 _joe_: how about roles that include a single profile, which then includes a bunch of other profiles?
[12:42:41] <_joe_>	 kormat: view it like this: a profile should provide some functionality, and you should be able to compose multiple profiles into what goes onto a series of servers as a role
[12:43:23] <kormat>	 i'm thinking of >1 role that only include (the same) single profile
[12:44:09] <_joe_>	 that might work in some cases, think of the appservers
[12:44:23] <_joe_>	 we have different roles for apis and appservers, but they include the same profiles
[12:44:55] <kormat>	 ok cool. i'll mention in the CR that you signed off on this. thanks!
[12:45:32] <_joe_>	 *cough*
[12:46:14] <kormat>	 "It's not a Stevie puppet CR unless it causes a constitutional puppet-style crisis" (™)
[12:47:20] <_joe_>	 nah it's just you ask abstract questions that have as a general answer "it depends"
[12:47:54] <kormat>	 _joe_: https://phabricator.wikimedia.org/T285390 is the concrete situation
[12:48:43] <_joe_>	 kormat: yes, that's literally the same reason why we have different roles for apis and appservers
[12:49:03] <_joe_>	 they are configured differently, so it's easy to add automation/distinguish them in hiera
[12:52:09] <kormat>	 oh nice role::mariadb::core already includes role::mariadb::ferm.
[13:10:45] <dcaro>	 this is awesome! https://timerelengteam.toolforge.org/ is it a general thing?
[13:16:19] <dcaro>	 it does noot seem so xd
[13:17:46] <_joe_>	 dcaro: we got https://people.wikimedia.org/~cdanis/sremap/ (nda login required)
[13:18:03] <_joe_>	 but looks like it needs to be updated :P
[13:19:03] <_joe_>	 I think chris will do it eventually :)
[13:19:30] <dcaro>	 nice
[13:21:03] <dcaro>	 it misses the name<->nicks part though xd would be awesome to have filtering per-team or similar too
[13:21:43] <dcaro>	 even the skill matrix thingie could be added :)
[13:34:16] <vgutierrez>	 that's maybe out of scope for that tool :)
[13:36:33] <dcaro>	 yeah, I guess that it depends on what problem you are trying to solve xd
[13:41:17] <_joe_>	 imagine the skill matrix for all things SRE
[13:41:23] <_joe_>	 579 columns
[13:42:31] <dcaro>	 I'm happy aggregating whatever is already in the wikis and make it searchable
[14:21:55] <Krinkle>	 kormat: sobanski: https://user.fm/files/v2-3503bf04a68d937a51cc995e848f2b34/calendar-running.png
[14:22:10] <Krinkle>	 I like the image there, very hopeful
[14:22:18] <Krinkle>	 at least google didn't render a snail
[14:22:35] <kormat>	 a more realistic image would have the shoelaces of both shoes tied to the other
[14:22:35] <Krinkle>	 although the laces being untied is very... telling
[14:22:39] <Krinkle>	 yes!
[14:24:04] <sobanski>	 The s/paranoid/realistic side of me immediately thinks of "Gmail doing an image search for the meeting title in the background"
[14:24:14] <sobanski>	 And then I get sad...
[14:28:20] <Krinkle>	 see also "wheel of cheese" - https://xkcd.com/2140/
[14:29:01] <Krinkle>	 fwiw, as I understand it Google Calendar has a list of perhaps a dozen purpose-made artworks that it can pick from based on a list of (localised) words that it may match from the title.
[14:29:31] <Krinkle>	 in this case running->https://ssl.gstatic.com/calendar/images/eventillustrations/v1/img_running_1x.jpg
[14:31:08] <Krinkle>	 kormat: previously, we saw a fairly consistent delete rate of ~150/second on the various servers during the various purge runs. pc1008 is now doing 250/second
[14:31:42] <Krinkle>	 it's not 300 or higher, so that suggests deletes do take non-zero time, as otherwise reducing the sleep by > 50% should have at least doubled that rate afaik
[14:32:00] <Krinkle>	 good to know we're stil in the known underverse where actions take non-zero time
[14:33:30] <kormat>	 😅
[14:34:11] <rzl>	 looking forward to week three of the percona training where we learn to disable replication locality
[14:36:25] <Krinkle>	 but.. I'm also thinking this is quite a good concrete stat. At the risk of getting ahead of myself, I think that means we previously did 1.5 iterations per second (and might as well be 1.5 iterations per 0.25s + 0.75s sleep hwere 0.75=0.5*1.5), and now 2.5 batches per second or (2.5*0.2=0.5s sleep and 2.5 batches in the remaining 0.5s).
[14:37:06] <Krinkle>	 If true, we went from sleeping 75% of the time to sleeping 50% of the time.
[14:39:59] <marostegui>	 Krinkle: nice, and once we switch to raid10 hosts it should also go faster
[14:40:40] <kormat>	 Krinkle: 46mins to get to 0.4%. that extrapolates to ... ~8 days. 😭
[14:46:19] <Krinkle>	 kormat: I think we may need to go for a --shard parameter rather than a --server. The SqlBag abstraction doesn't lend it self to arbitrary hostname connections, but it does have a list of server shards we can pick from. The catch being we can't run it againt the spare...
[14:46:55] <kormat>	 and hence can't do the sql_bin_log=0 approach. damn.
[14:47:06] <Krinkle>	 I guess that's fine for the immediate "run concurrently" case, but we'll need to think about it some more in terms of not doing replication etc.
[14:47:47] <jynus>	 unless... you ignore mw layer altogether 😈
[14:47:47] <Krinkle>	 maybe by then we can just turn replicatin off entirely if we're active active. and hope the other mitigations keep us in check until then.
[14:48:19] * Krinkle posts job opening for DBA engineer to manually wipe parts of the harddrive with a toothbrush.
[14:49:09] <jynus>	 I am not suggesting anything crazy, just that, given the need, we setup a cron from root to do pc's cleanup per server
[14:49:22] <jynus>	 as a last option
[14:49:46] <Krinkle>	 https://phabricator.wikimedia.org/P16060
[14:49:56] <Krinkle>	 yeah, we could evolve this a bit further and make it production-ready
[14:49:59] <Krinkle>	 not a bad idea jynus :)
[14:50:09] <jynus>	 I wouldn't suggest that as the first option
[14:50:24] <jynus>	 it is bad design already on top of not great current design
[14:50:42] <jynus>	 but I am sure binlog logging takes a considerable chunk of delete's workload
[14:53:46] <Krinkle>	 does it rewrite the former contents of the row to the binlog, or "just" the sql query?
[14:53:58] <jynus>	 yes
[14:54:42] <jynus>	 as in both, if using row, but I think it uses STATEMENT to avoid conflicts
[14:54:51] <kormat>	 we do use statement, yes
[14:55:08] <kormat>	 so it's just the query
[14:55:46] <jynus>	 uf innodb_change_buffering=none
[14:56:01] <Krinkle>	 jynus: does your mysql training include improving meme creation skills? I think I need to improve mine, as I"m clearly not fast enough to keep up with the conversation
[14:56:28] <jynus>	 kormat, did the slowness coincide with a change of that variable?
[14:56:57] <kormat>	 jynus: the "slowness" started > 1 year ago
[14:57:04] <jynus>	 ok
[14:57:08] <kormat>	 so if that's a recent change, then no :)
[14:57:20] <jynus>	 in any case, not sure if we care much about consistency of the pc
[14:57:46] <kormat>	 s/consistency of//
[14:58:40] <jynus>	 nah, probably not an issue, they only have one non-unique key
[15:01:11] <jynus>	 552.61 hash searches/s, 4999.50 non-hash searches/s
[15:01:35] <jynus>	 consider, however doing test reenabling it for pc only
[15:02:07] <Krinkle>	 jynus: https://user.fm/files/v2-8bb1ed524dd454edf219b37e384589cd/capture-pc_oil-critical_acclaim.jpg
[15:02:21] <jynus>	 only 3 stars?
[15:05:44] <cdanis>	 _joe_: dcaro: yes sorry, I've been meaning to update that map for a long time :)
[15:05:51] <jynus>	 so here is my train of thought- change buffer was disabled because apparently added instability for newer mariadb versions
[15:06:09] <jynus>	 and probably didn't had a huge impact on regular metadata or content dbs
[15:06:32] <jynus>	 but it may increase the performance on write-heavy dbs like pc, and we may not care about stability of those so much
[15:09:15] <jynus>	 compare that to, eg. and s3 master: 1705.19 hash searches/s, 2406.91 non-hash searches/s
[15:10:41] <jynus>	 ignore me that is the adaptive index stats, I cannot provide the ibuffer stats because it is disable everywhere
[15:10:46] <jynus>	 but it is something to test, maybe?
[15:12:34] <kormat>	 16:56:57 <kormat> jynus: the "slowness" started > 1 year ago
[15:13:06] <jynus>	 yes, but if this can make deletes faster, what's to lose?
[15:13:58] <kormat>	 we have no reason to believe it's relevant?
[15:14:12] <kormat>	 iirc that setting was changed a month or two ago
[15:14:31] <kormat>	 if it was sufficient to 'fix' deletes, we wouldn't have been in trouble for >1 year
[15:16:12] <jynus>	 it was changed in september/october 2020: https://phabricator.wikimedia.org/T263443#6497328
[15:19:02] <kormat>	 it was set to none in March. https://phabricator.wikimedia.org/T263443#6890970
[15:21:33] <marostegui>	 jynus: sept/Oct was just a test on a single host, it wasn't set everywhere till a few months ago
[15:21:49] <jynus>	 yes, but couldn't it help in this case?
[15:23:30] <jynus>	 change buffer buffers both deletes and purges
[15:26:13] <jynus>	 I am just giving ideas, it seems it got worse recently and you don't seem to know exactly why yet
[15:26:44] <jynus>	 it could be that, it could be the SELECT IN query planning got worse
[15:26:47] <jynus>	 can be many things
[15:27:41] <Krinkle>	 the history summary of T282761 task description suggests it has been "slower" since at least April 2020.
[15:27:42] <stashbot>	 T282761: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761
[15:28:11] <Krinkle>	 its possible that maybe just the organic increase in traffic has tipped us over a point that month
[15:28:41] <Krinkle>	 but yeah, def other possibilities worth looking into, and even if not caused by it, might still offer value in terms of additional optimisations even if they're not regressions
[15:29:03] <jynus>	 I am not fishing for causes I am trying to give ideas for mitigations
[15:29:22] <Krinkle>	 Aaron was referring to reducing page splits/merges cost from delete operations by tuning the thresholds for that, but this sounds like it would potentially offer a similar benefit in terms of reducing disk I/O per delete
[15:29:34] <jynus>	 and then you can decide what is worth testing first, etc.
[16:03:46] <marostegui>	 jynus: the change buffer was recommended by mariadb to avoid crashes. we can try it to purges only. however, the script got back to decent values (hours) a few weeks ago and grew a lot again a few days ago. it think it is a combination of many things, including hw using raid 5
[16:04:34] <jynus>	 I am not arguing against that decision
[16:04:58] <jynus>	 but it could have a penalty on write latency, specially on pc hosts
[16:06:15] <jynus>	 is is the reasonfor the problem? probably not!
[16:06:54] <marostegui>	 it definitely has a penalty in writes, that's for sure. but that's better than the crashes :-)
[16:06:55] <jynus>	 is it easier to test than rebuilding the pc workflow? Probably- which is one of the several suggestions I gave you
[16:07:28] <marostegui>	 several suggestions have been given by many people, including myself
[16:07:45] <jynus>	 I also said other things: like doing infra purges without mw layer, etc
[16:08:14] <marostegui>	 ok
[16:08:32] <jynus>	 but those are just suggestions in case you hadn't thought about them
[16:09:08] <jynus>	 if you tought them and think they are bad or not constructive, no problem! I was just trying to help 0:-)
[16:38:59] <XioNoX>	 FYI, we're going to re-enabled the codfw-eqsin Telia link, the issue should have been fixed yesterday
[16:39:19] <volans>	 because now "it's fixed"?
[16:39:20] <volans>	 :D
[16:39:24] <XioNoX>	 but it might not, so please ping topranks and I if you see signs of problems
[16:39:29] <volans>	 https://gerrit.wikimedia.org/r/c/operations/dns/+/700027 is still there in case it's needed ;)
[16:39:40] <volans>	 but easier to disable the link ofc
[16:42:05] <XioNoX>	 volans: yeah, we got in touch with some staff past their NOC
[16:42:18] <volans>	 great
[17:05:33] <cdanis>	 ah good to hear
[17:22:25] <legoktm>	 If I merge an envoy config change, will puppet automatically deploy it / reload envoy for me, or do I need to do that manually?
[17:22:38] <legoktm>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/699425 is the change in question
[17:26:11] <rzl>	 legoktm: puppet will automatically hot-restart envoy for you
[17:26:57] <legoktm>	 thanks
[17:27:41] <rzl>	 (more details on "hot restart" if you haven't run into it and you're interested, but tldr the right thing will happen for you)
[18:40:37] <legoktm>	 it's a little surprising to me that if I `systemctl mask` a unit, puppet doesn't undo that
[18:52:25] <bblack>	 yeah the "API" for this stuff isn't great all the way down, unless you apply a consistent way of managing all related fs paths for a unit name in puppet.
[18:53:59] <bblack>	 we, I think, tend to have our puppets and packages do things in /lib/systemd/system and then possibly put fragment overrides in /etc/systemd/...
[18:54:15] <bblack>	 but the "mask" command creates a symlink to /dev/null at /etc/systemd/system/foo.service
[18:55:20] <bblack>	 (which doesn't conflict with the fragments in /etc/systemd/system/foo.service.d/)
[18:55:32] <bblack>	 (doesn't conflict at the FS level, I mean)
[18:57:28] <legoktm>	 yeah, it just ignores the mask file
[18:57:55] <legoktm>	 I submited https://gerrit.wikimedia.org/r/c/operations/puppet/+/701171/ / filed https://phabricator.wikimedia.org/T285425, I think it'll need some more discussion though
[18:59:02] <legoktm>	 my idea was that before the switchover, we could mask all the mediawiki_job_* units, and then when puppet is re-enabled post switchover, it would unmask them
[19:00:02] * legoktm checks if puppet will re-enable disabled timers
[19:03:53] <legoktm>	 it does
[19:06:54] <legoktm>	 bummer, `systemctl disable` doesn't like wildcards
[19:09:23] <rzl>	 something something, systemctl list-units, xargs? 😬
[19:14:25] <legoktm>	 I feel like I killed a bash script so I'm allowed to introduce a hacky xargs pipeline
[19:15:56] <rzl>	 with cdanis going on leave, it is incumbent on all of us to carry the torch
[19:19:58] <apergos>	 lololol
[19:32:37] <cdanis>	 https://i.pinimg.com/originals/ea/36/e4/ea36e48e1e7f2f89c974d1ce9bd5179f.gif
[19:35:36] <kormat>	 cdanis.gif
[19:53:29] <bd808>	 tortured bash one-liners are the first step towards working sysop software!
[20:04:59] <bblack>	 step two is reimplementing config management via curl|bash right? :)