[07:44:21] can something be done about the processor alert that has been going on and off during the weekend?
[08:02:44] yeah I'm going to have a look
[08:02:58] it's actually several console servers
[08:15:58] it's weird that so many of them started reporting 100% at the same time
[08:20:46] annnd their CPU is fine, so I guess it's an snmp/reporting issue
[08:31:48] apergos: I have some dumps cron->systemd patches for you to take a look at when you have some time
[08:31:53] Thanks <3
[08:33:52] Amir1: I saw them! I'm theoretically off today (bank holiday) so I'll likely look at them tomorrow.
[08:34:16] where "look at" means "merge if ok".
[08:34:51] Thanks. No rush.
[08:34:55] Enjoy your holiday
[08:42:18] thank you! may you have a productive low-stress day!
[08:47:03] I disabled the CPU alert in librenms for now (cc topranks)
[09:07:49] nrpe::monitor_systemd_unit_state takes a param, `$check_interval`. that gets passed to `nrpe::monitor_service`, which documents all params _except_ that one, and passes it to `monitoring::service`, which doesn't document it at all. can anyone tell me what the param means?
[09:12:12] kormat: how often icinga should perform the check, see https://icinga.com/docs/icinga1/latest/en/objectdefinitions.html#objectdefinitions-service
[09:12:45] when the check passes
[09:14:05] ahh. thank you!
[09:14:30] ok, so the default of `$check_interval=1` means it'll check every minute
[09:14:42] yes
[09:14:45] more or less
[09:14:46] that's pretty terrible, but no worse than expected
[09:21:15] why is that terrible?
[09:30:38] joe: it's a long-ass time for some things
[09:31:18] in this particular case, pt-heartbeat. if it's not running on a db primary, bad things are going to happen veery quickly
[09:35:35] which is exacerbated by the lack of alert aggregation in icinga. _all_ replicas are going to alert due to 'lag'
[09:35:58] (well, also lack of alert dependencies/suppression)
[09:36:25] kormat: while technically interval_length is configurable in icinga.cfg to something more granular (but we'd need to adjust all our existing check_interval values), with the current latency on icinga it wouldn't matter much
[09:36:36] (min/max/avg 0.20 sec 54.82 sec 51.667 sec)
[09:36:45] volans: uff. ack.
[09:37:01] so I'd say that icinga is the wrong tool if you want something sub-minute or anything quicker
[09:37:13] but that's my personal opinion
[09:37:23] s/if you want .*//
[09:37:37] lol
[09:38:09] check with o11y if alertmanager might fit the use case, but you need a prometheus metric for that to alert
[09:38:12] AFAIK
[09:38:42] I guess you are looking at something on the order of once every second?
[09:39:07] at $LASTJOB we had prometheus doing a scrape every 5s
[09:39:32] 5s is not unreasonable
[09:39:54] volans: aye. and using the node-exporter's textfile feature for that doesn't get you below 60s if you're using cron to drive it anyway
[09:40:33] what does mw do if all replicas are lagging?
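A minimal sketch of the sub-minute textfile-collector approach discussed at 09:38-09:40: since cron cannot fire more often than once a minute, a small loop rewrites the .prom file itself every few seconds and node-exporter exposes it on each Prometheus scrape. The metric name, the systemd unit being checked, and the textfile directory below are assumptions for illustration, not the production setup.

```python
#!/usr/bin/env python3
"""Illustrative only: publish a node-exporter textfile metric more often than
cron's one-minute floor by looping in a small daemon."""

import os
import subprocess
import tempfile
import time

TEXTFILE_DIR = "/var/lib/prometheus/node.d"  # assumed textfile collector dir
INTERVAL = 5  # seconds, matching the "scrape every 5s" example above


def pt_heartbeat_running() -> int:
    # Hypothetical check: is a pt-heartbeat systemd unit active right now?
    rc = subprocess.call(["systemctl", "is-active", "--quiet", "pt-heartbeat"])
    return 1 if rc == 0 else 0


def write_metric(value: int) -> None:
    # Write atomically: node-exporter must never read a half-written file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write("# TYPE pt_heartbeat_running gauge\n")
        f.write(f"pt_heartbeat_running {value}\n")
    os.chmod(tmp, 0o644)
    os.rename(tmp, os.path.join(TEXTFILE_DIR, "pt_heartbeat.prom"))


if __name__ == "__main__":
    while True:
        write_metric(pt_heartbeat_running())
        time.sleep(INTERVAL)
```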
[09:40:38] "lagging"
[09:42:03] Set master DB handles as read-only if there is high replication lag
[09:42:19] I'm just grepping around because it's been a while and my knowledge has long since bitrotted
[09:42:36] that's in ./includes/libs/rdbms/loadbalancer/LoadBalancer.php
[09:43:29] apergos: I don't even know what gerrit repo that would be in ;)
[09:43:50] but in any case, yeah, that sounds about what I'd expect
[09:44:31] https://github.com/wikimedia/mediawiki/blob/master/includes/libs/rdbms/loadbalancer/LoadBalancer.php#L484
[09:44:36] this is where the check is done
[09:44:40] so yeah that's basically it
[09:45:52] 👍
[10:09:41] next attempt to migrate production netbox to use SSO CAS (XioNoX topranks)
[10:11:07] 👍
[10:45:14] can someone update the topic in operations and set me as clinic duty (or give me the flags to do so)? thx
[10:46:34] fyi the netbox migration should be done, please let me know if there are any issues
[10:47:17] I spoke too soon, one more change :)
[11:00:22] done
[11:05:10] marostegui: thanks
[11:17:08] and the netbox cas migration is done now
[11:18:00] thx!
[11:22:37] \o/
[11:23:23] great
[16:39:05] yay! <3 SSO
[21:02:41] legoktm: are you trying to switch everything this year?
[21:03:00] no, just... a little bit more.
[21:03:43] I've noticed a lot being done so good luck
[21:03:49] :)
[21:03:53] But I'm pretty sure you won't need luck
[22:09:44] legoktm: "if duration > 0.95 * last_durat"
[22:09:47] this is so good
[22:09:49] I love it
[22:09:54] when did this get ported to python?
[22:10:11] not me!
[22:10:44] oh, the "rerun until it converges" thing? that was me, circa the last switchback
[22:11:02] looks like rzl in Change-Id: If9f7acc914d21f945157c83b001aa742be58cb5e
[22:11:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/e51bab27e425cf782d48b29742fd208d6d6da08c
[22:11:28] rzl: this commit message suggests there is (was?) a Node.js version as well
[22:11:31] at the same time
[22:11:52] the node tool is mediawiki-cache-warmup in puppet
[22:11:53] this python code has existed for 2+ years apparently
[22:12:01] maybe it wasn't used?
[22:12:04] oh, didn't mean to imply that -- just that we were talking about separately moving the warmup script itself (which is your node.js) into python
[22:12:09] this python calls into that node script
[22:12:23] ohh
[22:12:27] I completely missed that
[22:12:29] right, that's still there
[22:13:00] ah, it's applying the 0.95 to the whole nodejs run of all urls and servers together
[22:13:01] interesting
[22:13:44] right -- which means it's sensitive to the speedup of the slowest host, but that's probably correct, and in any event it's what we were already doing by hand :P
[22:14:01] so I guess we don't mind letting the script take longer and just run the per-server urls on both app + app_api servers?
[22:14:12] I think we don't need to do the per-cluster urls twice though
[22:14:35] IMO that should be fine -- obviously this all runs before we go RO, so taking longer is no big deal
[22:15:08] are you sure? note they're two separate clusters, so as-is I don't think we were sending *any* traffic to the api servers, even the per-cluster urls
[22:15:13] I might be missing something though
[22:15:30] oh wait no I see what you're saying, hmmm
[22:16:04] the cluster-wide urls are mainly for warming DBs, Memc, etc.
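A minimal sketch of the "rerun until it converges" logic quoted at 22:09 ("if duration > 0.95 * last_durat"): keep re-running the warmup pass until a run is no more than roughly 5% faster than the previous one, i.e. the caches have stopped getting meaningfully warmer. This is only an illustration of the idea, not the actual cookbook code; the real cookbook wraps the mediawiki-cache-warmup Node.js tool, and the command line below is a placeholder.

```python
#!/usr/bin/env python3
"""Illustrative sketch: rerun a warmup command until its duration converges."""

import subprocess
import time

# Placeholder invocation; the real tool and arguments differ.
WARMUP_CMD = ["nodejs", "warmup.js", "urls-cluster.txt", "appserver", "eqiad"]


def run_warmup() -> float:
    """Run one warmup pass and return how long it took, in seconds."""
    start = time.monotonic()
    subprocess.run(WARMUP_CMD, check=True)
    return time.monotonic() - start


def warm_until_converged() -> None:
    last_duration = run_warmup()
    while True:
        duration = run_warmup()
        # Converged: this run took at least 95% as long as the previous one,
        # so reruns are no longer speeding things up in any meaningful way.
        if duration > 0.95 * last_duration:
            break
        last_duration = duration


if __name__ == "__main__":
    warm_until_converged()
```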
[22:16:08] I guess the answer to what I'm saying is, maybe there should be a per-appserver-cluster and a per-APIserver-cluster, but if we also don't need anything in that last set, that's fine
[22:16:26] nod
[22:16:28] well, I think we should definitely run /something/ on each api app server
[22:16:38] for opcache, etcd, apcu etc.
[22:16:40] the per-server stuff, yeah
[22:16:57] but maybe just anything at all will suffice there
[22:16:58] I'm behind you but I'm catching up :) I agree we don't need to run the per-cluster stuff twice
[22:17:17] even if it's load.php.
[22:18:15] anything at all would be a big improvement -- I'm not sure what we need to warm them up fully, but even something minimal would be a big step up
[22:19:10] yeah, as-is if we apply half of legoktm's current patch, that would mean the api servers get load.php run on them. which means etcd/apcu gets warmed up, plus all shared MW code for all reqs incl a good chunk of opcache
[22:20:04] but we could also split per-server into per-appserver and per-apiappserver and add a light api.php query to the latter instead of load.php
[22:20:30] the reason I included per-cluster in there was because it was the only list that has an api.php URL, so it seemed like someone intended it to run against the API cluster
[22:20:34] maybe not recentchanges though, as that would 300x db load during warmup and probably not measure the right thing
[22:21:09] the api query was mainly meant to warm up the related dbs and memc backends, not the webserver itself per se
[22:21:17] ah
[22:21:36] so I was on the right track adding something to the per-server list
[22:21:37] but it was indeed a mistake not to include apiservers at all in the per-server iteration
[22:21:52] so the per-server list should probably run over both appserver and apiappserver
[22:22:13] but whether the cluster-wide warmup runs over one lb vs the other doesn't make much difference I think
[22:41:27] legoktm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/700712/
[22:53:34] legoktm: maybe a siteinfo query for per-server urls would make sense, that might give it a bit more opcache coverage
[22:54:11] note that it runs against all large wikis, including non-English/non-wikipedia
[22:55:13] for load.php this doesn't make much difference, but for api.php we'll want something that ideally doesn't hit a specific title (plus, that'd be more of a cluster-warmup for dbs/memc/parsercache).
[22:55:34] Speaking of which, my cluster-wide urls for main_page probably aren't ideal either, but I suspect all those have redirects in place
[22:55:55] See also https://phabricator.wikimedia.org/T120085
[22:56:17] for Fresnel, where I didn't want a redirect, I worked around it by targeting /w/?_mainpagehack=1
[22:56:25] which works because the default title is the local main page
[22:56:32] and with a non-empty query param, it won't redirect
[23:05:44] rzl: effie: btw, we're enabling the on-host tier for the big one (wancache) on beta cluster now
[23:07:41] XioNoX: I'm going to merge "Remove duplicated fake netbox keys" from labs/private
[23:12:48] https://en.wikipedia.org/w/api.php?modules=query+siteinfo
[23:13:22] I think namespaces|specialpagealiases|magicwords|languages|extensiontags would hit a decent amount
[23:23:46] Krinkle: how's https://gerrit.wikimedia.org/r/c/operations/puppet/+/700716 ?
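To make the per-server point above concrete, here is a minimal sketch of iterating the per-server warmup URLs over both the appserver and api_appserver clusters rather than only one of them. The host names, the URL list, and the plain Host-header approach are assumptions for illustration only; the real tool is mediawiki-cache-warmup in puppet and its server lists come from elsewhere.

```python
#!/usr/bin/env python3
"""Illustrative sketch: run per-server warmup URLs against both clusters."""

import urllib.request

# Hypothetical per-server URLs: something light that still warms opcache/APCu.
PER_SERVER_URLS = [
    "/w/load.php?modules=startup&only=scripts",
]

# Hypothetical host lists; in production these would come from service discovery.
CLUSTERS = {
    "appserver": ["mw1414.example.internal", "mw1415.example.internal"],
    "api_appserver": ["mw1312.example.internal", "mw1313.example.internal"],
}


def warm_host(host: str, path: str) -> None:
    # Hit the backend directly but send a canonical Host header so MediaWiki
    # routes the request to a real wiki.
    req = urllib.request.Request(
        f"http://{host}{path}", headers={"Host": "en.wikipedia.org"}
    )
    urllib.request.urlopen(req, timeout=10).read()


if __name__ == "__main__":
    for cluster, hosts in CLUSTERS.items():
        for host in hosts:
            for path in PER_SERVER_URLS:
                warm_host(host, path)
```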
[23:25:13] legoktm: yeah, LGTM
[23:25:22] also hits a fair amount of Language stuff I believe
[23:26:46] https://gerrit.wikimedia.org/g/mediawiki/core/+/2ec406ecc8fcae175a7afe4f2dab5b9f5d44cd70/includes/api/ApiQuerySiteinfo.php#740 looks like someone optimized it, so it doesn't hit Language anymore
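For reference, a minimal sketch of the siteinfo-style warmup request discussed at 23:12-23:13: an api.php query that touches no specific title but still exercises namespaces, special page aliases, magic words, languages, and extension tags. The wiki list is a placeholder; per the discussion above, the real warmup runs against all large wikis.

```python
#!/usr/bin/env python3
"""Illustrative sketch: a siteinfo api.php request as a per-server warmup URL."""

from urllib.parse import urlencode
import urllib.request

SIPROP = "namespaces|specialpagealiases|magicwords|languages|extensiontags"
# Placeholder wiki list; the real warmup covers all large wikis.
WIKIS = ["en.wikipedia.org", "de.wikipedia.org", "commons.wikimedia.org"]


def siteinfo_url(wiki: str) -> str:
    # Standard MediaWiki Action API siteinfo query; no specific title involved.
    query = urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": SIPROP,
        "format": "json",
    })
    return f"https://{wiki}/w/api.php?{query}"


if __name__ == "__main__":
    for wiki in WIKIS:
        urllib.request.urlopen(siteinfo_url(wiki), timeout=10).read()
```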