[07:44:21] can something be done about the processor alert that has been going on and off during the weekend?
[08:02:44] yeah I'm going to have a look
[08:02:58] it's actually several console servers
[08:15:58] it's weird that so many of them started reporting 100% at the same time
[08:20:46] annnd their CPU is fine, so I guess it's an snmp/reporting issue
[08:31:48] apergos: I have some dumps cron->systemd patches for you to take a look at when you have some time
[08:31:53] Thanks <3
[08:33:52] Amir1: I saw them! I'm theoretically off today (bank holiday) so I'll likely look at them tomorrow.
[08:34:16] where "look at" means "merge if ok".
[08:34:51] Thanks. No rush.
[08:34:55] Enjoy your holiday
[08:42:18] thank you! may you have a productive low-stress day!
[08:47:03] I disabled the CPU alert in librenms for now (cc topranks)
[09:07:49] nrpe::monitor_systemd_unit_state takes a param, `$check_interval`. that gets passed to `nrpe::monitor_service`, which documents all params _except_ that one, and passes it to `monitoring::service`, which doesn't document it at all. can anyone tell me what the param means?
[09:12:12] kormat: how often icinga should perform the check, see https://icinga.com/docs/icinga1/latest/en/objectdefinitions.html#objectdefinitions-service
[09:12:45] when the check passes
[09:14:05] ahh. thank you!
[09:14:30] ok, so the default of `$check_interval=1` means it'll check every minute
[09:14:42] yes
[09:14:45] more or less
[09:14:46] that's pretty terrible, but no worse than expected
[09:21:15] why is that terrible?
[09:30:38] joe: it's a long-ass time for some things
[09:31:18] in this particular case, pt-heartbeat. if it's not running on a db primary, bad things are going to happen veery quickly
[09:35:35] which is exacerbated by the lack of alert aggregation in icinga. _all_ replicas are going to alert due to 'lag'
[09:35:58] (well, also lack of alert dependencies/suppression)
[09:36:25] kormat: while technically interval_length is configurable in icinga.cfg to something more granular (but we'd need to adjust all our existing check_interval values), with the current latency on icinga it wouldn't matter much
[09:36:36] (min/max/avg 0.20 sec 54.82 sec 51.667 sec)
[09:36:45] volans: uff. ack.
[09:37:01] so I'd say that icinga is the wrong tool if you want something sub-minute or anything quicker
[09:37:13] but that's my personal opinion
[09:37:23] s/if you want .*//
[09:37:37] lol
[09:38:09] check with o11y if alertmanager might fit the use case, but you need a prometheus metric for that to alert
[09:38:12] AFAIK
[09:38:42] I guess you are looking at something on the order of once every second?
[09:39:07] at $LASTJOB we had prometheus doing a scrape every 5s
[09:39:32] 5s is not unreasonable
[09:39:54] volans: aye. and using the node-exporter's textfile feature for that doesn't get you below 60s if you're using cron to drive it anyway
[09:40:33] what does mw do if all replicas are lagging?
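A minimal sketch of the sub-minute textfile-collector approach discussed at 09:38-09:40: since cron cannot fire more often than once a minute, a small loop rewrites the .prom file itself every few seconds and node-exporter exposes it on each Prometheus scrape. The metric name, the systemd unit being checked, and the textfile directory below are assumptions for illustration, not the production setup.

```python
#!/usr/bin/env python3
"""Illustrative only: publish a node-exporter textfile metric more often than
cron's one-minute floor by looping in a small daemon."""

import os
import subprocess
import tempfile
import time

TEXTFILE_DIR = "/var/lib/prometheus/node.d"  # assumed textfile collector dir
INTERVAL = 5  # seconds, matching the "scrape every 5s" example above


def pt_heartbeat_running() -> int:
    # Hypothetical check: is a pt-heartbeat systemd unit active right now?
    rc = subprocess.call(["systemctl", "is-active", "--quiet", "pt-heartbeat"])
    return 1 if rc == 0 else 0


def write_metric(value: int) -> None:
    # Write atomically: node-exporter must never read a half-written file.
    fd, tmp = tempfile.mkstemp(dir=TEXTFILE_DIR)
    with os.fdopen(fd, "w") as f:
        f.write("# TYPE pt_heartbeat_running gauge\n")
        f.write(f"pt_heartbeat_running {value}\n")
    os.chmod(tmp, 0o644)
    os.rename(tmp, os.path.join(TEXTFILE_DIR, "pt_heartbeat.prom"))


if __name__ == "__main__":
    while True:
        write_metric(pt_heartbeat_running())
        time.sleep(INTERVAL)
```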
[09:40:38] "lagging"
[09:42:03] Set master DB handles as read-only if there is high replication lag
[09:42:19] I'm just grepping around because it's been a while and my knowledge has long since bitrotted
[09:42:36] that's in ./includes/libs/rdbms/loadbalancer/LoadBalancer.php
[09:43:29] apergos: I don't even know what gerrit repo that would be in ;)
[09:43:50] but in any case, yeah, that sounds about what I'd expect
[09:44:31] https://github.com/wikimedia/mediawiki/blob/master/includes/libs/rdbms/loadbalancer/LoadBalancer.php#L484
[09:44:36] this is where the check is done
[09:44:40] so yeah that's basically it
[09:45:52] 👍
[10:09:41] next attempt to migrate production netbox to use SSO CAS (XioNoX topranks)
[10:11:07] 👍
[10:45:14] can someone update the topic in operations and set me as clinic duty (or give me the flags to do so)? thx
[10:46:34] fyi the netbox migration should be done, please let me know if there are any issues
[10:47:17] I spoke too soon, one more change :)
[11:00:22] done
[11:05:10] marostegui: thanks
[11:17:08] and the netbox cas migration is done now
[11:18:00] thx!
[11:22:37] \o/
[11:23:23] great
[16:39:05] yay! <3 SSO
[21:02:41] legoktm: are you trying to switch everything this year?
[21:03:00] no, just... a little bit more.
[21:03:43] I've noticed a lot being done so good luck
[21:03:49] :)
[21:03:53] But I'm pretty sure you won't need luck
[22:09:44] legoktm: "if duration > 0.95 * last_durat"
[22:09:47] this is so good
[22:09:49] I love it
[22:09:54] when did this get ported to python?
[22:10:11] not me!
[22:10:44] oh, the "rerun until it converges" thing? that was me, circa the last switchback
[22:11:02] looks like rzl in Change-Id: If9f7acc914d21f945157c83b001aa742be58cb5e
[22:11:12] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/e51bab27e425cf782d48b29742fd208d6d6da08c
[22:11:28] rzl: this commit message suggests there is (was?) a Node.js version as well
[22:11:31] at the same time
[22:11:52] the node tool is mediawiki-cache-warmup in puppet
[22:11:53] this python code has existed for 2+ years apparently
[22:12:01] maybe it wasn't used?
[22:12:04] oh, didn't mean to imply that -- just that we were talking about separately moving the warmup script itself (which is your node.js) into python
[22:12:09] this python calls into that node script
[22:12:23] ohh
[22:12:27] I completely missed that
[22:12:29] right, that's still there
[22:13:00] ah, it's applying the 0.95 to the whole nodejs run of all urls and servers together
[22:13:01] interesting
[22:13:44] right -- which means it's sensitive to the speedup of the slowest host, but that's probably correct, and in any event it's what we were already doing by hand :P
[22:14:01] so I guess we don't mind letting the script take longer and just run the per-server urls on both app + app_api servers?
[22:14:12] I think we don't need to do the per-cluster urls twice though
[22:14:35] IMO that should be fine -- obviously this all runs before we go RO, so taking longer is no big deal
[22:15:08] are you sure? note they're two separate clusters, so as-is I don't think we were sending *any* traffic to the api servers, even the per-cluster urls
[22:15:13] I might be missing something though
[22:15:30] oh wait no I see what you're saying, hmmm
[22:16:04] the cluster-wide urls are mainly for warming DBs, Memc, etc.
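A minimal sketch of the "rerun until it converges" logic quoted at 22:09 ("if duration > 0.95 * last_durat"): keep re-running the warmup pass until a run is no more than roughly 5% faster than the previous one, i.e. the caches have stopped getting meaningfully warmer. This is only an illustration of the idea, not the actual cookbook code; the real cookbook wraps the mediawiki-cache-warmup Node.js tool, and the command line below is a placeholder.

```python
#!/usr/bin/env python3
"""Illustrative sketch: rerun a warmup command until its duration converges."""

import subprocess
import time

# Placeholder invocation; the real tool and arguments differ.
WARMUP_CMD = ["nodejs", "warmup.js", "urls-cluster.txt", "appserver", "eqiad"]


def run_warmup() -> float:
    """Run one warmup pass and return how long it took, in seconds."""
    start = time.monotonic()
    subprocess.run(WARMUP_CMD, check=True)
    return time.monotonic() - start


def warm_until_converged() -> None:
    last_duration = run_warmup()
    while True:
        duration = run_warmup()
        # Converged: this run took at least 95% as long as the previous one,
        # so reruns are no longer speeding things up in any meaningful way.
        if duration > 0.95 * last_duration:
            break
        last_duration = duration


if __name__ == "__main__":
    warm_until_converged()
```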
[22:16:08] I guess the answer to what I'm saying is, maybe there should be a per-appserver-cluster and a per-APIserver-cluster, but if we also don't need anything in that last set, that's fine
[22:16:26] nod
[22:16:28] well, I think we should definitely run /something/ on each api app server
[22:16:38] for opcache, etcd, apcu etc.
[22:16:40] the per-server stuff, yeah
[22:16:57] but maybe just anything at all will suffice there
[22:16:58] I'm behind you but I'm catching up :) I agree we don't need to run the per-cluster stuff twice
[22:17:17] even if it's load.php.
[22:18:15] anything at all would be a big improvement -- I'm not sure what we need to warm them up fully, but even something minimal would be a big step up
[22:19:10] yeah, as-is if we apply half of legoktm's current patch, that would mean the api servers get load.php run on them. which means etcd/apcu gets warmed up, plus all shared MW code for all reqs incl a good chunk of opcache
[22:20:04] but we could also split per-server into per-appserver and per-apiappserver and add a light api.php query to the latter instead of load.php
[22:20:30] the reason I included per-cluster in there was because it was the only list that has an api.php URL, so it seemed like someone intended it to run against the API cluster
[22:20:34] maybe not recentchanges though, as that would 300x db load during warmup and probably not measure the right thing
[22:21:09] the api query was mainly meant to warm up the related dbs and memc backends, not the webserver itself per se
[22:21:17] ah
[22:21:36] so I was on the right track adding something to the per-server list
[22:21:37] but it was indeed a mistake not to include apiservers at all in the per-server iteration
[22:21:52] so the per-server list should probably run over both appserver and apiappserver
[22:22:13] but whether the cluster-wide warmup runs over one lb vs the other doesn't make much difference I think
[22:41:27] legoktm: https://gerrit.wikimedia.org/r/c/operations/puppet/+/700712/
[22:53:34] legoktm: maybe a siteinfo query for per-server urls would make sense, that might give it a bit more opcache coverage
[22:54:11] note that it runs against all large wikis, including non-English/non-wikipedia
[22:55:13] for load.php this doesn't make much difference, but for api.php we'll want something that ideally doesn't hit a specific title (plus, that'd be more of a cluster-warmup for dbs/memc/parsercache).
[22:55:34] Speaking of which, my cluster-wide urls for main_page probably aren't ideal either, but I suspect all those have redirects in place
[22:55:55] See also https://phabricator.wikimedia.org/T120085
[22:56:17] for Fresnel, where I didn't want a redirect, I worked around it by targeting /w/?_mainpagehack=1
[22:56:25] which works because the default title is the local main page
[22:56:32] and with a non-empty query param, it won't redirect
[23:05:44] rzl: effie: btw, we're enabling the on-host tier for the big one (wancache) on beta cluster now
[23:07:41] XioNoX: I'm going to merge "Remove duplicated fake netbox keys" from labs/private
[23:12:48] https://en.wikipedia.org/w/api.php?modules=query+siteinfo
[23:13:22] I think namespaces|specialpagealiases|magicwords|languages|extensiontags would hit a decent amount
[23:23:46] Krinkle: how's https://gerrit.wikimedia.org/r/c/operations/puppet/+/700716 ?
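To make the per-server point above concrete, here is a minimal sketch of iterating the per-server warmup URLs over both the appserver and api_appserver clusters rather than only one of them. The host names, the URL list, and the plain Host-header approach are assumptions for illustration only; the real tool is mediawiki-cache-warmup in puppet and its server lists come from elsewhere.

```python
#!/usr/bin/env python3
"""Illustrative sketch: run per-server warmup URLs against both clusters."""

import urllib.request

# Hypothetical per-server URLs: something light that still warms opcache/APCu.
PER_SERVER_URLS = [
    "/w/load.php?modules=startup&only=scripts",
]

# Hypothetical host lists; in production these would come from service discovery.
CLUSTERS = {
    "appserver": ["mw1414.example.internal", "mw1415.example.internal"],
    "api_appserver": ["mw1312.example.internal", "mw1313.example.internal"],
}


def warm_host(host: str, path: str) -> None:
    # Hit the backend directly but send a canonical Host header so MediaWiki
    # routes the request to a real wiki.
    req = urllib.request.Request(
        f"http://{host}{path}", headers={"Host": "en.wikipedia.org"}
    )
    urllib.request.urlopen(req, timeout=10).read()


if __name__ == "__main__":
    for cluster, hosts in CLUSTERS.items():
        for host in hosts:
            for path in PER_SERVER_URLS:
                warm_host(host, path)
```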
[23:25:13] legoktm: yeah, LGTM
[23:25:22] also hits a fair amount of Language stuff I believe
[23:26:46] https://gerrit.wikimedia.org/g/mediawiki/core/+/2ec406ecc8fcae175a7afe4f2dab5b9f5d44cd70/includes/api/ApiQuerySiteinfo.php#740 looks like someone optimized it, so it doesn't hit Language anymore
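For reference, a minimal sketch of the siteinfo-style warmup request discussed at 23:12-23:13: an api.php query that touches no specific title but still exercises namespaces, special page aliases, magic words, languages, and extension tags. The wiki list is a placeholder; per the discussion above, the real warmup runs against all large wikis.

```python
#!/usr/bin/env python3
"""Illustrative sketch: a siteinfo api.php request as a per-server warmup URL."""

from urllib.parse import urlencode
import urllib.request

SIPROP = "namespaces|specialpagealiases|magicwords|languages|extensiontags"
# Placeholder wiki list; the real warmup covers all large wikis.
WIKIS = ["en.wikipedia.org", "de.wikipedia.org", "commons.wikimedia.org"]


def siteinfo_url(wiki: str) -> str:
    # Standard MediaWiki Action API siteinfo query; no specific title involved.
    query = urlencode({
        "action": "query",
        "meta": "siteinfo",
        "siprop": SIPROP,
        "format": "json",
    })
    return f"https://{wiki}/w/api.php?{query}"


if __name__ == "__main__":
    for wiki in WIKIS:
        urllib.request.urlopen(siteinfo_url(wiki), timeout=10).read()
```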