[07:03:08] <moritzm>	 headsup/reminder: cumin1001 will be rebooted in an hour
[07:37:05] <vgutierrez>	 _joe_: are you deploying the new purged? should I take care of that?
[07:48:11] <_joe_>	 vgutierrez: if you do it will be much appreciated :)
[07:48:17] <vgutierrez>	 will do
[07:48:26] <_joe_>	 also I'm not available next week for any fallout heh
[07:52:21] <XioNoX>	 godog, any idea what happened to LibreNMS? https://librenms.wikimedia.org/ we're about to start router upgrades and that's a hard blocker for us
[07:58:40] <XioNoX>	 jbond: is it related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/809095 ?
[08:03:46] <XioNoX>	 I'm getting this error message on the front page, direct links works, but it works for other people https://usercontent.irccloud-cdn.com/file/XcGOVKs1/Screenshot%202022-09-08%20at%2010-02-59%20https%20__librenms.wikimedia.org.png
[08:10:27] <moritzm>	 looking at the puppet logs on netmon1002 I don't see any direct fallout from the realm default change
[08:12:55] <XioNoX>	 moritzm: it's netmon1003 now
[08:13:09] <XioNoX>	 but it's weird that the issue is only for me :)
[08:13:19] <moritzm>	 ah :-)
[08:14:12] <moritzm>	 maybe you have a stale session, did you try to log out?
[08:16:11] <_joe_>	 no there is a problem
[08:16:14] <_joe_>	 drwxrwxr-x 27 root            librenms        4096 Aug 24 04:11 de2bd0369fc46effba4a4ca9ebafc95b40b1af22
[08:16:30] <_joe_>	 in /srv/deployment/librenms/librenms-cache/revs
[08:18:24] <godog>	 XioNoX: checking
[08:18:51] <godog>	 yeah works for me, odd
[08:21:24] <godog>	 XioNoX: still doesn't work for you ?
[08:25:12] <XioNoX>	 godog: yeah, same
[08:25:23] <XioNoX>	 I'm doing the router upgrade so will be slow to reply here
[08:27:06] <godog>	 ack, will keep investigating
[08:29:55] <godog>	 XioNoX: any better now ?
[08:30:08] <XioNoX>	 godog: yep!
[08:30:10] <XioNoX>	 thanks!
[08:30:25] <godog>	 sure! not sure what happened yet
[08:36:10] <godog>	 I think it might be fallout from I78e824b40
[08:39:46] <godog>	 filed as https://phabricator.wikimedia.org/T317286
[08:40:02] <godog>	 denisse|m: ^ if you have time/bandwidth to look into this cc jbond 
[08:41:39] <godog>	 ok I think it is broken again :(
[08:41:48] <godog>	 jbond: thoughts?
[08:45:27] <godog>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/830789 if anyone can take a look / +1 that'd be appreciated
[08:55:35] * jbond reading backlog
[08:57:22] <godog>	 thank you, the patch did bandaid the problem and I've updated the task
[08:58:46] <vgutierrez>	 hmm I'm trying to build purged on build2001 and it's failing to find the build dependency prometheus-rdkafka-exporter (>= 0.2), but according to reprepro it's there: buster-wikimedia|main|amd64: prometheus-rdkafka-exporter 0.2. I'm building it with WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster GIT_PBUILDER_AUTOCONF=o 
[09:00:03] <vgutierrez>	 hmmm even with DIST=buster the base image for the build is bullseye
[09:00:31] <vgutierrez>	 do I need to get prometheus-rdkafka-exporter in our bullseye apt repo?
[09:01:37] <volans>	 vgutierrez: are you using gbp buildpackage?
[09:02:00] <vgutierrez>	 volans: yep, complete CLI: WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster GIT_PBUILDER_AUTOCONF=no gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder
[09:02:20] <volans>	 then you need to replace --git-builder=git-pbuilder with --git-pbuilder --git-dist=${DISTRO} AFAIK
[09:02:44] <volans>	 I also set --git-no-pbuilder-autoconf --git-color=on --git-arch=amd64 FYI
[09:02:54] <volans>	 but YMMV
[09:03:57] <volans>	 (you can check also ~/spicerack-release for example in my home)
[09:05:14] <volans>	 vgutierrez ^^^
[09:05:26] <vgutierrez>	 thx
[09:06:30] <vgutierrez>	 it was failing for me.. but yeah distro must be buster-wikimedia and not buster
[09:09:00] <volans>	 in the changelog
[09:09:09] <volans>	 in the CLI params I use buster
[09:10:10] <XioNoX>	 godog: any idea why it was not working just for me? :)
[09:11:16] <godog>	 XioNoX: bad luck probably, not 100% sure
[09:26:26] <vgutierrez>	 _joe_, ori: purged 0.18 looks good in cp4026 & cp4032, varnish querysorting seems to be a NOOP for PURGE requests after the update
[09:28:01] <godog>	 jbond: I don't understand why you removed the recurse in https://gerrit.wikimedia.org/r/c/operations/puppet/+/830792 ? that broke things agian
[09:28:21] <godog>	 i.e. the sessions in  /srv/deployment/librenms/librenms-cache/revs/de2bd0369fc46effba4a4ca9ebafc95b40b1af22/storage/framework/sessions/ are now owned by deploy-librenms
[09:29:25] <godog>	 sorry but I'm a little frustrated now
[09:30:47] <elukey>	 jbond: let's wait for +1s to proceed for librenms, it seems that we are not in sync with what to do :)
[09:30:53] <moritzm>	 there's a typo in that patch "deploy-librenm" which should have a training s, that might be the sole issue?
[09:31:13] <moritzm>	 trailing
[09:31:23] <godog>	 moritzm: no, the issue is the sessions directory being owned by deploy-librenms and not www-data
[09:31:40] <godog>	 I'm going to add the recurse back in
[09:31:58] <claime>	 Or setuid/setgid?
[09:32:53] <godog>	 that won't help, puppet (for reasons I don't understand yet) wants to recursively chown to deploy-librenms
[09:34:46] <godog>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/830795
[09:37:04] <godog>	 fixed, I'm stepping afk for a little while
[09:39:02] <jbond>	 godog: elukey: see update on the phab task https://phabricator.wikimedia.org/T317286#8220348
[09:39:40] <jbond>	 godog: as per the comment those files should not be world readable
[09:39:57] <jbond>	 so im going to revert again
[09:41:34] <jbond>	 sorry should be world readable
[09:42:16] * jbond is checking blame history to see why that coimment is there
[09:46:06] <godog>	 jbond: I'm not trying to be dense but the files are not world readable
[09:46:25] <godog>	 this is the directory /srv/deployment/librenms/librenms-cache/revs/de2bd0369fc46effba4a4ca9ebafc95b40b1af22/storage/framework/sessions/
[09:47:12] <jbond>	 godog: librenms/apache write the files as 0644, if we have recurse true oit means that puppet preformes a change every tme the session is written which means that librenms often shows up in the puppet makes a change on every ru report https://gerrit.wikimedia.org/r/c/operations/puppet/+/573268
[09:47:58] <godog>	 jbond: ack, got it, thank you
[09:48:05] <godog>	 seems all good now
[09:48:42] <jbond>	 godog: there could be a follow up action to make librenms write them as 06600 which is arguably better but i havn;t explored that so not sure how smple it would be
[09:49:30] <godog>	 yeah that could work too
[09:50:01] <XioNoX>	 I think one issue is that the command line (and cront) and the UI  use different users
[09:50:08] <XioNoX>	 cron*
[09:51:16] <godog>	 that too, though the cli shouldn't touch sessions afaik
[09:51:43] <XioNoX>	 indeed
[09:52:03] <jbond>	 fyi i created https://phabricator.wikimedia.org/T317292
[09:52:22] <godog>	 thank you
[09:52:44] <jbond>	 thanks and sorry for the lack of cordination, still missing coffee this morning :)
[09:59:41] <godog>	 hehhe it happens, especially without fuel, known in some cirles as coffee
[10:00:00] <XioNoX>	 the esams/knams routers upgrade is finished. We didn't upgrade one of the routers because a firmware upgrade needs JTAC, but we will be repooling esams soon-ish
[11:33:40] <moritzm>	 FYI, I'm disabling Puppet on the 72 servers with the new H750 for about 15 mins
[14:32:10] <ori>	 vgutierrez: excellent, thank you
[15:05:48] <Krinkle>	 I'm reviewing August's incidents to summarise impact and more generally to keep up with what's going on. 
[15:05:53] <Krinkle>	 https://docs.google.com/document/d/1ywpgDhSnoRjYTNWUzlcqRBwu98dAbEwqd8ZyyH2o5Nk/edit#heading=h.vg6rb6x2eccy
[15:06:09] <Krinkle>	 This was the 2022-08-01 incident with restbase after an enwiki template issue.
[15:06:58] <Krinkle>	 If I recall correctly, we've seen this kind of explosion a number of times now where MW and the refreshlinks jobqueue propagation is fine, but restbase/changeprop way of propagation is seemingly way more aggressive in a way that creates these load issuses.
[15:07:19] <Krinkle>	 Do others sense that as well or is it just that we notice the restbase one earlier or stronger for other reasons?
[15:07:44] <_joe_>	 the source is different than in the past, and the fallout too
[15:07:49] <_joe_>	 let me review the doc quickly
[15:07:55] <_joe_>	 ah the CS1 thing
[15:08:12] <_joe_>	 no, that was a one-off special and has nothing to do with changeprop
[15:08:35] <_joe_>	 every CS1 transclusion included an empty <math> tag IIRC
[15:08:49] <_joe_>	 so we got 80k rps to restbase for a mathoid url
[15:09:10] <_joe_>	 because of the amplification effect of CS1 being called hundreds of times per page
[15:10:24] <_joe_>	 the historical problems with restbase were that some pages with a lot of lua code would fan out even 100s of calls to the mediawiki api 
[15:10:27] <_joe_>	 overloading it
[15:15:05] <Krinkle>	 ah, this is appserver calls *to* restbase, and then the inevitable call (kind of back) to api_appserver
[15:15:25] * Krinkle notices that jobrunner are not part of RED dash and would like that very much
[15:15:38] <Krinkle>	 I recall filing a task for that at some point
[15:15:56] <Krinkle>	 ref T293943
[15:15:57] <stashbot>	 T293943: Enable mediawiki appserver metrics for jobrunner hosts - https://phabricator.wikimedia.org/T293943
[15:17:03] <Krinkle>	 ok, I see is there as well https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?from=1659344018862&orgId=1&to=1659391358955&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-site=All&var-method=GET&var-code=200&var-php_version=All&folder=current
[15:17:05] <Krinkle>	 makes sense
[15:17:08] <Krinkle>	 thanks
[15:33:00] <_joe_>	 !incidents
[15:33:01] <sirenbot>	 No incidents occurred in the past 24 hours for team SRE
[15:33:03] <_joe_>	 sigh
[15:33:25] <_joe_>	 godog: the etcdmirror alert on icinga used to page, the one on alertmanager should to
[15:37:05] <_joe_>	 we just caused a near disaster because of that isn't paging
[15:38:33] <godog>	 _joe_: the alert does page, though the threshold wasn't breached for long enough
[15:38:46] <_joe_>	 godog: uh?
[15:38:54] <_joe_>	 it sent an alert to irc
[15:38:56] <_joe_>	 jinxer-wm>	(JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:39:04] <_joe_>	 what is the threshold?
[15:39:10] <_joe_>	 it should be 1 minute really
[15:39:18] <godog>	 > 50 for 15m
[15:39:25] <godog>	 https://gerrit.wikimedia.org/r/c/operations/alerts/+/810918/1/team-sre/etcd.yaml this guy
[15:39:48] <godog>	 jobunavailable isn't the paging alert for etcdmirror lagging
[15:41:32] <_joe_>	 yeah ok no
[15:41:40] <_joe_>	 I'll amend
[15:42:04] <_joe_>	 that measures lag
[15:42:14] <_joe_>	 not if the daemon is working at all
[15:42:27] <akosiaris>	 that one did alert, just 4m ago
[15:42:28] <_joe_>	 what I mean is, we shoud page if the daemon is down
[15:42:36] <akosiaris>	 which is pretty late and anyway couldn't help us in this case
[15:42:38] <_joe_>	 akosiaris: yeah when we restarted it :)
[15:42:52] <_joe_>	 the point is, etcdmirror is designed to crash if soemthing is iffy
[15:42:57] <_joe_>	 so that a human can take a look
[15:43:16] <_joe_>	 the lag alert is for the cases where it's overwhelmed
[15:43:24] <_joe_>	 so that's what's missing right now
[15:43:32] <_joe_>	 the lag alert makes sense as codified there
[15:44:15] <_joe_>	 godog: would up() on the prometheus job do the trick?
[15:44:54] <godog>	 _joe_: yeah up{job="etcdmirror"} should work indeed, https://thanos.wikimedia.org/graph?g0.expr=up%7Bjob%3D%22etcdmirror%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[15:44:59] <_joe_>	 yeah
[15:45:05] <_joe_>	 ok, will send a patch shortly
[15:45:18] <_joe_>	 this was quite the scare, heh
[15:47:02] <godog>	 would sth bad have happened on its own with etcdmirror not running ?
[15:47:23] <godog>	 I'm asking because if yes then we're relying on a single daemon on a single host
[15:51:30] <_joe_>	 it is extremely stable
[15:51:42] <_joe_>	 but if it breaks it needs tending like replica to a db replica
[15:51:45] <_joe_>	 which pages
[15:51:59] <_joe_>	 basically we can't deploy code when it's down, heh
[15:54:36] <godog>	 ack, got it
[16:35:21] <andrewbogott>	 moritzm: I am seeing a whole lot of RAID alerts starting about three hours ago (e.g. 'WARNING: unexpectedly checked no devices').  Seems like you merged some patches around then
[16:35:37] <andrewbogott>	 Are  you around to help me understand what's happening?  If not I can just ack and wait for tomorrow.
[16:40:32] <andrewbogott>	 T317344
[16:40:33] <stashbot>	 T317344: New RAID alerts (e.g. WARNING: unexpectedly checked no devices) - https://phabricator.wikimedia.org/T317344
[17:16:01] <moritzm>	 andrewbogott: I think those are caused by inconsistencies in their setup exposed by the now more precise RAID detection, e.g. cloudvirt1017 has mdadm setup, but not mdX devices configured
[17:16:25] <moritzm>	 I can have a closer look tomorrow
[17:16:31] <andrewbogott>	 ok, thanks
[17:16:40] <andrewbogott>	 Most of those hosts were reimaged somewhat recently.
[17:30:26] <Krinkle>	 I created a placeholder for https://wikitech.wikimedia.org/wiki/Incidents/2022-08-24_swift . More details would be welcome, e.g. from https://docs.google.com/document/d/1tS9dB_pQK3PV2MDpjuAYMJiOmgQJRmfr32fONphTkaQ/edit# cc godog moritzm vgutierrez 
[18:26:04] <vgutierrez>	 Will do. Thanks Krinkle