[07:03:08] heads-up/reminder: cumin1001 will be rebooted in an hour
[07:37:05] _joe_: are you deploying the new purged? should I take care of that?
[07:48:11] <_joe_> vgutierrez: if you do it will be much appreciated :)
[07:48:17] will do
[07:48:26] <_joe_> also I'm not available next week for any fallout heh
[07:52:21] godog, any idea what happened to LibreNMS? https://librenms.wikimedia.org/ we're about to start router upgrades and that's a hard blocker for us
[07:58:40] jbond: is it related to https://gerrit.wikimedia.org/r/c/operations/puppet/+/809095 ?
[08:03:46] I'm getting this error message on the front page, direct links work, but it works for other people https://usercontent.irccloud-cdn.com/file/XcGOVKs1/Screenshot%202022-09-08%20at%2010-02-59%20https%20__librenms.wikimedia.org.png
[08:10:27] looking at the puppet logs on netmon1002 I don't see any direct fallout from the realm default change
[08:12:55] moritzm: it's netmon1003 now
[08:13:09] but it's weird that the issue is only for me :)
[08:13:19] ah :-)
[08:14:12] maybe you have a stale session, did you try to log out?
[08:16:11] <_joe_> no there is a problem
[08:16:14] <_joe_> drwxrwxr-x 27 root librenms 4096 Aug 24 04:11 de2bd0369fc46effba4a4ca9ebafc95b40b1af22
[08:16:30] <_joe_> in /srv/deployment/librenms/librenms-cache/revs
[08:18:24] XioNoX: checking
[08:18:51] yeah works for me, odd
[08:21:24] XioNoX: still doesn't work for you ?
[08:25:12] godog: yeah, same
[08:25:23] I'm doing the router upgrade so will be slow to reply here
[08:27:06] ack, will keep investigating
[08:29:55] XioNoX: any better now ?
[08:30:08] godog: yep!
[08:30:10] thanks!
[08:30:25] sure! not sure what happened yet
[08:36:10] I think it might be fallout from I78e824b40
[08:39:46] filed as https://phabricator.wikimedia.org/T317286
[08:40:02] denisse|m: ^ if you have time/bandwidth to look into this cc jbond
[08:41:39] ok I think it is broken again :(
[08:41:48] jbond: thoughts?
[08:45:27] https://gerrit.wikimedia.org/r/c/operations/puppet/+/830789 if anyone can take a look / +1 that'd be appreciated
[08:55:35] * jbond reading backlog
[08:57:22] thank you, the patch did bandaid the problem and I've updated the task
[08:58:46] hmm I'm trying to build purged on build2001 and it's failing to find the build dependency prometheus-rdkafka-exporter (>= 0.2), but according to reprepro it's there: buster-wikimedia|main|amd64: prometheus-rdkafka-exporter 0.2. I'm building it with WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster GIT_PBUILDER_AUTOCONF=o
[09:00:03] hmmm even with DIST=buster the base image for the build is bullseye
[09:00:31] do I need to get prometheus-rdkafka-exporter in our bullseye apt repo?
[09:01:37] vgutierrez: are you using gbp buildpackage?
[09:02:00] volans: yep, complete CLI: WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster GIT_PBUILDER_AUTOCONF=no gbp buildpackage -jauto -us -uc -sa --git-builder=git-pbuilder
[09:02:20] then you need to replace --git-builder=git-pbuilder with --git-pbuilder --git-dist=${DISTRO} AFAIK
[09:02:44] I also set --git-no-pbuilder-autoconf --git-color=on --git-arch=amd64 FYI
[09:02:54] but YMMV
[09:03:57] (you can check also ~/spicerack-release for example in my home)
[09:05:14] vgutierrez ^^^
[09:05:26] thx
[09:06:30] it was failing for me.. but yeah distro must be buster-wikimedia and not buster
[09:09:00] in the changelog
[09:09:09] in the CLI params I use buster
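(For reference, a rough consolidation of the build advice above. This is a sketch reconstructed from the messages, not a command copied from the log, and the flags may need adjusting for the local pbuilder/cowbuilder setup.)

    # Sketch only: volans' suggestion applied to vgutierrez's original command line.
    # --git-builder=git-pbuilder is replaced by --git-pbuilder/--git-dist; the CLI
    # uses plain "buster" while debian/changelog carries "buster-wikimedia".
    WIKIMEDIA=yes BACKPORTS=yes ARCH=amd64 DIST=buster \
      gbp buildpackage -jauto -us -uc -sa \
        --git-pbuilder --git-dist=buster --git-arch=amd64 \
        --git-no-pbuilder-autoconf --git-color=on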
[09:10:10] godog: any idea why it was not working just for me? :)
[09:11:16] XioNoX: bad luck probably, not 100% sure
[09:26:26] _joe_, ori: purged 0.18 looks good in cp4026 & cp4032, varnish querysorting seems to be a NOOP for PURGE requests after the update
[09:28:01] jbond: I don't understand why you removed the recurse in https://gerrit.wikimedia.org/r/c/operations/puppet/+/830792 ? that broke things again
[09:28:21] i.e. the sessions in /srv/deployment/librenms/librenms-cache/revs/de2bd0369fc46effba4a4ca9ebafc95b40b1af22/storage/framework/sessions/ are now owned by deploy-librenms
[09:29:25] sorry but I'm a little frustrated now
[09:30:47] jbond: let's wait for +1s to proceed for librenms, it seems that we are not in sync with what to do :)
[09:30:53] there's a typo in that patch "deploy-librenm" which should have a training s, that might be the sole issue?
[09:31:13] trailing
[09:31:23] moritzm: no, the issue is the sessions directory being owned by deploy-librenms and not www-data
[09:31:40] I'm going to add the recurse back in
[09:31:58] Or setuid/setgid?
[09:32:53] that won't help, puppet (for reasons I don't understand yet) wants to recursively chown to deploy-librenms
[09:34:46] https://gerrit.wikimedia.org/r/c/operations/puppet/+/830795
[09:37:04] fixed, I'm stepping afk for a little while
[09:39:02] godog: elukey: see update on the phab task https://phabricator.wikimedia.org/T317286#8220348
[09:39:40] godog: as per the comment those files should not be world readable
[09:39:57] so I'm going to revert again
[09:41:34] sorry should be world readable
[09:42:16] * jbond is checking blame history to see why that comment is there
[09:46:06] jbond: I'm not trying to be dense but the files are not world readable
[09:46:25] this is the directory /srv/deployment/librenms/librenms-cache/revs/de2bd0369fc46effba4a4ca9ebafc95b40b1af22/storage/framework/sessions/
[09:47:12] godog: librenms/apache write the files as 0644; if we have recurse true it means that puppet performs a change every time the session is written, which means that librenms often shows up in the "puppet makes a change on every run" report https://gerrit.wikimedia.org/r/c/operations/puppet/+/573268
[09:47:58] jbond: ack, got it, thank you
[09:48:05] seems all good now
[09:48:42] godog: there could be a follow-up action to make librenms write them as 0600, which is arguably better, but I haven't explored that so not sure how simple it would be
[09:49:30] yeah that could work too
[09:50:01] I think one issue is that the command line (and cront) and the UI use different users
[09:50:08] cron*
[09:51:16] that too, though the cli shouldn't touch sessions afaik
[09:51:43] indeed
[09:52:03] fyi I created https://phabricator.wikimedia.org/T317292
[09:52:22] thank you
[09:52:44] thanks and sorry for the lack of coordination, still missing coffee this morning :)
[09:59:41] hehe it happens, especially without fuel, known in some circles as coffee
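(To illustrate the recurse trade-off jbond describes above: the snippet below is a hypothetical sketch, not the actual LibreNMS manifest in operations/puppet, and the owner/group values are placeholders.)

    # Hypothetical sketch. With recurse => true, Puppet re-asserts owner/group on
    # everything below the directory, including the 0644 session files that the
    # LibreNMS web UI keeps writing, so Puppet ends up reporting a change on every
    # run, which is the behaviour jbond describes above.
    file { '/srv/deployment/librenms/librenms-cache/revs':
        ensure  => directory,
        owner   => 'deploy-librenms',  # placeholder
        group   => 'librenms',         # placeholder
        recurse => true,               # re-applies ownership to runtime-written files each run
    }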
[10:00:00] the esams/knams routers upgrade is finished. We didn't upgrade one of the routers because a firmware upgrade needs JTAC, but we will be repooling esams soon-ish
[11:33:40] FYI, I'm disabling Puppet on the 72 servers with the new H750 for about 15 mins
[14:32:10] vgutierrez: excellent, thank you
[15:05:48] I'm reviewing August's incidents to summarise impact and more generally to keep up with what's going on.
[15:05:53] https://docs.google.com/document/d/1ywpgDhSnoRjYTNWUzlcqRBwu98dAbEwqd8ZyyH2o5Nk/edit#heading=h.vg6rb6x2eccy
[15:06:09] This was the 2022-08-01 incident with restbase after an enwiki template issue.
If I recall correctly, we've seen this kind of explosion a number of times now where MW and the refreshlinks jobqueue propagation is fine, but the restbase/changeprop way of propagation is seemingly way more aggressive, in a way that creates these load issues.
[15:07:19] Do others sense that as well or is it just that we notice the restbase one earlier or stronger for other reasons?
[15:07:44] <_joe_> the source is different than in the past, and the fallout too
[15:07:49] <_joe_> let me review the doc quickly
[15:07:55] <_joe_> ah the CS1 thing
[15:08:12] <_joe_> no, that was a one-off special and has nothing to do with changeprop
[15:08:35] <_joe_> every CS1 transclusion included an empty tag IIRC
[15:08:49] <_joe_> so we got 80k rps to restbase for a mathoid url
[15:09:10] <_joe_> because of the amplification effect of CS1 being called hundreds of times per page
[15:10:24] <_joe_> the historical problems with restbase were that some pages with a lot of lua code would fan out even 100s of calls to the mediawiki api
[15:10:27] <_joe_> overloading it
[15:15:05] ah, this is appserver calls *to* restbase, and then the inevitable call (kind of back) to api_appserver
[15:15:25] * Krinkle notices that jobrunners are not part of the RED dash and would like that very much
[15:15:38] I recall filing a task for that at some point
[15:15:56] ref T293943
[15:15:57] T293943: Enable mediawiki appserver metrics for jobrunner hosts - https://phabricator.wikimedia.org/T293943
[15:17:03] ok, I see it is there as well https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?from=1659344018862&orgId=1&to=1659391358955&var-cluster=appserver&var-datasource=eqiad%20prometheus%2Fops&var-site=All&var-method=GET&var-code=200&var-php_version=All&folder=current
[15:17:05] makes sense
[15:17:08] thanks
[15:33:00] <_joe_> !incidents
[15:33:01] No incidents occurred in the past 24 hours for team SRE
[15:33:03] <_joe_> sigh
[15:33:25] <_joe_> godog: the etcdmirror alert on icinga used to page, the one on alertmanager should too
[15:37:05] <_joe_> we just caused a near disaster because that isn't paging
[15:38:33] _joe_: the alert does page, though the threshold wasn't breached for long enough
[15:38:46] <_joe_> godog: uh?
[15:38:54] <_joe_> it sent an alert to irc
[15:38:56] <_joe_> jinxer-wm> (JobUnavailable) firing: Reduced availability for job etcdmirror in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:39:04] <_joe_> what is the threshold?
<_joe_> it should be 1 minute really
[15:39:18] > 50 for 15m
[15:39:25] https://gerrit.wikimedia.org/r/c/operations/alerts/+/810918/1/team-sre/etcd.yaml this guy
[15:39:48] jobunavailable isn't the paging alert for etcdmirror lagging
[15:41:32] <_joe_> yeah ok no
[15:41:40] <_joe_> I'll amend
[15:42:04] <_joe_> that measures lag
[15:42:14] <_joe_> not if the daemon is working at all
[15:42:27] that one did alert, just 4m ago
[15:42:28] <_joe_> what I mean is, we should page if the daemon is down
[15:42:36] which is pretty late and anyway couldn't help us in this case
[15:42:38] <_joe_> akosiaris: yeah when we restarted it :)
[15:42:52] <_joe_> the point is, etcdmirror is designed to crash if something is iffy
[15:42:57] <_joe_> so that a human can take a look
[15:43:16] <_joe_> the lag alert is for the cases where it's overwhelmed
[15:43:24] <_joe_> so that's what's missing right now
[15:43:32] <_joe_> the lag alert makes sense as codified there
[15:44:15] <_joe_> godog: would up() on the prometheus job do the trick?
[15:44:54] _joe_: yeah up{job="etcdmirror"} should work indeed, https://thanos.wikimedia.org/graph?g0.expr=up%7Bjob%3D%22etcdmirror%22%7D&g0.tab=0&g0.stacked=0&g0.range_input=1h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
[15:44:59] <_joe_> yeah
[15:45:05] <_joe_> ok, will send a patch shortly
[15:45:18] <_joe_> this was quite the scare, heh
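(For context on the follow-up patch _joe_ mentions above: a minimal sketch of an up()-based paging rule, assuming the generic Prometheus alerting-rule format; the actual rule in operations/alerts and its label/annotation conventions may differ.)

    # Sketch only, not the actual patch:
    - alert: EtcdmirrorDown
      expr: up{job="etcdmirror"} == 0
      for: 1m                      # _joe_ suggests paging within about a minute
      labels:
        severity: page             # assumed label convention
      annotations:
        summary: "etcdmirror target down on {{ $labels.instance }}"
        description: "etcdmirror is designed to crash when something is wrong; a human needs to take a look."

The existing lag-based rule quoted above (> 50 for 15m) would stay as-is for the overload case; this rule only covers the daemon being down entirely.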
[15:47:02] would sth bad have happened on its own with etcdmirror not running ?
[15:47:23] I'm asking because if yes then we're relying on a single daemon on a single host
[15:51:30] <_joe_> it is extremely stable
[15:51:42] <_joe_> but if it breaks it needs tending like replica to a db replica
[15:51:45] <_joe_> which pages
[15:51:59] <_joe_> basically we can't deploy code when it's down, heh
[15:54:36] ack, got it
[16:35:21] moritzm: I am seeing a whole lot of RAID alerts starting about three hours ago (e.g. 'WARNING: unexpectedly checked no devices'). Seems like you merged some patches around then
[16:35:37] Are you around to help me understand what's happening? If not I can just ack and wait for tomorrow.
[16:40:32] T317344
[16:40:33] T317344: New RAID alerts (e.g. WARNING: unexpectedly checked no devices) - https://phabricator.wikimedia.org/T317344
[17:16:01] andrewbogott: I think those are caused by inconsistencies in their setup exposed by the now more precise RAID detection, e.g. cloudvirt1017 has mdadm set up, but no mdX devices configured
[17:16:25] I can have a closer look tomorrow
[17:16:31] ok, thanks
[17:16:40] Most of those hosts were reimaged somewhat recently.
[17:30:26] I created a placeholder for https://wikitech.wikimedia.org/wiki/Incidents/2022-08-24_swift . More details would be welcome, e.g. from https://docs.google.com/document/d/1tS9dB_pQK3PV2MDpjuAYMJiOmgQJRmfr32fONphTkaQ/edit# cc godog moritzm vgutierrez
[18:26:04] Will do. Thanks Krinkle