[13:02:22] Hello SRE team, we've been seeing some odd behavior in several of our job queues in the past days, and were wondering if this is being looked at or if you have any advice:
[13:02:22] 1) `wikibase-InjectRCRecords` is exhibiting some odd congestion pattern in the backlog time chart, which causes some alerts to fire on our end: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=wikibase-InjectRCRecords
[13:02:22] 2) `refreshLinks`'s backlog time seems to be climbing consistently since the 25th: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=refreshLinks&var-dc=eqiad%20prometheus%2Fk8s
[13:02:48] any pointers are appreciated, even if you just say this is the right address for this topic :D
[13:02:58] Thank you in advance!
[13:26:05] godog: want me to merge "Filippo Giunchedi: prometheus: remove per-exporter up checks (959635d865)" ?
[13:27:07] andrewbogott: oops! yes please, thank you
[13:27:21] done
[13:27:26] cheers
[14:19:14] itamarWMDE: I don't have an answer, but I notice from the 30-day view that the job run duration avg and p99 jumped by an order of magnitude around the 23rd: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=wikibase-InjectRCRecords&from=now-30d&to=now
[14:19:51] <_joe_> ugh yes
[14:42:45] there are two nodes still depooled from jobrunner (because of the video scaling issue last weekend), but I guess those two should not have that much of an impact, no?
[14:44:39] it's just one, sorry
[14:55:23] <_joe_> jayme: yeah I doubt it has
[14:55:43] <_joe_> something started making those jobs very slow on the 23rd
[14:56:25] <_joe_> for refreshLinks, the slowdown happened today
[14:56:28] <_joe_> sorry, yesterday
[14:56:53] yeah, at least that's when the linear increase started
[14:57:18] <_joe_> sorry, I have a meeting
[15:25:25] <_joe_> heads up, I'm switching to the new requestctl-generated VCL model, see T305606
[15:25:25] T305606: Make the VCL that goes to production from requestctl safer/more explicit to apply - https://phabricator.wikimedia.org/T305606
[15:25:45] <_joe_> so I'm disabling puppet across the edge fleet for a bit
[16:20:46] 18:15:37 + - name: STORAGE_URI
[16:20:54] nope, wrong channel, sorry :D
[16:33:52] good evening, may I have a puppet merge for a Gerrit config change please? It explicitly sets a config option that will change as part of the Gerrit upgrade from 3.3 to 3.4: https://gerrit.wikimedia.org/r/c/operations/puppet/+/786984
[16:33:57] which I have tested locally :]
[16:38:39] hashar: sure, you want me to merge it in?
[16:38:55] yes please :)
[16:39:09] and run the puppet-merge on the puppet master. I will run puppet on the affected hosts
[16:39:33] hashar: done
[16:39:53] running puppet :]
[16:40:28] <_joe_> everyone, we have released a new version of requestctl which makes it safer to deploy requestctl changes. For now you can see https://wikitech.wikimedia.org/w/index.php?title=Requestctl&type=revision&diff=1974673&oldid=1968652 (or the page itself :P) for a summary of what changed
[16:40:33] jhathaway: looks good. Thank you very much ;)
[16:40:37] <_joe_> I'll send a more detailed email tomorrow morning
[16:40:42] hashar: of course
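For context, the merge-and-apply flow hashar and jhathaway coordinate above usually boils down to two steps; this is a rough sketch, and the `run-puppet-agent` wrapper name is an assumption about local tooling rather than a command taken from this log (`puppet agent --test` is the generic equivalent):

```
# Step 1, on the puppet master: merge the reviewed change into the deployed
# branch (the "puppet-merge" step mentioned above).
sudo puppet-merge

# Step 2, on each affected host (here, the Gerrit server): apply the change.
sudo run-puppet-agent    # or the stock equivalent: sudo puppet agent --test
```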
[19:42:33] Anyone have any insight into this failing cron job on ms-fe1009? `. /etc/swift/account_AUTH_netbox.env && /usr/local/bin/swift-account-stats --prefix swift.eqiad-prod.stats.AUTH_netbox`
[19:42:56] I took a brief look, but I am not sure exactly what it is trying to do, or why it is getting auth errors now
[19:51:54] jhathaway: this seems like an occurrence of a previous ticket: https://phabricator.wikimedia.org/T159437
[19:52:41] mutante: ah interesting, I should have done some phabricator searches, thanks
[19:52:58] jhathaway: I feel like we should just reopen that one
[19:53:32] also, that's an actual cron, not a timer
[19:53:37] so we want to convert that to a timer
[19:53:53] though that ticket doesn't say anything about permissions
[19:55:10] Account HEAD failed: http://ms-fe.svc.eqiad.wmnet/v1/AUTH_search 401 Unauthorized
[19:55:11] there is more than one "account status" job, is this the one for netbox?
[19:55:19] account-stats
[19:55:20] yeah
[19:55:41] what is it trying to do? I don't really understand the intent
[19:56:55] I ran it manually and it seemed to work
[19:57:15] first I sourced /etc/swift/account_AUTH_netbox.env
[19:57:27] then /usr/local/bin/swift-account-stats --prefix swift.eqiad-prod.stats.AUTH_netbox --statsd-host localhost --statsd-port 9125
[19:57:37] it got stats back
[19:58:05] it's getting statistics: number of bytes, objects and containers
[19:59:07] and what are the backend nodes behind the LVS service?
[19:59:38] maybe just one of them is returning the auth error?
[19:59:45] I don't see how it's doing anything with these numbers though.
[20:00:01] it was on ms-fe1009
[20:00:12] the same one that the email comes from
[20:00:41] but it queries https://ms-fe.svc.eqiad.wmnet/auth/v1.0, right?
[20:00:49] the list of backends is https://config-master.wikimedia.org/pybal/eqiad/swift
[20:01:00] or https://config-master.wikimedia.org/pybal/eqiad/swift-https
[20:01:02] thanks
[20:01:30] yes, export ST_AUTH=https://ms-fe.svc.eqiad.wmnet/auth/v1.0
[20:02:18] so does that hit the local node or the LVS service?
[20:05:34] I _think_ the local node, because all the backends have that IP bound on the loopback interface
[20:06:05] $ ip ro get 10.2.2.27
[20:06:07] local 10.2.2.27 dev lo src 10.2.2.27
[20:06:09] cache
[20:06:11] seems to agree
[20:08:11] how about converting it to a timer "while at it"? then we can "systemctl start" it a couple of times and see if it happens again
[20:09:09] as opposed to manually running it as root. I would think it fails to include the config file with the access key, but then why does it only happen for one job and one host.. dunno
[20:09:38] be back in a bit, cooking
[20:36:18] jhathaway: The cron ran and there was no new email.
[20:36:28] I .. restarted cron itself before
[20:37:33] and all of these jobs try to run at the exact same moment every minute
[20:47:00] I wonder if it is a problem with the proxy service
[20:48:44] I see normal 204 entries in the proxy log for successful requests, but for the 401 I don't see a log entry
[20:52:42] mutante: do you know how to depool the node so we can try restarting the proxy?
[20:55:02] jhathaway: and .. we got another one. I think it's just hammering it a bit too often
[20:56:31] jhathaway: yes, I depooled it. ([cumin2002:~] $ sudo -i confctl select dc=eqiad,name=ms-fe1009.eqiad.wmnet set/pooled=no)
[20:56:59] mutante: great
[20:57:26] mutante: shall I restart the proxy?
[20:57:39] jhathaway: yea
[20:58:11] okay, restarted
[20:58:59] shall we repool?
[20:59:38] done. repooled
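A rough sketch of what the cron-to-timer conversion proposed above could look like. The unit names and the once-a-minute schedule are assumptions (the real puppetized units may differ); the command mirrors the one run manually in the log:

```
# /etc/systemd/system/swift-account-stats-netbox.service  (hypothetical name)
[Unit]
Description=Report Swift account statistics for AUTH_netbox to statsd

[Service]
Type=oneshot
# Mirror the cron job: source the credentials file, then run the stats reporter.
ExecStart=/bin/sh -c '. /etc/swift/account_AUTH_netbox.env && /usr/local/bin/swift-account-stats --prefix swift.eqiad-prod.stats.AUTH_netbox --statsd-host localhost --statsd-port 9125'

# /etc/systemd/system/swift-account-stats-netbox.timer  (hypothetical name)
[Unit]
Description=Run swift-account-stats for AUTH_netbox every minute

[Timer]
OnCalendar=minutely
# Add some jitter, since the log notes all of these jobs fire at the
# exact same moment every minute.
RandomizedDelaySec=30

[Install]
WantedBy=timers.target
```

With units like these in place, `systemctl start swift-account-stats-netbox.service` would give the on-demand test runs mentioned above, and failures would show up in the journal rather than as cron mail.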
[20:59:43] thanks
[21:00:57] I can't reproduce it yet, but it could be coincidental
[21:01:19] I could never reproduce it when manually running the commands as root
[21:01:30] but let's see if the mail stops
[21:01:59] I was able to reproduce it every 8 or so runs, when just running it via my shell
[21:02:04] maybe those jobs should use more random start times
[21:02:08] gotcha
[21:03:05] mutante: thanks for the help, time will tell whether we were successful!
[21:04:29] yea, like 10 minutes. yw
[21:04:42] thanks for looking at cron spam
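Since the 401 only shows up intermittently (roughly one run in eight, per the above), one way to chase it is to loop the same commands that were run manually and count failures. This is an illustrative sketch using the paths and flags from the log, and it assumes swift-account-stats exits non-zero when the account HEAD request fails:

```
# Source the same credentials the cron job uses, then run the stats call
# repeatedly and report which iterations fail.
. /etc/swift/account_AUTH_netbox.env
for i in $(seq 1 20); do
  if ! /usr/local/bin/swift-account-stats \
        --prefix swift.eqiad-prod.stats.AUTH_netbox \
        --statsd-host localhost --statsd-port 9125 >/dev/null 2>&1; then
    echo "run $i failed"
  fi
  sleep 2
done
```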