[07:19:05] <_joe_> !incidents
[07:19:05] 3596 (UNACKED) HaproxyUnavailable cache_upload global sre ()
[07:19:14] <_joe_> !ack 3596
[07:19:15] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:26:50] !incidents
[07:26:50] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:26:51] 3597 (UNACKED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[07:26:55] !ack 3597
[07:26:56] 3597 (ACKED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[07:32:21] !inci
[07:32:23] !incidents
[07:32:24] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:32:24] 3598 (UNACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[07:32:24] 3597 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[07:32:30] !ack 3598
[07:32:30] 3598 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[07:35:53] !incidents
[07:35:54] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:35:54] 3598 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[07:35:54] 3597 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[09:16:00] XioNoX: could you help with getting reviewers for https://gerrit.wikimedia.org/r/c/operations/dns/+/914751 ?
[10:56:55] Amir1: are you around? Can you help me fix a failed config deploy?
[10:57:15] It aborted because gitlab is down. Now I'm trying to find out how to revert cleanly
[10:57:50] effie: here's the gerrit revert, want to +2? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/917153
[10:59:14] duesen: I need to go to a meeting right now but it's fine if you leave it as is.
[10:59:37] Our deployment shouldn't be tied to gitlab being up tbh, that's a high risk during outages
[10:59:57] that is the part that is kind of surprising
[11:00:18] Amir1: you mean that we can leave it as is and deploy as soon as gitlab is back?
[11:00:25] yeah
[11:01:18] ok cheers
[11:01:20] thank you
[11:02:17] Amir1: ok, I'll go and have lunch, and try again in a bit.
[11:03:51] duesen: gitlab is back already
[11:04:06] !log config deployment failed because gitlab is down. Prod is out of sync with gerrit, and deploy1002 is in sync with gerrit. Will come back to this in an hour.
[11:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:36] but yeah, gitlab being down shouldn't block deployments, especially if we're doing multi-hour gitlab outages during normal working hours :/
[11:18:50] effie: ok, gitlab is back, and I'm back. ready to try again?
[11:19:00] sure go ahead please
[11:20:57] scap running
[11:23:12] I filed a task to investigate what went on
[11:23:32] cool, thanks
[11:23:35] syncing to prod now
[11:23:49] what graphs should I be looking at?
[11:26:48] I can't find a dashboard that shows cpu load on job runners
[11:28:05] I'm seeing a slight upward trend on https://grafana.wikimedia.org/goto/uf0MdgU4k?orgId=1
[11:29:39] that is the one to start with yes
[11:31:47] duesen: also this https://logstash.wikimedia.org/goto/6e9c68bcbd9e06a44b8c377363ee106d
[11:31:54] it is so far so good
[11:33:49] effie: i'd love to have a way to verify that the jobs are actually running.
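One quick way to sanity-check that a given job type is actually flowing, besides watching a dashboard, is to query the Prometheus HTTP API for its per-second rate. The snippet below is only a minimal sketch of that idea: the Prometheus endpoint and the metric name are illustrative assumptions, not the actual names used in this setup.

```python
# Minimal sketch: query a Prometheus HTTP API for a per-second job rate.
# The endpoint URL and the metric name are assumptions for illustration only;
# substitute the real Prometheus host and whatever metric the jobqueue exports.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # hypothetical endpoint
QUERY = 'sum(irate(example_jobqueue_inserts_total{job_type="parsoidCachePrewarm"}[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    # Prometheus returns [timestamp, value] pairs; the value is a string.
    print(f"current rate: {float(result[0]['value'][1]):.2f} jobs/s")
else:
    print("no samples returned for that query")
```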
[11:34:23] Unfortunately, they are effectively redundant to the updates triggered by restbase right now, so it's impossible to tell from the outside.
[11:35:06] Do you have a way to see how many of a certain kind of job get scheduled/executed?
[11:35:32] for the time being I think having a look at raw logs is a start
[11:36:01] will get back to you on that
[11:36:40] Oh no, got another error:
[11:36:43] https://www.irccloud.com/pastebin/CZa8i5Et/
[11:37:12] duesen: I think that comes because you had failures on some hosts that were down earlier?
[11:37:46] I don't know? Possibly?
[11:38:01] It looks like it completed successfully, I'm not seeing any failures skimming the output
[11:38:04] but the exit code is 1
[11:38:29] full output?
[11:38:37] we didn't have any hosts down earlier to our knowledge
[11:38:55] effie: ssh: connect to host mw2448.codfw.wmnet port 22: Connection timed out
[11:39:52] that is odd, duesen didn't see any errors
[11:41:13] full output here: https://phabricator.wikimedia.org/P47852
[11:41:17] sigh
[11:41:32] same error yep
[11:42:06] marostegui: ah, right, i missed that when skimming the output, sorry
[11:42:21] 11:26:22 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2259.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw2289.codfw.wmnet', 'mw1404.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1398.eqiad.wmnet', 'mw1366.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw2300.codfw.wmnet'] (ran as mwdeploy@mw2448.codfw.wmnet) returned [255]: ssh: connect to host mw2448.codfw.wmnet port 22: Connection timed out
[11:42:32] right
[11:42:36] so, what now?
[11:42:48] get an sre to mark that server inactive
[11:43:36] marostegui: can you mark mw2448 inactive?
[11:45:11] effie: ^
[11:46:48] yeah we will do so no prob
[11:48:10] apparently this server was supposed to be up, but something is dying
[11:48:31] ok. so. it looks like nothing is exploding. but i can't see whether the jobs actually run...
[11:48:36] duesen: the deployment is fine fyi. You don't need to redeploy or anything. (whoever fixes it will run scap pull to sync before repooling).
[11:49:00] RhinosF1: I was hoping that, thanks for confirming!
[13:15:43] _joe_: when introducing a new job class, do I need to touch the cpjobqueue config to make it work?
[13:16:16] I was under the impression that we had a catch-all config that would handle any new kind of job...
[13:16:27] ...but it's not working, I must be doing something wrong :)
[13:16:38] Not working for ParsoidCachePrewarmJob, that is
[13:17:07] aka 'parsoidCachePrewarm'
[13:17:21] <_joe_> duesen: I think we do have a catch-all config, but I'd need to check
[13:17:36] <_joe_> also this job is high traffic enough that we'll need a separate configuration anyways
[13:17:46] <_joe_> right now I'm going to lunch though
[13:17:58] _joe_: ok, no rush.
[13:18:22] actually, I just realized that it *does* appear to work. I was looking for the class name instead of the job name...
[13:18:34] <_joe_> duesen: ok
[13:18:43] <_joe_> I assumed as much tbh (that it was working)
[13:19:12] <_joe_> but I'd suggest we do the same configuration we did for htmlCacheUpdate and such high-volume jobs
[13:19:15] the current rate seems very low. But it's only enabled on small wikis for now
[13:19:32] <_joe_> duesen: btw, will refreshlinks also trigger a parsoid reparse?
[13:20:18] effie: I am seeing mediawiki.job.parsoidCachePrewarm show up every now and then... and then it vanishes again. It seems like the rate is just extremely low?
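For the "look at raw logs" approach mentioned earlier, a throwaway script along these lines can give a rough per-hour count of one job type. The log path and line format (one job entry per line, prefixed with an ISO timestamp) are assumptions for illustration, not the real log layout.

```python
# Minimal sketch: count occurrences of one job type in a plain-text job log,
# bucketed by hour. Path and timestamp format are hypothetical.
import sys
from collections import Counter

JOB_TYPE = "parsoidCachePrewarm"
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "jobqueue.log"  # hypothetical path

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if JOB_TYPE in line:
            # Assumes lines start with an ISO timestamp like "2023-05-04T11:..."
            counts[line[:13]] += 1

for hour, n in sorted(counts.items()):
    print(f"{hour}: {n} {JOB_TYPE} jobs")
```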
[13:20:29] Can it be *that* low? We should see one job for every edit...
[13:21:27] https://w.wiki/6gDJ
[13:21:29] _joe_: no, not directly. refreshlinks only invalidates. we trigger a job when a page is viewed that isn't in the main parser cache.
[13:22:09] We assume that if there is a main parser cache miss (probably due to invalidation), and we parse on the fly, we should also trigger a parsoid parse.
[13:22:10] duesen ^ does that match your expectations?
[13:22:35] there is a catch-all config for mediawiki.job.*, automatically caught by the low traffic rule
[13:22:49] effie: what's the number? jobs per second?
[13:22:59] hnowlan: thanks!
[13:23:00] yes, it is the rate
[13:23:33] effie: if it's per second, then ~1.5 sounds about right for the sum of all small wikis. If it's per minute, it's too low...
[13:23:48] <_joe_> it's per second
[13:24:18] <_joe_> irate() in prometheus gives you the rate per second over an interval in the timeseries
[13:24:43] ok, nice. That's the confirmation I needed, then. Sorry for the confusion, I was looking for the wrong job name...
[13:24:46] Thanks all!
[13:25:24] cool!
[14:27:47] effie, _joe_: jobrunner load seems fairly high according to https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=54&from=now-7d&to=now
[14:28:20] If this is the effect of adding one render per second, I'm a bit worried. All wikis combined would be 20x that, at least...
[14:28:36] (even without wikidata and commons)
[14:28:56] duesen: we have options, like pulling servers off the parsoid cluster and putting them in the jobrunner one
[14:29:56] we have 22 machines per DC, but let's see for now how it goes
[14:31:05] <_joe_> duesen: let's wait a couple days to be sure that it's the parsoid job causing that
[14:31:17] <_joe_> I highly doubt that's the case with 1 rps
[14:31:27] <_joe_> unless each single request takes 10 minutes
[14:34:54] not according to https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=now-3h&to=now&viewPanel=3
[14:47:15] <_joe_> yes, so it can't be the cause of that
[15:52:20] <_joe_> duesen: at what time did you deploy the warmup job?
[16:02:00] _joe_: https://phabricator.wikimedia.org/T329366#8833110
[16:03:20] <_joe_> RhinosF1: yeah I figured it out in the meantime; we had a spike in the latency of jobs but that was one hour later
[16:03:41] Cool
[17:14:05] _joe_: ftr: 9:20 utc
[17:14:50] <_joe_> duesen: yeah much earlier than the latency spike in jobs
[17:24:56] stevemunene: FYI there is an sre.hosts.reimage running on cumin1001 for an-worker1132 since Apr 13th, probably waiting for user input.
[17:26:29] checking it out, thanks volans
[17:26:46] yw
[17:27:24] inflatador: FYI there is a sre.hosts.decommission for an-airflow1001 on cumin1001 with a DNS diff pending user input
[17:30:37] volans ACK, taking a look
[17:30:52] thx
[19:17:38] got a cookbook that errors with the following msg: "Uptime for (host) higher than threshold". Any idea what that's about? The host in question hasn't been rebooted in 139 days
[19:18:44] Which cookbook?
[19:19:06] sre.hosts.reboot-single
[19:24:41] Comes from https://github.com/wikimedia/operations-software-spicerack/blob/6dd9661919463fc9d420e8cd6ee3dd6b4afb56c9/spicerack/remote.py#L558
[19:25:20] inflatador: it should keep checking after that error I think
[19:25:26] That's what @retry is for
[19:25:39] It's basically making sure the reboot actually happened
[19:26:34] That error would mean the host is still online and didn't actually reboot
[19:28:42] inflatador: RhinosF1 is correct, that means the cookbook tried to reboot the host but didn't succeed on the first try -- it doesn't mean the uptime itself is the problem, that's just to test whether the host actually rebooted
[19:30:02] rzl: it seems to try every 10 seconds, 240 times
[19:30:07] So that's 40 minutes?
[19:30:31] My guess would be in case a process takes a long time to nicely die
[19:31:08] inflatador: I see from logs that you were running the cookbook in dry-run mode, was that intended? that's the expected output in that case
[19:31:47] since it doesn't actually reboot the host in a dry run, the uptime is unaffected, so the uptime check fails -- that's also why the retry behavior is disabled in that case, no point in doing it more than once
[19:33:37] rzl: good catch. PEBKAC all the way
[19:34:08] the output could be a little clearer though :) I'll send a patch later today
[19:35:05] Thanks, LMK if you need a review
[19:35:40] RhinosF1: yeah, it's sort of a catch-all -- slow-terminating process, or the reboot command itself fell into a black hole during a network blip, or any number of other transient things
[19:36:04] that behavior is more intended for the case where you're rolling-restarting a whole bunch of hosts, it's common for at least one of them to need a retry, but you shouldn't have to manually babysit them
[19:36:37] Fair
[19:36:58] I always find the coming back after reboots the scary part
[19:37:18] I never thought of the reboot command itself not working
[19:37:49] definitely not the most common failure mode, but you do anything enough times, and even the weird stuff comes up
[19:38:34] I know that feeling
[19:39:15] "anything that's not expressly forbidden is mandatory"
[19:39:23] Horizon simply is a failure at work, it requires you to make a drink when rebooting while it cycles through expected errors
[19:39:40] inflatador: spoken perfectly!
[19:39:58] Tech does everything, normally at demos
[19:40:16] I blew a motherboard on the way to my last demo
[19:40:47] :o
[19:42:15] inflatador: if you're going to transport servers, don't stick a cabinet on wheels in the back of a van
[19:42:28] Probably wrap them in shock-absorbent stuff
[19:43:06] Especially down windy national speed limit roads
[19:43:42] We blew the wing mirror too that day
[19:43:45] It was fun
[19:44:36] Wow! Well, I guess I should expect that given your racing background ;)
[19:45:00] Wasn't me driving
[19:45:34] All credit to my boss
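For reference, the retry-based check discussed around 19:24-19:31 can be sketched roughly as below. This is not the spicerack implementation, just an illustration of the pattern under stated assumptions: after issuing a reboot, keep polling the host's uptime and treat "uptime higher than the time elapsed since the reboot was requested" as "the host has not actually rebooted yet". The host name, retry count, and delay are illustrative only.

```python
# Rough sketch of a "did the host actually reboot?" check, not the real cookbook code.
# Assumes passwordless ssh to the host and that /proc/uptime is readable there.
import subprocess
import time

def remote_uptime(host: str) -> float:
    """Return the host's uptime in seconds by reading /proc/uptime over ssh."""
    out = subprocess.run(
        ["ssh", host, "cat", "/proc/uptime"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return float(out.stdout.split()[0])

def wait_for_reboot(host: str, reboot_issued_at: float,
                    tries: int = 240, delay: float = 10.0) -> None:
    """Poll until the host's uptime drops below the time since the reboot was issued."""
    for attempt in range(1, tries + 1):
        try:
            threshold = time.time() - reboot_issued_at
            if remote_uptime(host) < threshold:
                print(f"{host} rebooted (confirmed on attempt {attempt})")
                return
            print(f"Uptime for {host} higher than threshold, retrying...")
        except subprocess.SubprocessError:
            # Host unreachable: likely still shutting down or coming back up.
            pass
        time.sleep(delay)
    raise RuntimeError(f"{host} did not reboot within {tries * delay:.0f} seconds")
```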