[07:19:05] <_joe_> !incidents
[07:19:05] 3596 (UNACKED) HaproxyUnavailable cache_upload global sre ()
[07:19:14] <_joe_> !ack 3596
[07:19:15] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:26:50] !incidents
[07:26:50] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:26:51] 3597 (UNACKED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[07:26:55] !ack 3597
[07:26:56] 3597 (ACKED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[07:32:21] !inci
[07:32:23] !incidents
[07:32:24] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:32:24] 3598 (UNACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[07:32:24] 3597 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[07:32:30] !ack 3598
[07:32:30] 3598 (ACKED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[07:35:53] !incidents
[07:35:54] 3596 (ACKED) HaproxyUnavailable cache_upload global sre ()
[07:35:54] 3598 (RESOLVED) ProbeDown sre (103.102.166.240 ip4 upload-https:443 probes/service http_upload-https_ip4 eqsin)
[07:35:54] 3597 (RESOLVED) ProbeDown sre (2001:df2:e500:ed1a::2:b ip6 upload-https:443 probes/service http_upload-https_ip6 eqsin)
[09:16:00] XioNoX: could you help with getting reviewers for https://gerrit.wikimedia.org/r/c/operations/dns/+/914751 ?
[10:56:55] Amir1: are you around? Can you help me fix a failed config deploy?
[10:57:15] It aborted because gitlab is down. Now I'm trying to find out how to revert cleanly
[10:57:50] effie: here's the gerrit revert, want to +2? https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/917153
[10:59:14] duesen: I need to go to a meeting right now but it's fine if you leave it as is.
[10:59:37] Our deployment shouldn't be tied to gitlab being up tbh, that's a high risk during outages
[10:59:57] that is the part that is kind of surprising
[11:00:18] Amir1: you mean that we can leave it as is and deploy as soon as gitlab is back?
[11:00:25] yeah
[11:01:18] ok cheers
[11:01:20] thank you
[11:02:17] Amir1: ok, I'll go and have lunch, and try again in a bit.
[11:03:51] duesen: gitlab is back already
[11:04:06] !log config deployment failed because gitlab is down. Prod is out of sync with gerrit, and deploy1002 is in sync with gerrit. Will come back to this in an hour.
[11:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:36] but yeah, gitlab being down shouldn't block deployments, especially if we're doing multi-hour gitlab outages during normal working hours :/
[11:18:50] effie: ok, gitlab is back, and I'm back. ready to try again?
[11:19:00] sure go ahead please
[11:20:57] scap running
[11:23:12] I filed a task to investigate what went on
[11:23:32] cool, thanks
[11:23:35] syncing to prod now
[11:23:49] what graphs should I be looking at?
[11:26:48] I can't find a dashboard that shows cpu load on job runners
[11:28:05] I'm seeing a slight upward trend on https://grafana.wikimedia.org/goto/uf0MdgU4k?orgId=1
[11:29:39] that is the one to start with yes
[11:31:47] duesen: also this https://logstash.wikimedia.org/goto/6e9c68bcbd9e06a44b8c377363ee106d
[11:31:54] it is so far so good
[11:33:49] effie: i'd love to have a way to verify that the jobs are actually running.
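One quick way to sanity-check that a given job type is actually flowing, besides watching a dashboard, is to query the Prometheus HTTP API for its per-second rate. The snippet below is only a minimal sketch of that idea: the Prometheus endpoint and the metric name are illustrative assumptions, not the actual names used in this setup.

```python
# Minimal sketch: query a Prometheus HTTP API for a per-second job rate.
# The endpoint URL and the metric name are assumptions for illustration only;
# substitute the real Prometheus host and whatever metric the jobqueue exports.
import requests

PROMETHEUS = "http://prometheus.example.org:9090"  # hypothetical endpoint
QUERY = 'sum(irate(example_jobqueue_inserts_total{job_type="parsoidCachePrewarm"}[5m]))'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()
result = resp.json()["data"]["result"]
if result:
    # Prometheus returns [timestamp, value] pairs; the value is a string.
    print(f"current rate: {float(result[0]['value'][1]):.2f} jobs/s")
else:
    print("no samples returned for that query")
```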
[11:34:23] Unfortunately, they are effectively redundant to the updates triggered by restbase right now, so it's impossible to tell from the outside.
[11:35:06] Do you have a way to see how many of a certain kind of job get scheduled/executed?
[11:35:32] for the time being I think having a look at raw logs is a start
[11:36:01] will get back to you on that
[11:36:40] Oh no, got another error:
[11:36:43] https://www.irccloud.com/pastebin/CZa8i5Et/
[11:37:12] duesen: I think that comes because you had failures on some hosts that were down earlier?
[11:37:46] I don't know? Possibly?
[11:38:01] It looks like it completed successfully, I'm not seeing any failures skimming the output
[11:38:04] but the exit code is 1
[11:38:29] full output?
[11:38:37] we didn't have any hosts down earlier to our knowledge
[11:38:55] effie: ssh: connect to host mw2448.codfw.wmnet port 22: Connection timed out
[11:39:52] that is odd, duesen didn't see any errors
[11:41:13] full output here: https://phabricator.wikimedia.org/P47852
[11:41:17] sigh
[11:41:32] same error yep
[11:42:06] marostegui: ah, right, i missed that when skimming the output, sorry
[11:42:21] 11:26:22 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2259.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw2289.codfw.wmnet', 'mw1404.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1398.eqiad.wmnet', 'mw1366.eqiad.wmnet', 'mw1486.eqiad.wmnet', 'deploy1002.eqiad.wmnet', 'mw2300.codfw.wmnet'] (ran as mwdeploy@mw2448.codfw.wmnet) returned [255]: ssh: connect to host mw2448.codfw.wmnet port 22: Connection timed out
[11:42:32] right
[11:42:36] so, what now?
[11:42:48] get an sre to mark that server inactive
[11:43:36] marostegui: can you mark mw2448 inactive?
[11:45:11] effie: ^
[11:46:48] yeah we will do so no prob
[11:48:10] apparently this server was supposed to be up, but something is dying
[11:48:31] ok. so. it looks like nothing is exploding. but i can't see whether the jobs actually run...
[11:48:36] duesen: the deployment is fine fyi. You don't need to redeploy or anything. (whoever fixes it will run scap pull to sync before repooling).
[11:49:00] RhinosF1: I was hoping that, thanks for confirming!
[13:15:43] _joe_: when introducing a new job class, do I need to touch the cpjobqueue config to make it work?
[13:16:16] I was under the impression that we had a catch-all config that would handle any new kind of job...
[13:16:27] ...but it's not working, I must be doing something wrong :)
[13:16:38] Not working for ParsoidCachePrewarmJob, that is
[13:17:07] aka 'parsoidCachePrewarm'
[13:17:21] <_joe_> duesen: I think we do have a catch-all config, but I'd need to check
[13:17:36] <_joe_> also this job is high traffic enough that we'll need a separate configuration anyways
[13:17:46] <_joe_> right now I'm going to lunch though
[13:17:58] _joe_: ok, no rush.
[13:18:22] actually, I just realized that it *does* appear to work. I was looking for the class name instead of the job name...
[13:18:34] <_joe_> duesen: ok
[13:18:43] <_joe_> I assumed as much tbh (that it was working)
[13:19:12] <_joe_> but I'd suggest we do the same configuration we did for htmlCacheUpdate and such high-volume jobs
[13:19:15] the current rate seems very low. But it's only enabled on small wikis for now
[13:19:32] <_joe_> duesen: btw, will refreshlinks also trigger a parsoid reparse?
[13:20:18] effie: I am seeing mediawiki.job.parsoidCachePrewarm show up every now and then... and then it vanishes again. It seems like the rate is just extremely low?
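For the "look at raw logs" approach mentioned earlier, a throwaway script along these lines can give a rough per-hour count of one job type. The log path and line format (one job entry per line, prefixed with an ISO timestamp) are assumptions for illustration, not the real log layout.

```python
# Minimal sketch: count occurrences of one job type in a plain-text job log,
# bucketed by hour. Path and timestamp format are hypothetical.
import sys
from collections import Counter

JOB_TYPE = "parsoidCachePrewarm"
LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "jobqueue.log"  # hypothetical path

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if JOB_TYPE in line:
            # Assumes lines start with an ISO timestamp like "2023-05-04T11:..."
            counts[line[:13]] += 1

for hour, n in sorted(counts.items()):
    print(f"{hour}: {n} {JOB_TYPE} jobs")
```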
[13:20:29] Can it be *that* low? We should see one job for every edit...
[13:21:27] https://w.wiki/6gDJ
[13:21:29] _joe_: no, not directly. refreshlinks only invalidates. we trigger a job when a page is viewed that isn't in the main parser cache.
[13:22:09] We assume that if there is a main parser cache miss (probably due to invalidation), and we parse on the fly, we should also trigger a parsoid parse.
[13:22:10] duesen ^ does that match your expectations?
[13:22:35] there is a catch-all config for mediawiki.job.*, automatically caught by the low traffic rule
[13:22:49] effie: what's the number? jobs per second?
[13:22:59] hnowlan: thanks!
[13:23:00] yes, it is the rate
[13:23:33] effie: if it's per second, then ~1.5 sounds about right for the sum of all small wikis. If it's per minute, it's too low...
[13:23:48] <_joe_> it's per second
[13:24:18] <_joe_> irate() in prometheus gives you the rate per second over an interval in the timeseries
[13:24:43] ok, nice. That's the confirmation I needed, then. Sorry for the confusion, I was looking for the wrong job name...
[13:24:46] Thanks all!
[13:25:24] cool!
[14:27:47] effie, _joe_: jobrunner load seems fairly high according to https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=54&from=now-7d&to=now
[14:28:20] If this is the effect of adding one render per second, I'm a bit worried. All wikis combined would be 20x that, at least...
[14:28:36] (even without wikidata and commons)
[14:28:56] duesen: we have options, like pulling servers off the parsoid cluster and putting them in the jobrunner one
[14:29:56] we have 22 machines per DC, but let's see for now how it goes
[14:31:05] <_joe_> duesen: let's wait a couple days to be sure that it's the parsoid job causing that
[14:31:17] <_joe_> I highly doubt that's the case with 1 rps
[14:31:27] <_joe_> unless each single request takes 10 minutes
[14:34:54] not according to https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&from=now-3h&to=now&viewPanel=3
[14:47:15] <_joe_> yes, so it can't be the cause of that
[15:52:20] <_joe_> duesen: at what time did you deploy the warmup job?
[16:02:00] _joe_: https://phabricator.wikimedia.org/T329366#8833110
[16:03:20] <_joe_> RhinosF1: yeah I figured it out in the meantime; we had a spike in the latency of jobs but that was one hour later
[16:03:41] Cool
[17:14:05] _joe_: ftr: 9:20 utc
[17:14:50] <_joe_> duesen: yeah much earlier than the latency spike in jobs
[17:24:56] stevemunene: FYI there is an sre.hosts.reimage running on cumin1001 for an-worker1132 since Apr 13th, probably waiting for user input.
[17:26:29] checking it out, thanks volans
[17:26:46] yw
[17:27:24] inflatador: FYI there is a sre.hosts.decommission for an-airflow1001 on cumin1001 with a DNS diff pending user input
[17:30:37] volans ACK, taking a look
[17:30:52] thx
[19:17:38] got a cookbook that errors with the following msg: "Uptime for (host) higher than threshold". Any idea what that's about? The host in question hasn't been rebooted in 139 days
[19:18:44] Which cookbook?
[19:19:06] sre.hosts.reboot-single
[19:24:41] Comes from https://github.com/wikimedia/operations-software-spicerack/blob/6dd9661919463fc9d420e8cd6ee3dd6b4afb56c9/spicerack/remote.py#L558
[19:25:20] inflatador: it should keep checking after that error I think
[19:25:26] That's what @retry is for
[19:25:39] It's basically making sure the reboot actually happened
[19:26:34] That error would mean the host is still online and didn't actually reboot
[19:28:42] inflatador: RhinosF1 is correct, that means the cookbook tried to reboot the host but didn't succeed on the first try -- it doesn't mean the uptime itself is the problem, that's just to test whether the host actually rebooted
[19:30:02] rzl: it seems to try every 10 seconds, 240 times
[19:30:07] So that's 40 minutes?
[19:30:31] My guess would be in case a process takes a long time to nicely die
[19:31:08] inflatador: I see from logs that you were running the cookbook in dry-run mode, was that intended? that's the expected output in that case
[19:31:47] since it doesn't actually reboot the host in a dry run, the uptime is unaffected, so the uptime check fails -- that's also why the retry behavior is disabled in that case, no point in doing it more than once
[19:33:37] rzl: good catch. PEBKAC all the way
[19:34:08] the output could be a little clearer though :) I'll send a patch later today
[19:35:05] Thanks, LMK if you need a review
[19:35:40] RhinosF1: yeah, it's sort of a catch-all -- slow-terminating process, or the reboot command itself fell into a black hole during a network blip, or any number of other transient things
[19:36:04] that behavior is more intended for the case where you're rolling-restarting a whole bunch of hosts, it's common for at least one of them to need a retry, but you shouldn't have to manually babysit them
[19:36:37] Fair
[19:36:58] I always find the coming back after reboots the scary part
[19:37:18] I never thought of the reboot command itself not working
[19:37:49] definitely not the most common failure mode, but you do anything enough times, and even the weird stuff comes up
[19:38:34] I know that feeling
[19:39:15] "anything that's not expressly forbidden is mandatory"
[19:39:23] Horizon simply is a failure at work, it requires you to make a drink when rebooting while it cycles through expected errors
[19:39:40] inflatador: spoken perfectly!
[19:39:58] Tech does everything, normally at demos
[19:40:16] I blew a motherboard on the way to my last demo
[19:40:47] :o
[19:42:15] inflatador: if you're going to transport servers, don't stick a cabinet on wheels in the back of a van
[19:42:28] Probably wrap them in shock-absorbent stuff
[19:43:06] Especially down windy national speed limit roads
[19:43:42] We blew the wing mirror too that day
[19:43:45] It was fun
[19:44:36] Wow! Well, I guess I should expect that given your racing background ;)
[19:45:00] Wasn't me driving
[19:45:34] All credit to my boss
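For reference, the retry-based check discussed around 19:24-19:31 can be sketched roughly as below. This is not the spicerack implementation, just an illustration of the pattern under stated assumptions: after issuing a reboot, keep polling the host's uptime and treat "uptime higher than the time elapsed since the reboot was requested" as "the host has not actually rebooted yet". The host name, retry count, and delay are illustrative only.

```python
# Rough sketch of a "did the host actually reboot?" check, not the real cookbook code.
# Assumes passwordless ssh to the host and that /proc/uptime is readable there.
import subprocess
import time

def remote_uptime(host: str) -> float:
    """Return the host's uptime in seconds by reading /proc/uptime over ssh."""
    out = subprocess.run(
        ["ssh", host, "cat", "/proc/uptime"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return float(out.stdout.split()[0])

def wait_for_reboot(host: str, reboot_issued_at: float,
                    tries: int = 240, delay: float = 10.0) -> None:
    """Poll until the host's uptime drops below the time since the reboot was issued."""
    for attempt in range(1, tries + 1):
        try:
            threshold = time.time() - reboot_issued_at
            if remote_uptime(host) < threshold:
                print(f"{host} rebooted (confirmed on attempt {attempt})")
                return
            print(f"Uptime for {host} higher than threshold, retrying...")
        except subprocess.SubprocessError:
            # Host unreachable: likely still shutting down or coming back up.
            pass
        time.sleep(delay)
    raise RuntimeError(f"{host} did not reboot within {tries * delay:.0f} seconds")
```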