[00:41:45] meh, cirrus-integ02 ran out of disk space, broke a bunch of stuff inside the vagrant container. Was trying to fix it but too many different threads to pull on...will probably try and re-provision tomorrow
[00:41:56] (it's the host running cindy integration testing for es710)
[08:42:18] Got my search platform world tour t-shirt today! Thanks to Trey for all the tie-dyeing and to gehel for shipping it. I loved it! https://usercontent.irccloud-cdn.com/file/kBfKKkvl/IMG20220816143557.jpg
[13:09:52] greetings
[15:03:27] \o
[15:25:58] workout, back in ~40
[15:41:09] FWiW, I'm having issues acking that ES masters alert in icinga in case someone asks. The master downtime is expected
[16:11:50] * ebernhardson wonders why the eqiad mjolnir daemon doesn't seem to be processing the standard search updates (but it finished the prioritized queue)
[16:16:06] inflatador: the alert was already recovered when you sent that msg i'm pretty sure
[16:17:28] ryankemper yeah, it looks that way in icinga, although I never saw a recovery msg in operations. Maybe there isn't one
[16:17:41] inflatador: > RECOVERY - ElasticSearch numbers of masters eligible - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[16:17:58] (or maybe I'm just oblivious) ;P
[16:18:02] at :35 mins
[16:18:06] :P
[16:28:53] meh, the problem with eqiad kafka mjolnir daemon is the offset was reset since it was broken for a week, i thought those were supposed to last more than a week now
[16:31:49] * ebernhardson looks into how to set it back to the value seen in grafana/prometheus
[20:31:05] I'm also not convinced that this is a factor in whatever issues you might be experiencing, my guess is that it's been broken for a while
[20:31:26] happy to provision a new cert though if you like, it will just take some time
[20:31:49] inflatador: I'm not sure re. pointing it at another host, "probably" is the answer, but no idea how. Beta is something I keep a half-eye on, nothing more :/
[20:32:55] TheresNoTime: it'll be a config patch
[20:32:56] i would suspect the same, i suspect you can filter all log messages from 'host:deploy-mwmaint02', those are not responses to web requests but are instead maintenance scripts run on a timer that don't directly affect responses
[20:33:10] TheresNoTime OK, I wasn't sure of the urgency here
[20:33:49] inflatador: https://github.com/wikimedia/operations-mediawiki-config/blob/master/wmf-config/LabsServices.php#L19
[20:34:32] RhinosF1 Ah, good catch
[20:34:40] the certificate on 09 is for 'deployment-elastic09.deployment-prep.eqiad.wmflabs'
[20:35:19] inflatador: that could be why
[20:35:26] If no cert on the new domain
[20:35:48] inflatador: it's been down hard for a few hours, so only urgent for the people who care :D /j
[20:36:02] I added that config a few months back: https://github.com/wikimedia/operations-mediawiki-config/commit/9051405855a7d2fdf9823ffc8e869e487ddb8693
[20:36:26] so again, I'm a bit suspicious that it is a factor now. But the cert does need to be replaced, I'll start a task for that
[20:39:46] I'll be honest, I have to agree, it shouldn't cause a full outage
[20:40:35] yeah don't take my word for it i.r.t. elasticsearch having anything to do with the outage, I'm clutching at straws; definitely don't commit any serious time to looking at this y'all :)
[20:41:13] TheresNoTime no worries at all, I'll create a task and fix the cert when I have time (probably by EoW). Good luck on the rest of it
[20:41:26] inflatador: if it's late for you, no one's going to cry if we look at why beta has exploded in the morning
[20:41:40] TheresNoTime: ^ to you too.
[20:41:53] RhinosF1 Oh yeah, that's fine...but we do need to fix that cert eventually. Thanks for pointing it out
[20:41:58] Beta is just timing out altogether for me
[20:42:17] So I'm gonna blame the text-cache server as guess #1
[20:42:54] Also confused why 'curl -k' didn't let me connect and instead gave the SSL routines error
[20:43:42] I pinged the train conductors
[20:43:51] So they can go with caution
[21:21:51] school run, back in ~45m
[22:05:14] back
[22:12:12] meh, of course. The branch in cirrus is es710, so the branch in vendor has to be es710. But for some reason i created es7...
[22:12:39] * ebernhardson wonders if he can delete branches in gerrit, or only create them :P
[22:13:19] yup, only create :P I guess not the end of the world: [remote rejected] es7 (prohibited by Gerrit: not permitted: delete)