[04:17:09] In 45 minutes we're going to failover s2 master
[06:51:41] good morning
[06:52:04] There are multiple transport links down at the same time :(
[06:52:25] 1) cr1 eqiad <-> codfw (Telia)
[06:52:51] elukey: thx, looking
[06:53:21] 2) the recurrent cr2 codfw <-> esams
[06:53:39] XioNoX: <3 lemme know if you need help, IIRC we have a task for 2) right?
[06:54:00] for Telia there seems to be maintenance
[06:54:02] planned telia work, + lumen outage
[06:55:07] I see now in the emails, again a Lumen outage
[06:55:08] * elukey sigh
[06:55:14] XioNoX: thanks for checking
[07:20:49] also looks like both links between codfw and eqiad went down overnight: https://librenms.wikimedia.org/bill/bill_id=24/ everything routed through eqord as it should, but it's not a pleasant situation
[07:20:58] /cc topranks ^
[07:21:40] <_joe_> wdqs has had a full disk for more than a day
[07:21:59] <_joe_> ema: ^^ cc gehel ryankemper
[07:22:08] <_joe_> wdqs2003 sorry
[07:24:30] _joe_: looking, thanks
[07:24:42] elukey: an-launcher1002 is also all red in icinga
[07:25:02] (and has been for the past 7 days)
[07:30:09] ema: ahem I am not in analytics :P
[07:30:34] jokes aside, there is a task for the alarms IIRC, lemme find it
[07:31:13] https://phabricator.wikimedia.org/T287989
[07:31:16] ah that is fixed
[07:31:17] mmmm
[07:31:30] elukey: I'm pinging the first person I see in /usr/bin/last :)
[07:31:54] there are two other errors related to Druid, those are related to work that Ben is doing, I already posted some notes in the task
[07:32:05] (adding new nodes / decomming etc..)
[07:37:26] I'm on vacation. I pinged dcausse and zpapierski.
[07:55:59] _joe_: I need to depool wdqs2003 until ryankemper can do a data reset
[07:56:10] but I seem to lack sufficient permissions
[07:56:21] <_joe_> zpapierski: yeah you need root for that, lemme do that
[07:56:28] great, thx
[07:56:43] <_joe_> done
[07:56:52] thx again
[07:58:06] ryankemper, zpapierski: I've created https://phabricator.wikimedia.org/T288501 to track the wdqs2003 issue
[07:58:18] thx, ema!
[08:38:03] hi, is someone up to date with ongoing network impacts? I remember an advisory from XioNoX a few days ago about limiting non-essential bandwidth?
[08:38:33] I am asking if I should temporarily disable cross-dc backups, which will happen tonight
[08:39:28] jynus: it's fine now
[08:39:36] thanks, XioNoX
[08:39:56] (better safe than sorry in this case!)
[08:40:23] for sure! thx for checking
[08:40:37] also we are talking about 150 MBytes/s, nothing too crazy
[08:41:38] and scheduled outside of peak hours
[10:26:03] gotta love how we consider 1gbps like "nothing too crazy" ;P
[10:46:13] vgutierrez, try backing up 12 TB of databases at lower speeds :-)
[10:51:22] at my last place, we used to joke about legacy 40Gb cables ;p
[10:51:31] lol
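For context on the exchange above, a quick back-of-the-envelope sketch using the figures quoted in the log (150 MB/s and 12 TB, both assumed to be decimal units and a sustained rate):

    # Back-of-the-envelope for the cross-DC backup discussion above.
    # Assumes the quoted figures: ~150 MByte/s sustained, ~12 TB of databases.
    rate_gbps = 150 * 8 / 1000          # 150 MByte/s is about 1.2 Gbit/s
    hours = 12_000_000 / 150 / 3600     # 12 TB at 150 MB/s is roughly 22 hours
    print(f"~{rate_gbps:.1f} Gbit/s, ~{hours:.0f} h for a full pass")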
[13:38:30] <_joe_> Amir1: so, did you have a patch for converting the dispatcher?
[13:39:52] _joe_: yup, they are the three connected here https://phabricator.wikimedia.org/T288175
[13:40:03] https://gerrit.wikimedia.org/r/c/710520
[13:40:07] this is the main one
[13:40:25] _joe_: I'm fairly confident that it would work
[13:40:55] (needs https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/710515 to be merged and deployed first though)
[13:46:12] we're also starting work on replacing the whole thing this week or next week, but I'm not sure if all of the work will be done before the switchover
[13:53:43] <_joe_> I mean it's inelegant but it seems it should do what we want
[13:55:49] yeah, the reasoning is that I hope to kill the whole thing with my bare hands in this quarter
[14:15:35] _joe_: let me know when/if it's fine for you to deploy it
[14:15:56] especially the puppet stuff is out of my hands :D
[14:39:15] <_joe_> Amir1: we can do it whenever you want
[14:39:33] I mostly want coffee right now
[14:40:17] jokes aside, I suggest we do it like this; I'll very likely be in the team that gets this fixed this quarter
[14:41:08] <_joe_> ack so, when you've merged the mw-config change, I can proceed and merge the puppet one
[14:42:30] cool. Thanks. I'll deploy it now
[14:43:34] _joe_: the one for testwikidatawiki can be merged regardless
[15:10:54] so it's deployed now, deploy da things whenever you feel comfortable, just have me on hand to monitor dispatching
[15:15:50] <_joe_> ack thanks
[15:36:07] Amir1: I thought systemd timers had as a notable property (differing from cron) that they /don't/ start if one is already running? If I'm reading that wmf-config patch commit msg right, you're saying they do start regardless in this case?
[15:39:00] " in case the unit to activate is already active at the time the timer elapses it is not restarted, but simply left running. There is no concept of spawning new service instances in this case. Due to this, services with RemainAfterExit= set (which stay around continuously even after the service's main process exited) are usually not suitable for activation via repetitive timers, as they will only
[15:39:04] <_joe_> Krinkle: I think the commit message misses a negation
[15:39:06] be activated once, and then stay around forever."
[15:39:21] ok :)
[15:40:49] based on how parser cache jobs spawn with systemd, I also suspect that systemd doesn't actually cancel the timer if one is already running, but rather it buffers/postpones it. Is that true or am I observing something else? E.g. if the previous day job from 01:00 AM is still running the next day and finishes at 6AM, it seems the job that was meant to start at 1AM then immediately starts at 6AM.
[15:40:57] but maybe that was a fluke from something else.
[15:51:16] I think it depends on whether Persistent= is true or false (default: false) and if it's type "OnCalendar" or not. "If true, the time when the service unit was last triggered is stored on disk. When the timer is activated, the service unit is triggered immediately if it would have been triggered at least once during the time when the timer was inactive"
[15:52:16] source: https://www.freedesktop.org/software/systemd/man/systemd.timer.html the "Persistent" and maybe also RemainAfterElapse=
[15:55:55] <_joe_> what mutante said :) and in general the systemd docs are pretty thorough and I go re-read them often
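A minimal sketch of the timer semantics being discussed; the unit name is made up, and the behaviour in the comments restates the systemd.timer documentation quoted above:

    # example.timer (illustrative only; the unit name is made up)
    [Unit]
    Description=Example of a daily OnCalendar timer

    [Timer]
    OnCalendar=*-*-* 01:00:00
    # Persistent defaults to false: runs missed while the timer was inactive
    # (host down, unit stopped) are simply skipped. With Persistent=true the
    # last trigger time is stored on disk and one missed run is fired
    # immediately when the timer is next activated.
    Persistent=true

    [Install]
    WantedBy=timers.target

Either way, if the triggered service is still running when the timer elapses, systemd leaves it running rather than starting a second instance, which is the cron difference Krinkle was asking about.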
[16:04:54] <_joe_> Amir1: doing testwikidata
[16:06:11] <_joe_> I see in the script (before the transition)
[16:06:13] <_joe_> 16:05:52 Wikibase\Repo\Store\Sql\SqlChangeDispatchCoordinator::selectClient: Could not lock any of the candidate client wikis for dispatching
[16:06:20] <_joe_> repeated multiple times
[16:07:38] <_joe_> the timer will fire in 8 minutes, I'll monitor the first run and merge the change for all wikidatas tomorrow if we're happy with the results
[16:20:12] <_joe_> Amir1: the error message is comfortingly the same
[16:20:34] <_joe_> journalctl -u mediawiki_job_wikibase-dispatch-changes-test -f on mwmaint2002 if you want to check
[16:42:01] can anybody update the puppet compiler facts for me please? I can't get that through while on a train :)
[16:43:20] *somebody ...maybe :)
[16:53:46] <_joe_> jayme: doing it
[16:54:04] nice, thanks!
[16:54:27] now I fear that this is not my problem :)
[16:54:34] <_joe_> well I launched it, let's see if it works
[16:56:35] Sorry to trouble you. I'm getting permission errors trying to schedule downtime in Icinga. Do I need to restart the Icinga service on alert1001 after modifying the cgi.cfg?
[16:56:50] I applied this but it's still not working: https://gerrit.wikimedia.org/r/c/operations/puppet/+/711171
[16:57:44] you shouldn't need to restart icinga but you may need to run puppet
[16:59:08] if it still isn't working after puppet runs, something else might be afoot
[16:59:13] Thanks. Yeah, I ran puppet on alert1001. Saw my change go through. I changed my name to the Wikitech username, because I thought that it didn't like my LDAP (lowercase) name.
[17:00:08] oh, there's one other thing I think I half-remember about icinga, where you have to log out and back in for those changes to take effect
[17:00:22] except that, as you can see, there's no "log out" button
[17:00:37] oh wait strike that last, I think I'm remembering that from pre-CAS
[17:00:54] <_joe_> rzl: I wouldn't be surprised if that was still needed
[17:00:56] one sec, looking it up but maybe someone who remembers better can jump in in the meantime :)
[17:01:00] _joe_: nod
[17:01:11] I'm trying to schedule downtime from a cookbook.
[17:01:18] yeah, just the procedure for logging out should be easier now, previously you had to manually delete your cookie
[17:01:23] <_joe_> jayme: sorry I had several dependencies of the script missing on my system
[17:01:26] I'll try the logout/login from the GUI anyway.
[17:01:59] <_joe_> so jayme I think I'll leave it running now
[17:01:59] _joe_: np...I'm happy already for whatever reason. Sorry to have bothered you
[17:02:53] btullis: yeah sorry, I know how it sounds, but Icinga's user/login model isn't quite what you'd expect
[17:03:11] I'm not positive I have all the implications right but I *think* logout/login is what you need
[17:05:41] Cool, thanks. I logged out of IDP and then back into Icinga. Downtime from the GUI now works. Downtime from the cookbook still fails, but maybe that's an issue with the cookbook. I'll ask someone else on my team to run it to see if they can confirm.
[17:07:35] hm, okay -- thinking through some more how that cookbook works, it isn't interacting with your Icinga login at *all*, everything just happens as root
[17:07:46] can you pastebin the output you get?
[17:08:20] https://www.irccloud.com/pastebin/eOW8xmis/
[17:09:26] wuh oh, this definitely isn't your fault and it might be mine, from a recent change to the downtime cookbook
[17:09:30] stand by :)
[17:09:40] *downtime code in spicerack, that is
[17:09:55] Cool. It just failed for razzi: too. :-)
[17:11:21] * razzi waves
[17:17:37] btullis: okay, I haven't found it yet but I'm pretty sure I introduced a shell quote escaping bug in https://gerrit.wikimedia.org/r/705500
[17:19:04] Cool. Thanks.
[17:19:43] I'll file a task for myself to get that fixed properly, but unfortunately v.olans is the only one who can deploy a new version of spicerack and he's out all this week and next, so we'll figure out a workaround
[17:20:37] OK, no worries. We can do a workaround so that we don't need the cookbook for now anyway. :+1
[17:20:46] at a guess -- try removing the apostrophe from the downtime reason (at line 68 in roll-restart-workers.py) and see if that clears it up?
[17:20:51] oh sure, or that too :)
[17:20:57] really sorry for the inconvenience
[17:21:30] Not an issue at all.
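As an illustration of the kind of quoting bug suspected here (not the actual spicerack code; the CLI name and reason string are made up), interpolating a free-form downtime reason into a shell command breaks on an apostrophe unless it is escaped, for example with shlex.quote:

    import shlex

    def build_downtime_command(reason: str) -> str:
        # "downtime-host" is a hypothetical CLI used only for illustration.
        # A naive f"--reason '{reason}'" breaks as soon as the reason contains
        # a single quote; shlex.quote() emits a safely escaped token instead.
        return f"downtime-host --reason {shlex.quote(reason)}"

    print(build_downtime_command("Restarting Hadoop's workers"))
    # downtime-host --reason 'Restarting Hadoop'"'"'s workers'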
[19:43:27] legoktm[m]: thanks, I completely missed how I added subscribers and not projects
[19:43:39] :p it happens
[19:57:06] legoktm: got a minute?
[19:57:16] yep, what's up?
[19:57:28] mind if I PM?
[20:39:37] Is it possible, in Kibana Discover or in a message list panel like most dashes have, to display messages from multiple index patterns? It seems one has to choose between logstash-* or ecs-*
[20:39:48] so a query like "(type:mediawiki OR service.type:scap)" doesn't work.
[20:39:59] "WARN: 0 puppet certs need to be renewed:" <3
[20:40:49] It seems like a UI limitation that can probably be bypassed in some way, since afaik underneath there are many separate indexes already that can have different configuration etc., so for elastic I would think it's no more difficult to also query across those two the same way it searches across many logstash-* indexes already.
[21:09:31] Krinkle: probably not in the Discover app, but adding saved searches to a dashboard appears to work.
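On the Elasticsearch side, a single search request can indeed span several index patterns by listing them comma-separated in the request path; a rough sketch, assuming direct query access to the cluster (the hostname is a placeholder and the field names simply mirror the query discussed above):

    import requests

    ES = "https://logstash-es.example.org:9200"  # placeholder endpoint

    query = {
        "size": 20,
        "query": {
            "bool": {
                "should": [
                    {"term": {"type": "mediawiki"}},      # logstash-* documents
                    {"term": {"service.type": "scap"}},   # ecs-* documents
                ],
                "minimum_should_match": 1,
            }
        },
    }

    # "logstash-*,ecs-*" makes one _search span both index naming schemes.
    resp = requests.get(f"{ES}/logstash-*,ecs-*/_search", json=query, timeout=10)
    resp.raise_for_status()
    for hit in resp.json()["hits"]["hits"]:
        print(hit["_index"], hit["_source"].get("message", ""))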