[00:10:11] mutante: fortunately we've already ripped those out and replaced with systemd timers :D
[00:10:44] ryankemper: yay 👏
[08:29:44] FYI, I need to stop Puppet in codfw for ~ 5 minutes, will start in a few
[08:38:07] and re-enabled
[09:12:09] <_joe_> I'm about to disable the HTTP lvs endpoint for the mediawiki API, leaving only encrypted calls enabled. I'll start from codfw and nothing should really go wrong, but if you see something say something
[09:15:45] <_joe_> vgutierrez: I see in eqiad we have 8 lvs hosts; do you happen to know which ones are currently "active"?
[09:16:19] _joe_: /^lvs10(1[789]|20)\.eqiad\.wmnet$/
[09:16:25] the new ones
[09:17:51] <_joe_> ack
[10:41:57] <_joe_> going to do the same dance, but for appservers
[11:39:02] <_joe_> talking of incident response, is anyone interested in a replay of yesterday's page/small db outage?
[11:39:20] <_joe_> I'm happy to walk through the troubleshooting procedure
[11:39:28] * apergos raises a hand for when you do it
[11:40:52] <_joe_> Emperor / jhathaway / arnoldokoth / inflatador / btullis, since you're the last people to join, explicit ping about ^^
[11:41:22] <_joe_> (I am offering, not asking to attend :P)
[11:45:39] _joe_: yes, thanks, I think that would be helpful, if scheduling works OK :)
[12:06:03] _joe_: Yes please.
[12:06:41] <_joe_> ack :)
[12:08:12] Apologies for the delay in responding. Investigating an ongoing incident (T302777) but it's non-critical right now.
[12:08:13] T302777: The analytics-meta's binlogs are full of airflow sla-related UPDATE statements - https://phabricator.wikimedia.org/T302777
[12:09:36] <_joe_> btullis: hey no problem, half of the people I pinged won't be online until much later
[12:10:05] <_joe_> ouch good luck with that btw
[12:11:21] Thanks. Got a few days head start on it, thanks to elukey.
[12:16:12] _joe_: please include me on the invite if you're running that session. thanks :)
[12:18:06] <_joe_> sure. I think I'll try to schedule it for thursday given tomorrow we have the tech dept meeting around the time both this side of the atlantic and the US are around, but tbh I might even do two versions if there is enough ~ gmt attendance
[12:18:39] <_joe_> I am not a fan of trying to teach stuff to people around their 8th hour of work
[12:18:54] I'd also be interested
[12:42:31] _joe_: +1
[14:01:38] _joe_ for sure, do invite me when you get a chance
[14:03:38] mutante I'm interested in the gitlab stuff too, looks like I'm already logged in w/wikitech creds. Exciting
[14:45:36] _joe_: I would be interested as well
[15:13:04] <_joe_> ok I'll probably set up two timeslots
[15:52:54] <_joe_> there will be spam of recoveries in #operations, but now it's safe to deploy mediawiki
[15:53:14] <_joe_> I gotta step afk for ~ 20 minutes soon, but everything is back to normal AFAICT
[16:02:19] backups on drmrs rerun successfully now BTW
[16:14:10] _joe_: not seeing any recovery
[16:15:01] It's stopped sending new but the old ones haven't spammed ok
[16:15:51] <_joe_> RhinosF1: it's not a fast check
[16:16:05] understood
[16:16:11] <_joe_> but also I think it should have recovered over ~ 30 minutes
[16:19:17] One just came in
[16:20:19] <_joe_> yeah I forced the recheck
[16:35:58] inflatador: let me know if you want to import some existing repo that should not be under a personal user name
[16:50:10] <_joe_> anyone else in a US-like TZ interested in following a replay of the db/appservers outage we had the other day?
[16:55:14] _joe_: yep
[16:55:34] is the ipmiseld.service issue known? ongoing deploy?
[16:56:34] jynus: the code was just merged in? https://gerrit.wikimedia.org/r/c/operations/puppet/+/766848
[16:56:40] what is the issue?
[16:56:53] I was trying to find the related patch, thank you!
[16:57:31] a lot of "CRITICAL - degraded: The following units failed: ipmiseld.service" in soft
[16:57:42] ah!
[16:58:26] now hard :-(
[16:59:22] ah ok
[16:59:35] "Error creating SDR cache '/var/cache/ipmiseld//ipmiseldsdrcache.localhost': filename invalid"
[16:59:54] looks like something is broken with the default config
[17:00:16] https://gerrit.wikimedia.org/r/c/operations/puppet/+/766848
[17:00:21] ^ seems related
[17:01:06] herron: do you want to fix the config, or remove the package for now?
[17:02:55] jhathaway: I think we may as well try to fix it
[17:05:10] herron: looks like this old bug, https://bugs.launchpad.net/ubuntu/+source/freeipmi/+bug/1912347
[17:05:23] i.e. supporting stretch is no fun
[17:06:04] yeah seems it's as simple as mkdir /var/cache/ipmiseld ? we could have puppet ensure this exists
[17:06:29] yeah
[17:15:07] hey godog moritzm could I bug you for a quick review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/767223 to address the alert shower in #-operations
[17:29:14] or if anyone else is around who could give a +1? that's meant to create /var/cache/ipmiseld across the metal hosts in the fleet and I'd prefer not to self merge a fleet wide patch
[17:32:35] cwhite cdanis ty!
[17:37:53] we got a bunch of pages, is there any way to prevent/ack all of them on alertmanager to avoid them from paging on splunk?
[17:42:31] <_joe_> splunk?
[17:42:40] <_joe_> oh right victorops
[17:42:55] <_joe_> sorry I'm used to "splunk" being their main product
[17:43:01] "Hello, My Name is Victor Operations"
[17:43:03] <_joe_> and I was utterly confused
[17:43:29] yep, that one
[17:44:08] My emails are like "03/01 18:38 VictorOps [Splunk On-Call] 2 incidents..." so well xd
[18:30:34] herron: lgtm, I also found https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=793186 so that's also fixed in the Debian package for the next release
[18:47:18] nice, thanks moritzm
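
Note on the ipmiseld thread above: the fix discussed at 17:06:04 amounts to making sure /var/cache/ipmiseld exists before the service starts. The real change was a Puppet patch (767223) whose contents are not shown in this log, so the following is only a minimal Python sketch of the same idempotent "ensure the directory exists" idea. The function name `ensure_cache_dir` and the 0o755 mode are assumptions made for illustration, and the demo runs against a temporary directory, not the real path.

```python
import os
import tempfile


def ensure_cache_dir(path: str, mode: int = 0o755) -> bool:
    """Create `path` (and parents) if missing.

    Returns True if the directory had to be created, False if it
    already existed. Hypothetical helper, not taken from the patch.
    """
    existed = os.path.isdir(path)
    # exist_ok=True makes the call safe to repeat, which mirrors how a
    # configuration-management "ensure" behaves on every run.
    os.makedirs(path, mode=mode, exist_ok=True)
    return not existed


if __name__ == "__main__":
    # Demo in a scratch directory so nothing on the host is touched.
    with tempfile.TemporaryDirectory() as root:
        target = os.path.join(root, "var", "cache", "ipmiseld")
        print(ensure_cache_dir(target))  # first run creates the directory
        print(ensure_cache_dir(target))  # second run finds it in place
```

Running the helper twice is harmless, which is the property that matters for a fleet-wide rollout: hosts that already have the directory are left untouched.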