[05:19:43] PROBLEM - MariaDB sustained replica lag on x2 on db2143 is CRITICAL: 94 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2143&var-port=9104
[05:19:52] there we go
[05:20:37] PROBLEM - MariaDB sustained replica lag on x2 on db2144 is CRITICAL: 120.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[05:21:15] PROBLEM - MariaDB sustained replica lag on x2 on db2142 is CRITICAL: 217.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[06:08:23] RECOVERY - MariaDB sustained replica lag on x2 on db2143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2143&var-port=9104
[06:36:07] marostegui: <3 for the switchover
[06:36:13] Amir1: <3
[07:05:39] RECOVERY - MariaDB sustained replica lag on x2 on db2144 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[07:20:34] RECOVERY - MariaDB sustained replica lag on x2 on db2142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[10:44:00] I just rescheduled the team meeting on the 27th to a new slot, please let me know if that generally works
[10:44:07] for some reason it's not asking me to also update all future meetings...
[10:44:30] Works for me
[10:44:46] perhaps because jaime owns the meeting? not sure
[10:47:03] jynus: could you try migrating ownership of the team meeting to me for now?
[10:47:12] em, how?
[10:47:33] under 'options', there's a "change owner"
[10:47:39] in the ... menu
[10:49:17] did you get an email or something?
[10:49:34] yes :)
[10:49:57] ok I think that worked
[10:50:29] can you remove me from the individual guests now?
[10:50:47] done
[10:50:59] cool
[10:52:26] sorry for the notification spam, I had to make another change to get it to work
[10:59:34] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 9.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:00:54] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:38:39] I think the cassandra puppet code has a race or some stochastic failure - on aqs200{1-3} it worked eventually after a number of re-runs, but aqs2004 just keeps failing
[11:42:59] Error: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/Package[cassandra/logstash-logback-encoder]: Provider scap3 is not functional on this host
[11:51:31] https://phabricator.wikimedia.org/P29736 full puppet run; it's not very informative other than that something is wrong with scap3 :-/
[12:32:19] Info: Applying configuration version '(92ef73694e) Muehlenhoff - scap: remove scap Debian package from targets' seems to have fixed it...
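
For reference, the "Provider scap3 is not functional on this host" error at 11:42:59 generally means puppet cannot execute scap on the target. A minimal sketch of how one might check that by hand on an affected host - the symlink target matches what the discussion below works out, and run-puppet-agent is assumed to be the usual WMF wrapper (elsewhere, plain `puppet agent -t` would do):

    # sketch: check whether scap is actually runnable on an affected host (e.g. aqs2004)
    readlink /usr/bin/scap    # expected: /var/lib/scap/scap/bin/scap
    test -x "$(readlink -f /usr/bin/scap 2>/dev/null)" \
      && echo "scap looks functional" \
      || echo "dangling symlink - the scap3 provider will fail"
    sudo run-puppet-agent     # re-run the agent to reproduce the failure
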
[12:32:52] Oh, no, different failure
[12:33:57] then it works.
[12:34:36] but I think removing the scap package was key to it, so I think moritzm's change might have fixed my problem...
[12:35:57] scap is still present, though
[12:36:15] just no longer as a deb
[12:37:16] Mmm, but presumably in a manner that meant puppet could drive it usefully
[12:59:43] mydumper streaming backups! https://www.percona.com/blog/mydumper-stream-implementation/
[13:19:06] Emperor: Sorry to trouble you, are you working on the aqs2* nodes at the moment? If so, would it be possible to downtime the services? We're getting quite a lot of Icinga noise in #wikimedia-analytics from them.
[13:21:33] btullis: I'm re-imaging the new ones back to buster; that process should downtime them, I think, although the services themselves will get sorted when urandom comes online later
[13:22:10] he can't reimage them, so I'm trying to get them all reimaged so he can get them working during his working day, IYSWIM
[13:22:23] It would be easier if puppet actually worked on these nodes, though :(
[13:22:53] moritzm: I think it was a red herring, aqs2005 is back to failing in the same way :(
[13:25:42] Cool, thanks. I definitely don't want to get in the way of your work. I've just enabled more contactgroups in Icinga for a bunch of hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/804593) and it just happens that we're getting a bit spammed. I can bulk download the services on these hosts from the Icinga UI if it helps.
[13:25:58] s/download/downtime/
[13:26:18] that might be useful (though I thought the reimage cookbook handled downtiming hosts)
[13:28:55] scap> I'm guessing the problem is that /var/lib/scap is empty (so /usr/bin/scap is a dangling symlink)
[13:30:57] I'm not sure how that's meant to be deployed - is there some push process from a deploy-master that needs to happen?
[13:31:17] Thanks. I added 48 hours of downtime for all services on aqs2* hosts.
[13:32:02] moritzm: do you know how /var/lib/scap/scap is meant to get populated?
[13:33:57] Oh, reading your scrollback, I remember having an issue relating to logstash-logback-encoder on Cassandra. See if this helps at all: https://phabricator.wikimedia.org/T297460#7601771
[13:36:11] it gets triggered from the deployment hosts; the list of hosts to deploy to is retrieved from puppetdb
[13:36:19] btullis: AFAICT when it works, something on the target machine has run scap deploy-local - but it can't do that without a working /usr/bin/scap, which is a symlink to /var/lib/scap/scap/bin/scap
[13:36:22] anything which uses the scap classes
[13:36:56] can you ping jnuche? he wrote the new setup; there was a remaining issue with the bootstrapping he mentioned
[13:37:01] that's possibly related
[13:37:13] moritzm: ah, yes, this feels like a bootstrap problem
[14:23:30] volans: is "in about 5 minutes once Emperor has made some tea" a good time to talk about what-next for the ssd-fettling cookbook?
[14:23:52] Emperor: sure, why not. Works for me
[14:24:09] cool
[14:24:13] brb :)
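
To make the scap3 flow from 13:36 concrete: per moritzm, deploys are pushed from the deployment hosts, with the target list resolved via puppetdb. A sketch of roughly what that looks like, assuming the usual /srv/deployment layout; the repo name is taken from the puppet error at 11:42:59, everything else is illustrative:

    # on a deployment host: push the repo out to all of its scap targets
    cd /srv/deployment/cassandra/logstash-logback-encoder
    scap deploy "bootstrap new aqs2* targets"

    # each target then effectively runs something like the following, which is
    # exactly what cannot happen while /usr/bin/scap is a dangling symlink:
    scap deploy-local --repo cassandra/logstash-logback-encoder
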
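On the Icinga noise: the 48 hours of bulk downtime btullis added at 13:31 via the UI can also be set from a cumin host, assuming the standard sre.hosts.downtime cookbook; the reason string here is illustrative:

    # downtime all aqs2* hosts (and their services) for 48 hours
    sudo cookbook sre.hosts.downtime --hours 48 -r "aqs2* reimaging" 'aqs2*'
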
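And on the mydumper link from 12:59: the Percona post describes a --stream mode in which dump files are written to stdout as they complete, so a logical backup can be piped straight into myloader on another host without staging the whole dump on disk first. A sketch, with placeholder host, database, and directory names:

    # stream a dump of one database directly into a remote myloader
    mydumper --stream --database=mydb --outputdir=/tmp/stream-out \
      | ssh db-target "myloader --stream --directory=/tmp/stream-in --overwrite-tables"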