[05:19:43] PROBLEM - MariaDB sustained replica lag on x2 on db2143 is CRITICAL: 94 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2143&var-port=9104
[05:19:52] there we go
[05:20:37] PROBLEM - MariaDB sustained replica lag on x2 on db2144 is CRITICAL: 120.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[05:21:15] PROBLEM - MariaDB sustained replica lag on x2 on db2142 is CRITICAL: 217.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[06:08:23] RECOVERY - MariaDB sustained replica lag on x2 on db2143 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2143&var-port=9104
[06:36:07] marostegui: <3 for the switchover
[06:36:13] Amir1: <3
[07:05:39] RECOVERY - MariaDB sustained replica lag on x2 on db2144 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2144&var-port=9104
[07:20:34] RECOVERY - MariaDB sustained replica lag on x2 on db2142 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2142&var-port=9104
[10:44:00] I just rescheduled the team meeting on the 27th to a new slot, please let me know if that generally works
[10:44:07] for some reason it's not asking me to also update all future meetings...
[10:44:30] Works for me
[10:44:46] perhaps because jaime owns the meeting? not sure
[10:47:03] jynus: could you try migrating ownership of the team meeting to me for now?
[10:47:12] em, how?
[10:47:33] under 'options', there's a "change owner"
[10:47:39] in the ... menu
[10:49:17] did you get an email or something?
[10:49:34] yes :)
[10:49:57] ok I think that worked
[10:50:29] can you remove me from the individual guests now?
[10:50:47] done
[10:50:59] cool
[10:52:26] sorry for the notification spam, I had to make another change to get it to work
[10:59:34] PROBLEM - MariaDB sustained replica lag on m1 on db2078 is CRITICAL: 9.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:00:54] RECOVERY - MariaDB sustained replica lag on m1 on db2078 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2078&var-port=13321
[11:38:39] I think the cassandra puppet code has a race or some stochastic failure - on aqs200{1-3} it worked eventually after a number of re-runs, but aqs2004 just keeps failing
[11:42:59] Error: /Stage[main]/Cassandra::Logging/Scap::Target[cassandra/logstash-logback-encoder]/Package[cassandra/logstash-logback-encoder]: Provider scap3 is not functional on this host
[11:51:31] https://phabricator.wikimedia.org/P29736 full puppet run; it's not very informative other than that something is wrong with scap3 :-/
[12:32:19] Info: Applying configuration version '(92ef73694e) Muehlenhoff - scap: remove scap Debian package from targets' seems to have fixed it...
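
For reference, the "Provider scap3 is not functional on this host" error at 11:42:59 generally means puppet cannot execute scap on the target. A minimal sketch of how one might check that by hand on an affected host - the symlink target matches what the discussion below works out, and run-puppet-agent is assumed to be the usual WMF wrapper (elsewhere, plain `puppet agent -t` would do):

    # sketch: check whether scap is actually runnable on an affected host (e.g. aqs2004)
    readlink /usr/bin/scap    # expected: /var/lib/scap/scap/bin/scap
    test -x "$(readlink -f /usr/bin/scap 2>/dev/null)" \
      && echo "scap looks functional" \
      || echo "dangling symlink - the scap3 provider will fail"
    sudo run-puppet-agent     # re-run the agent to reproduce the failure
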
[12:32:52] Oh, no, different failure
[12:33:57] then it works.
[12:34:36] but I think removing the scap package was key to it, so I think moritzm's change might have fixed my problem...
[12:35:57] scap is still present, though
[12:36:15] just no longer as a deb
[12:37:16] Mmm, but presumably in a manner that meant puppet could drive it usefully
[12:59:43] mydumper streaming backups! https://www.percona.com/blog/mydumper-stream-implementation/
[13:19:06] Emperor: Sorry to trouble you, are you working on the aqs2* nodes at the moment? If so, would it be possible to downtime the services? We're getting quite a lot of Icinga noise in #wikimedia-analytics from them.
[13:21:33] btullis: I'm re-imaging the new ones back to buster; that process should downtime them, I think, although the services themselves will get sorted when urandom comes online later
[13:22:10] he can't reimage them, so I'm trying to get them all reimaged so he can get them working during his working day, IYSWIM
[13:22:23] It would be easier if puppet actually worked on these nodes, though :(
[13:22:53] moritzm: I think it was a red herring, aqs2005 is back to failing in the same way :(
[13:25:42] Cool, thanks. I definitely don't want to get in the way of your work. I've just enabled more contactgroups in Icinga for a bunch of hosts (https://gerrit.wikimedia.org/r/c/operations/puppet/+/804593) and it just happens that we're getting a bit spammed. I can bulk download the services on these hosts from the Icinga UI if it helps.
[13:25:58] s/download/downtime/
[13:26:18] that might be useful (though I thought the reimage cookbook handled downtiming hosts)
[13:28:55] scap> I'm guessing the problem is that /var/lib/scap is empty (so /usr/bin/scap is a dangling symlink)
[13:30:57] I'm not sure how that's meant to be deployed - is there some push process from a deploy-master that needs to happen?
[13:31:17] Thanks. I added 48 hours of downtime for all services on aqs2* hosts.
[13:32:02] moritzm: do you know how /var/lib/scap/scap is meant to get populated?
[13:33:57] Oh, reading your scrollback, I remember having an issue relating to logstash-logback-encoder on Cassandra. See if this helps at all: https://phabricator.wikimedia.org/T297460#7601771
[13:36:11] it gets triggered from the deployment hosts; the list of hosts to deploy to is retrieved from puppetdb
[13:36:19] btullis: AFAICT when it works, something on the target machine has run scap deploy-local - but it can't do that without a working /usr/bin/scap, which is a symlink to /var/lib/scap/scap/bin/scap
[13:36:22] anything which uses the scap classes
[13:36:56] can you ping jnuche? he wrote the new setup; there was a remaining issue with the bootstrapping he mentioned
[13:37:01] that's possibly related
[13:37:13] moritzm: ah, yes, this feels like a bootstrap problem
[14:23:30] volans: is "in about 5 minutes once Emperor has made some tea" a good time to talk about what-next for the ssd-fettling cookbook?
[14:23:52] Emperor: sure, why not. Works for me
[14:24:09] cool
[14:24:13] brb :)
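
To make the scap3 flow from 13:36 concrete: per moritzm, deploys are pushed from the deployment hosts, with the target list resolved via puppetdb. A sketch of roughly what that looks like, assuming the usual /srv/deployment layout; the repo name is taken from the puppet error at 11:42:59, everything else is illustrative:

    # on a deployment host: push the repo out to all of its scap targets
    cd /srv/deployment/cassandra/logstash-logback-encoder
    scap deploy "bootstrap new aqs2* targets"

    # each target then effectively runs something like the following, which is
    # exactly what cannot happen while /usr/bin/scap is a dangling symlink:
    scap deploy-local --repo cassandra/logstash-logback-encoder
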
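On the Icinga noise: the 48 hours of bulk downtime btullis added at 13:31 via the UI can also be set from a cumin host, assuming the standard sre.hosts.downtime cookbook; the reason string here is illustrative:

    # downtime all aqs2* hosts (and their services) for 48 hours
    sudo cookbook sre.hosts.downtime --hours 48 -r "aqs2* reimaging" 'aqs2*'
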
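And on the mydumper link from 12:59: the Percona post describes a --stream mode in which dump files are written to stdout as they complete, so a logical backup can be piped straight into myloader on another host without staging the whole dump on disk first. A sketch, with placeholder host, database, and directory names:

    # stream a dump of one database directly into a remote myloader
    mydumper --stream --database=mydb --outputdir=/tmp/stream-out \
      | ssh db-target "myloader --stream --directory=/tmp/stream-in --overwrite-tables"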