[01:10:00] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:04] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 15.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:48] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:52] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:43:29] systemd timer updates are dangerous, because they trigger an execution
[08:34:16] is that documented behaviour, or a bug?
[08:48:03] I don't know, but for me it is surprising behaviour - especially if I am conceptually delaying it
[09:19:12] marostegui: do you need help to check dependencies (e.g. replication, etc.) for downtime?
[09:19:33] jynus: no, it should be fine, as only a couple of misc masters are pending
[09:19:36] Thanks though <3
[09:20:02] I am reviewing backup host dependencies myself
[09:20:07] cool
[09:44:27] dbprov1003 is getting a bit overloaded, checking
[09:46:06] it's expected io load, but it is making things quite slow :(
[09:46:33] how could mv take so many cpu cycles?
[09:46:46] disk issues?
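A quick way to observe the timer behaviour discussed above — whether updating a timer unit re-arms it and kicks off a run. This is a sketch; the `example-backup` unit name is illustrative, not one of the real production timers:

```shell
# When did the timer last run, and when will it fire next?
systemctl list-timers example-backup.timer

# After changing the unit file, reloading and restarting re-evaluates
# OnCalendar=; with Persistent=true a "missed" window can make the
# service start immediately instead of waiting for the next slot.
systemctl daemon-reload
systemctl restart example-backup.timer

# Check whether the restart triggered an execution right away:
systemctl status example-backup.service
```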
[09:47:41] sadly, host graphs regarding disk are not helpful
[09:48:36] I think I am going to create new ones, there is no io ops or wait graph
[09:50:09] the fact that the io speeds are cpu-bound is not great
[09:50:25] (on hds!)
[10:21:22] marostegui: jynus: I'm back and currently going through emails and messages, do you need anything high priority from me?
[10:21:30] nope
[10:21:57] note there is dc maintenance soon FYI
[10:33:30] thanks
[10:35:39] I've created https://phabricator.wikimedia.org/T329026
[14:27:11] marostegui: let's talk here, there is too much going on
[14:27:35] apologies for db2184, I should have coordinated it better with you
[14:27:48] nah, I just didn't see db2183 had a replica
[14:27:58] It is relatively new, no?
[14:27:59] some dbs are in a gray area between us
[14:28:08] and it is not clear who owns them
[14:28:38] I think something like a dashboard could have helped with coordination
[14:36:17] ms-fe2009 needed manual intervention to get systemd happy again
[14:36:18] ms-fe2012 not recovering?
[14:36:43] oh, it seems new^
[14:36:49] (just needed some restarts and a reset-failed on the swift_ring_manager service)
[14:37:11] jynus: looks OK to me (just got roll-restarted, but clear in nagios...)?
[14:37:33] yeah, it was in soft state just when I happened to see it
[14:37:53] 'k
[15:53:49] I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/886812 early - if all goes as previous times, it should trigger a new backup run
[15:59:59] jynus: if you have time, can you help me figure out what's wrong with this? https://phabricator.wikimedia.org/T328255#8573172
[16:02:57] @ meeting but I can have a look later
[16:03:07] please ping me on the ticket or otherwise I will forget
[16:15:19] Sure.
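Since the host dashboards are missing io-ops/io-wait graphs, the same picture can be pulled from the host directly; a sketch assuming the standard sysstat and iotop tools are installed:

```shell
# Extended per-device stats: r/s, w/s, await and %util are exactly
# the "missing graphs" (5 samples, 1 second apart).
iostat -x 1 5

# System-wide view: a high "wa" column means the CPU is mostly
# waiting on IO rather than burning cycles on real work.
vmstat 1 5

# Per-process IO, to confirm the mv is what is saturating the disks.
iotop -o -b -n 3
```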
Thanks
[16:32:25] sigh, swift list [container] returns a different number of rows each time
[16:32:34] (and not a monotonic series either)
[16:46:15] marostegui: I am taking over and breaking db2102 for T328255
[16:46:16] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:47:05] k thanks
[16:47:23] (I probably won't break it, but underpromise, overdeliver) :-)
[16:48:35] Amir1: which section is foundationwiki, s3?
[16:48:57] jynus: yup
[16:53:24] Amir1: not 100% sure, but I believe the issue is not the type, but the NOT NULL on a column that has NULL values
[16:53:50] mostly because row 10 contains a NULL on user_email
[16:54:19] aaaah, that makes more sense, but then why is the error "data truncated"?
[16:54:25] that's confusing
[16:54:25] one thing that could be done is DEFAULT '', but it is a bit dirty
[16:54:43] yeah, it is certainly not clear that it refers to the NULL bit
[16:55:13] https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables-generated.sql#L833 how is this working?
[16:55:24] anyway, that explains it
[16:55:49] Amir1: I remember some table being mistaken in code but right on production
[16:56:03] but I don't remember which one it was
[16:56:23] the thing is, this is half of production, so foundationwiki in s3 - most of eqiad should be correct
[16:56:31] if that seems right, I will restore db2102 to a clean state
[16:56:38] yeah, thanks.
[16:56:57] I told you why it fails, now fixing it...
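A minimal reproduction of the failure discussed above; the table and column definitions are a sketch modeled on the conversation, not the real MediaWiki DDL:

```sql
-- Hypothetical test table with the same shape as the problem column.
CREATE TABLE user_test (user_email varbinary(255) DEFAULT NULL);
INSERT INTO user_test VALUES (NULL);

-- Tightening the column while a NULL row exists fails under strict
-- sql_mode, and the server reports it as "Data truncated" -- which
-- is the confusing part: the real problem is the NULL, not the type.
ALTER TABLE user_test MODIFY user_email varbinary(255) NOT NULL;

-- Either backfill first, then retry the ALTER:
UPDATE user_test SET user_email = '' WHERE user_email IS NULL;
ALTER TABLE user_test MODIFY user_email varbinary(255) NOT NULL;

-- ...or the "a bit dirty" option, adding a default for future writes:
-- ALTER TABLE user_test MODIFY user_email varbinary(255) NOT NULL DEFAULT '';
```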
it is not as easy sadly :-DDDD
[16:57:34] without context, as we don't require email for registering, it tells me that the column should be nullable
[16:57:52] yeah, but don't worry, I dealt with FlaggedRevs drifts, this is a cakewalk in comparison
[16:58:08] but I may lack context - if it is not, DEFAULT '' is another possibility, at least to unblock the type change
[16:58:52] we should enable TRADITIONAL SQL mode on CI and be more lenient on production, that's probably why this happened
[16:59:27] (app may allow NULL values there running in unsafe mode)
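The strict-vs-lenient split suggested above, sketched at session level (a real deployment would set sql_mode in my.cnf or the CI image rather than per session):

```sql
CREATE TABLE t (c varchar(3));

-- TRADITIONAL includes the strict modes: bad data is a hard error,
-- so a drift like this would fail loudly in CI.
SET SESSION sql_mode = 'TRADITIONAL';
INSERT INTO t VALUES ('abcdef');  -- error: data too long for column 'c'

-- Lenient mode: the same statement succeeds with only a warning and
-- the value is silently truncated to 'abc' -- which is how bad data
-- slips into production unnoticed.
SET SESSION sql_mode = '';
INSERT INTO t VALUES ('abcdef');
SHOW WARNINGS;
```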