[01:10:00] PROBLEM - MariaDB sustained replica lag on m1 on db2160 is CRITICAL: 6.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:10:04] PROBLEM - MariaDB sustained replica lag on m1 on db1117 is CRITICAL: 15.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[01:11:48] RECOVERY - MariaDB sustained replica lag on m1 on db2160 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2160&var-port=13321
[01:11:52] RECOVERY - MariaDB sustained replica lag on m1 on db1117 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1117&var-port=13321
[07:43:29] systemd timer updates are dangerous, because they trigger an execution
[08:34:16] is that documented behaviour, or a bug?
[08:48:03] I don't know, but for me it is surprising behaviour - especially if I am conceptually delaying it
[09:19:12] marostegui: do you need help to check dependencies (e.g. replication, etc.) for downtime?
[09:19:33] jynus: no, it should be fine, as only a couple of misc masters are pending
[09:19:36] Thanks though <3
[09:20:02] I am reviewing backup host dependencies myself
[09:20:07] cool
[09:44:27] dbprov1003 is getting a bit overloaded, checking
[09:46:06] it's expected io load, but it is making things quite slow :(
[09:46:33] how could mv take so many cpu cycles?
[09:46:46] disk issues?
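A quick way to observe the timer behaviour discussed above — whether updating a timer unit re-arms it and kicks off a run. This is a sketch; the `example-backup` unit name is illustrative, not one of the real production timers:

```shell
# When did the timer last run, and when will it fire next?
systemctl list-timers example-backup.timer

# After changing the unit file, reloading and restarting re-evaluates
# OnCalendar=; with Persistent=true a "missed" window can make the
# service start immediately instead of waiting for the next slot.
systemctl daemon-reload
systemctl restart example-backup.timer

# Check whether the restart triggered an execution right away:
systemctl status example-backup.service
```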
[09:47:41] sadly, host graphs regarding disk are not helpful
[09:48:36] I think I am going to create new ones, there is no io ops or wait graph
[09:50:09] the fact that the io speeds are cpu-bound is not great
[09:50:25] (on hds!)
[10:21:22] marostegui: jynus: I'm back and currently going through emails and messages, do you need anything high priority from me?
[10:21:30] nope
[10:21:57] note there is dc maintenance soon FYI
[10:33:30] thanks
[10:35:39] I've created https://phabricator.wikimedia.org/T329026
[14:27:11] marostegui: let's talk here, there is too much going on
[14:27:35] apologies for db2184, I should have coordinated it better with you
[14:27:48] nah, I just didn't see db2183 had a replica
[14:27:58] It is relatively new, no?
[14:27:59] some dbs are in a gray area between us
[14:28:08] and it is not clear who owns them
[14:28:38] I think something like a dashboard could have helped with coordination
[14:36:17] ms-fe2009 needed manual intervention to get systemd happy again
[14:36:18] ms-fe2012 not recovering?
[14:36:43] oh, it seems new^
[14:36:49] (just needed some restarts and a reset-failed on the swift_ring_manager service)
[14:37:11] jynus: looks OK to me (just got roll-restarted, but clear in nagios...)?
[14:37:33] yeah, it was in soft state just when I happened to see it
[14:37:53] 'k
[15:53:49] I am going to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/886812 early - if all goes as previous times, it should trigger a new backup run
[15:59:59] jynus: if you have time, can you help me figure out what's wrong with this? https://phabricator.wikimedia.org/T328255#8573172
[16:02:57] @ meeting but I can have a look later
[16:03:07] please ping me on the ticket or otherwise I will forget
[16:15:19] Sure.
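Since the host dashboards are missing io-ops/io-wait graphs, the same picture can be pulled from the host directly; a sketch assuming the standard sysstat and iotop tools are installed:

```shell
# Extended per-device stats: r/s, w/s, await and %util are exactly
# the "missing graphs" (5 samples, 1 second apart).
iostat -x 1 5

# System-wide view: a high "wa" column means the CPU is mostly
# waiting on IO rather than burning cycles on real work.
vmstat 1 5

# Per-process IO, to confirm the mv is what is saturating the disks.
iotop -o -b -n 3
```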
Thanks
[16:32:25] sigh, swift list [container] returns a different number of rows each time
[16:32:34] (and not a monotonic series either)
[16:46:15] marostegui: I am taking over and breaking db2102 for T328255
[16:46:16] T328255: Clean up core schema drifts in codfw - https://phabricator.wikimedia.org/T328255
[16:47:05] k thanks
[16:47:23] (I probably won't break it, but underpromise, overdeliver) :-)
[16:48:35] Amir1: which section is foundationwiki, s3?
[16:48:57] jynus: yup
[16:53:24] Amir1: not 100% sure, but I believe the issue is not the type, but the NOT NULL on a column that has NULL values
[16:53:50] mostly because row 10 contains a NULL on user_email
[16:54:19] aaaah, that makes more sense, but then why is the error "data truncated"?
[16:54:25] that's confusing
[16:54:25] one thing that could be done is DEFAULT '', but it is a bit dirty
[16:54:43] yeah, it is certainly not clear that it refers to the NULL bit
[16:55:13] https://github.com/wikimedia/mediawiki/blob/master/maintenance/tables-generated.sql#L833 how is this working?
[16:55:24] anyway, that explains it
[16:55:49] Amir1: I remember some table being mistaken in code but right on production
[16:56:03] but I don't remember which one it was
[16:56:23] the thing is, this is half of production, so foundationwiki in s3 - most of eqiad should be correct
[16:56:31] if that seems right, I will restore db2102 to a clean state
[16:56:38] yeah, thanks.
[16:56:57] I told you why it fails, now fixing it...
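A minimal reproduction of the failure discussed above; the table and column definitions are a sketch modeled on the conversation, not the real MediaWiki DDL:

```sql
-- Hypothetical test table with the same shape as the problem column.
CREATE TABLE user_test (user_email varbinary(255) DEFAULT NULL);
INSERT INTO user_test VALUES (NULL);

-- Tightening the column while a NULL row exists fails under strict
-- sql_mode, and the server reports it as "Data truncated" -- which
-- is the confusing part: the real problem is the NULL, not the type.
ALTER TABLE user_test MODIFY user_email varbinary(255) NOT NULL;

-- Either backfill first, then retry the ALTER:
UPDATE user_test SET user_email = '' WHERE user_email IS NULL;
ALTER TABLE user_test MODIFY user_email varbinary(255) NOT NULL;

-- ...or the "a bit dirty" option, adding a default for future writes:
-- ALTER TABLE user_test MODIFY user_email varbinary(255) NOT NULL DEFAULT '';
```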
it is not as easy sadly :-DDDD
[16:57:34] without context, as we don't require email for registering, it tells me that the column should be nullable
[16:57:52] yeah, but don't worry, I dealt with FlaggedRevs drifts, this is a cakewalk in comparison
[16:58:08] but I may lack context - if it is not, DEFAULT '' is another possibility, at least to unblock the type change
[16:58:52] we should enable TRADITIONAL SQL mode on CI and be more lenient on production, that's probably why this happened
[16:59:27] (app may allow NULL values there running in unsafe mode)
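The strict-vs-lenient split suggested above, sketched at session level (a real deployment would set sql_mode in my.cnf or the CI image rather than per session):

```sql
CREATE TABLE t (c varchar(3));

-- TRADITIONAL includes the strict modes: bad data is a hard error,
-- so a drift like this would fail loudly in CI.
SET SESSION sql_mode = 'TRADITIONAL';
INSERT INTO t VALUES ('abcdef');  -- error: data too long for column 'c'

-- Lenient mode: the same statement succeeds with only a warning and
-- the value is silently truncated to 'abc' -- which is how bad data
-- slips into production unnoticed.
SET SESSION sql_mode = '';
INSERT INTO t VALUES ('abcdef');
SHOW WARNINGS;
```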