[00:20:26] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[00:22:14] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[03:04:07] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[03:12:16] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[04:34:16] DBA, wikitech.wikimedia.org, User-Ladsgroup, cloud-services-team (Kanban): wikitech database has almost all of its varbinary fields wrong - https://phabricator.wikimedia.org/T269348 (Marostegui) Thank you!!
[04:40:33] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Marostegui) Not sure if this was maintenance or not, but this host rebooted again around 9h ago. ` root@db2100:~# uptime 04:40:17 up 9:28, 1 user, l...
[04:41:12] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (Marostegui) >>! In T280979#7129813, @jcrespo wrote: > Followup to our conversation, Re: T280979#7119805 > > @Marostegui Recently changed/setup/rebuilt hosts: >...
[04:48:39] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Marostegui) Resolved→Open It happened again: ` hpiLO-> show record35 status=0 status_tag=COMMAND COMPLETED Thu Jun 3 04:47:53 2...
[04:50:35] marostegui: btw https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?viewPanel=19&orgId=1&var-metric=p50&var-module=options&from=now-24h&to=now and https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?viewPanel=19&orgId=1&from=now-24h&to=now&var-metric=p99&var-module=options
[04:51:03] I assume this handles T280220 as well
[04:51:03] T280220: Error "Lock wait timeout exceeded" from User::loadFromDatabase (via API action=options) - https://phabricator.wikimedia.org/T280220
[04:53:01] oh wow
[04:53:03] nice one
[04:55:51] I wish I did it sooner :(
[04:56:06] summer gift!
[07:10:33] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) Hey, @Papaul, can you check this? If the stick was bad (and so not recognized/enabled), I wouldn't expect it to reboot again. However, apparent...
[07:11:39] Data-Persistence-Backup, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo) > MySQL isn't started here see: T283995#7131062 It was supposed to :-(.
[09:06:11] DBA, Patch-For-Review: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (Marostegui) s6 hasn't given any issues, so maybe we can start working on this next week (after 3 weeks since we switched s6) and attempt to do the switchover the 17th? @Kormat thoughts?
[09:07:31] DBA, Patch-For-Review: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 (Marostegui) I am going to start working on this next week
[09:09:27] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[09:09:29] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[09:09:32] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[09:23:16] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui) s1 eqiad [] dbstore1003 [x] db1184 [x] db1169 [x] db1164 [] db1163 [x] db1154 [x] db1140 [x] db1139 [x] db1135 [x] db1134 [x] db1133 [x] db1119 [x] db1118 [x] db1106...
[09:23:18] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui) s1 eqiad [] dbstore1003 [x] db1184 [x] db1169 [x] db1164 [] db1163 [x] db1154 [x] db1140 [x] db1139 [x] db1135 [x] db1134 [x] db1133 [x] db1119 [x] db1118 [x] db1106...
[09:23:21] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui) s1 eqiad [] dbstore1003 [x] db1184 [x] db1169 [x] db1164 [] db1163 [x] db1154 [x] db1140 [x] db1139 [x] db1135 [x] db1134 [x] db1133 [x] db1119 [x] db1118 [x] db1106...
[09:23:34] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[09:23:39] Data-Persistence-Backup, database-backups, Goal, Patch-For-Review: Upgrade pending stretch backup hosts to buster - https://phabricator.wikimedia.org/T280979 (jcrespo)
[09:23:46] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[09:23:58] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[10:08:52] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[10:08:53] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[10:08:57] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[10:17:11] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui) s3 eqiad [x] dbstore1004 [] db1179 [] db1175 [x] db1171 [] db1166 [] db1157 [] db1154 [] db1123 [] db1112 [x] db1102 [] clouddb1021 [] clouddb1017 [] clouddb1013
[10:17:13] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui) s3 eqiad [x] dbstore1004 [] db1179 [] db1175 [x] db1171 [] db1166 [] db1157 [] db1154 [] db1123 [] db1112 [x] db1102 [] clouddb1021 [] clouddb1017 [] clouddb1013
[10:17:21] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui) s3 eqiad [x] dbstore1004 [] db1179 [] db1175 [x] db1171 [] db1166 [] db1157 [] db1154 [] db1123 [] db1112 [x] db1102 [] clouddb1021 [] clouddb1017 [] clouddb1013
[10:17:47] DBA, MediaWiki-Parser, Performance-Team, Parsoid (Tracking), Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (Kormat) pc1010 is now pc2 primary, and is no longer replicating from pc1008: ` root@pc1010.eqiad.wm...
[10:46:27] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[10:58:04] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:00:52] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[11:05:18] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:32:24] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:34:12] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:50:30] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:51:32] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[11:54:08] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[11:56:56] PROBLEM - MariaDB sustained replica lag on pc2009 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[12:05:05] there seems to be some small lag on m1-codfw
[12:05:44] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 25 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[12:06:33] some heavy process, now gone: https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=misc&var-shard=m1&var-role=All&from=1622720865345&to=1622721960172&viewPanel=7
[12:14:48] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[12:18:38] RECOVERY - MariaDB sustained replica lag on pc2009 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2009&var-port=9104
[12:29:50] Blocked-on-schema-change, DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (Marostegui)
[12:30:07] Blocked-on-schema-change, DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (Marostegui)
[12:30:55] Blocked-on-schema-change, DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (Marostegui)
[12:42:42] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (Papaul) First error ` Memory Error Threshold Exceeded (Processor 1, DIMM 5) ` second error ` Uncorrectable Memory Error (Processor 1, DIMM 6) ` Third e...
[12:44:54] DBA, Analytics-Clusters, Analytics-Kanban, Patch-For-Review: dbstore1004 85% disk space used. - https://phabricator.wikimedia.org/T283125 (Kormat)
[12:44:57] DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (Kormat) Open→Resolved "just" done :) It's back in tendril+zarcillo, and is a replica of db1124.
[12:45:55] DBA, Data-Persistence-Backup, ops-codfw: db2100 rebooted, mysqld alerted after to say it hadn't started - https://phabricator.wikimedia.org/T283995 (jcrespo) :-( Thank you, Papaul!
[12:46:37] DBA: Re-image (rename) dbstore1006 into db1125 - https://phabricator.wikimedia.org/T284128 (Marostegui) Thank you! PS: Orchestrator detected it automatically too! <3
[13:22:44] marostegui: I have more schema changes for you to assign to kormat
[13:22:50] T279982
[13:22:51] T279982: Add index on oi_timestamp - https://phabricator.wikimedia.org/T279982
[13:22:56] creating its ticket now
[13:23:01] hahaha
[13:23:14] image table?
[13:23:18] RIP
[13:23:19] oldimage
[13:23:23] ah good
[13:23:25] it is not _that_ big
[13:23:43] oh and also, we made a lot of progress with the image table mess
[13:23:51] can I truncate it now?
[13:23:53] haven't been merged yet but there's a huge patch by Tim
[13:24:09] sure then blame it on kormat
[13:24:14] hahaha
[13:24:24] 🥀
[13:24:35] so, how much % you think we can remove from it?
[13:25:12] the latest estimation was around 90%
[13:25:19] oh my...
[13:25:56] Amir1: so a truncate would only remove 10% more than desired. that's pretty close, really.
[13:26:18] hahaha
[13:26:33] :D
[13:26:48] Amir1: do you have any rough ETA on when all this can happen?
[13:27:16] I hope to get it started the week after but it'll take a while to be properly cleaned up
[13:27:24] Amir1: and most importantly, where's that 90% going?
[13:27:29] ES
[13:27:33] but compressed
[13:27:54] <3
[13:27:55] basically my old patch but cleaner
[13:28:01] hahaha
[13:28:14] this is a very great improvement
[13:28:24] The image table is impossible to handle anymore
[13:28:29] nah, we still need to clean up links table
[13:28:48] yeah, those are great monsters too
[13:28:50] then I can sleep without nightmares
[13:29:32] I sleep a lot better since wb_terms is gone, to be honest
[13:33:29] https://usercontent.irccloud-cdn.com/file/g0khNmgK/image.png
[13:33:32] marostegui: ^
[13:33:50] hahahahahaha
[13:35:58] Blocked-on-schema-change, DBA: SChema change for adding oi_timestamp on oldimage table - https://phabricator.wikimedia.org/T284221 (Ladsgroup)
[13:36:09] Blocked-on-schema-change, DBA: Schema change for adding oi_timestamp on oldimage table - https://phabricator.wikimedia.org/T284221 (Ladsgroup)
[13:45:34] PROBLEM - MariaDB sustained replica lag on db2132 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[13:47:24] RECOVERY - MariaDB sustained replica lag on db2132 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db2132&var-port=9104
[13:53:11] I think that's the best meme I have seen in a long time
[13:55:36] :D
[13:56:31] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[14:05:32] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[15:06:43] kormat: can you take a quick look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/697618 ?
[15:06:49] :)
[15:19:09] PROBLEM - MariaDB sustained replica lag on pc2010 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[15:20:44] ottomata: hey. it's been on my radar, but with the percona training this week things have been a bit hectic. i'll make sure i get to it tomorrow morning
[15:20:52] oh right!
[15:21:02] ok kormat actually no hurry on it its ok!
[15:21:47] ok cool 👍
[15:30:32] RECOVERY - MariaDB sustained replica lag on pc2010 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2010&var-port=9104
[16:21:32] PROBLEM - MariaDB sustained replica lag on pc2007 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[16:23:18] RECOVERY - MariaDB sustained replica lag on pc2007 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2007&var-port=9104
[22:07:22] Data-Persistence-Backup, database-backups, Patch-For-Review, cloud-services-team (Kanban): Use mariabackup instead of xtrabackup for galera backups? (Or possibly for all maria backups?) - https://phabricator.wikimedia.org/T284157 (Bstorm) If I'm reading what @jcrespo said correctly, it would pro...