[00:54:42] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [01:06:54] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [01:10:24] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [01:20:50] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [01:34:48] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 21.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [03:16:02] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [03:31:46] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 3.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [03:38:44] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag 
https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [04:05:02] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 21.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [04:13:48] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [04:40:08] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [04:50:56] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui) s2 eqiad [x] dbstore1004 [] db1182 [x] db1171 [] db1170 [] db1162 [] db1156 [] db1155 [x] db1146 [] db1129 [] db1122 [x] db1105 [x] db1102 [] clouddb1021 [] clouddb1... [04:50:59] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui) s2 eqiad [x] dbstore1004 [] db1182 [x] db1171 [] db1170 [] db1162 [] db1156 [] db1155 [x] db1146 [] db1129 [] db1122 [x] db1105 [x] db1102 [] clouddb1021 [] clouddb1... [04:51:02] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui) s2 eqiad [x] dbstore1004 [] db1182 [x] db1171 [] db1170 [] db1162 [] db1156 [] db1155 [x] db1146 [] db1129 [] db1122 [x] db1105 [x] db1102 [] clouddb1021 [] clouddb1... 
[04:56:15] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui) [04:56:25] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui) [04:56:35] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui) [04:57:18] PROBLEM - MariaDB sustained replica lag on db1129 is CRITICAL: 4.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1129&var-port=9104 [04:57:20] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [04:58:18] RECOVERY - MariaDB sustained replica lag on db1129 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=db1129&var-port=9104 [04:59:00] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui) [04:59:02] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui) [04:59:04] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui) [05:00:30] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [05:06:26] RECOVERY - MariaDB sustained replica lag on pc2008 
is OK: (C)2 ge (W)1 ge 0.8 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [05:40:23] 10DBA, 10Patch-For-Review: Upgrade s3 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283131 (10Marostegui) Upgraded all s3 codfw to 10.4.19 [07:04:35] 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo) a:05jcrespo→03None The backup finished, JobId=338470: ` Elapsed time: 14 hours 53 mins 5 secs SD Files Written: 6,117,027 SD Bytes Written:... [07:22:18] 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10Ladsgroup) It sounds good to me. [07:24:16] 10Data-Persistence-Backup, 10SRE, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo) Could you give me some meaningful restore operation (subdir). I am guessing recovering all will not be wanted because of time and space available. I can recover... [07:41:44] 10Data-Persistence-Backup, 10SRE, 10netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (10jcrespo) Thanks, that analysis is very useful. I feel we are making lots of progress already on understan... [08:08:10] 10DBA, 10MediaWiki-Parser, 10Performance-Team, 10Parsoid (Tracking), 10Patch-For-Review: purgeParserCache.php should not take over 24 hours for its daily run - https://phabricator.wikimedia.org/T282761 (10Kormat) The purge has finished as of 2021-05-26T06:00Z. I'll start the optimize process now. 
[08:34:08] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo)
[08:38:52] Data-Persistence-Backup, SRE, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) Hi Jamie, Thanks for the feedback. I think given the desire to push the WAN links relatively h...
[08:39:31] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (jcrespo) a:jcrespo→Kormat AFAICS, all of s6 is in buster/10.4: {F34469117}
[08:47:27] DBA, Patch-For-Review: Upgrade s6 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T280751 (Kormat) Open→Resolved 🎉
[08:58:32] Data-Persistence-Backup: Backup alert email notification - https://phabricator.wikimedia.org/T283017 (jcrespo) This is not super urgent, but @LSobanski if you could think of more details of what kind of information exactly you would like to have more accessible about database backups specifically (we will al...
[09:00:34] Data-Persistence-Backup: Backup alert email notification - https://phabricator.wikimedia.org/T283017 (jcrespo) Offtopic, but something to think in the future is also how we want to present/integrate metadata about the several workflows of backups (database backups vs general backups (bacula) vs media backups...
[09:02:02] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui) s1 is fully done, only pending the master [09:02:05] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui) s1 is fully done, only pending the master [09:02:08] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui) s1 is fully done, only pending the master [09:02:32] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of img_timestamp and making it binary(14) - https://phabricator.wikimedia.org/T273360 (10Marostegui) [09:02:45] 10Blocked-on-schema-change, 10DBA: Schema change for watchlist.wl_notificationtimestamp going binary(14) from varbinary(14) - https://phabricator.wikimedia.org/T268392 (10Marostegui) [09:03:04] 10Blocked-on-schema-change, 10DBA: Schema change to turn user_last_timestamp.user_newtalk to binary(14) - https://phabricator.wikimedia.org/T266486 (10Marostegui) [09:48:21] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of user_touched - https://phabricator.wikimedia.org/T282373 (10Marostegui) [09:48:23] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of page_touched - https://phabricator.wikimedia.org/T282372 (10Marostegui) [09:48:27] 10Blocked-on-schema-change, 10DBA: Schema change for dropping default of ar_timestamp - https://phabricator.wikimedia.org/T282371 (10Marostegui) [09:55:52] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [10:01:22] RECOVERY - MariaDB sustained replica lag on 
pc2008 is OK: (C)2 ge (W)1 ge 0.2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [10:24:08] 10Data-Persistence-Backup: Backup alert email notification - https://phabricator.wikimedia.org/T283017 (10Marostegui) >>! In T283017#7114556, @jcrespo wrote: > This is not super urgent, but @LSobanski if you could think of more details of what kind of information exactly you would like to have more accessible ab... [10:30:28] 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10Marostegui) [10:40:32] 10DBA, 10Patch-For-Review: Upgrade s5 to Debian Buster and MariaDB 10.4 - https://phabricator.wikimedia.org/T283235 (10Marostegui) All s5 codfw buster hosts upgraded to 10.4.19 [10:49:44] kormat: can you upgrade mysql on pc1007 before repooling it? [11:10:43] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 3.8 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [11:15:23] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [11:32:31] marostegui: i certainly could [12:11:40] * sobanski stepping out for a while [12:47:21] 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10Ladsgroup) Something like `/var/lib/mailman/archives/private/cloud-announce.mbox/cloud-announce.mbox` and `/var/lib/mailman/lists/cloud-announce/config.pck` would be great. 
[13:31:04] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [13:32:54] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [13:56:36] 10DBA, 10Data-Services, 10Toolforge, 10Tracking-Neverending: Certain tools users create multiple long running queries that take all memory and/or CPU from labsdb hosts, slowing it down and potentially crashing (tracking) - https://phabricator.wikimedia.org/T119601 (10Pathoschild) [15:15:37] 10DBA, 10SRE, 10ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Papaul) a:05Papaul→03Marostegui @Marostegui disk replaced [15:20:59] 10Data-Persistence-Backup, 10Wikimedia-Mailing-lists: The Great Clean Up of Mailman2 - https://phabricator.wikimedia.org/T282303 (10jcrespo) The recovery as requested has been scheduled. FYI, there were other files inside /var/lib/mailman/lists/cloud-announce/, but were not marked for recovery. The files reco... [15:27:35] 10DBA, 10SRE, 10ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (10Marostegui) Thanks @Papaul - however the disk doesn't look to be rebuilding: ` seqNum: 0x000002e2 Time: Wed May 26 15:14:34 2021 Code: 0x000000b9 Class: 2 Locale: 0x04 Event Description: Enclosure PD 20(c None/... 
[15:27:44] DBA, SRE, ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Marostegui) a:Marostegui→Papaul
[15:28:17] DBA: db2107 idrac not responding - https://phabricator.wikimedia.org/T283727 (Marostegui)
[15:28:22] ^cause replacing a disk would be just too easy :(
[15:29:46] DBA: db2107 idrac not responding - https://phabricator.wikimedia.org/T283727 (Marostegui) Seems to be working locally, so it might need a reset
[15:34:47] DBA: db2107 idrac not responding - https://phabricator.wikimedia.org/T283727 (Marostegui) p:Triage→Medium
[15:35:52] DBA: db2107 idrac not responding - https://phabricator.wikimedia.org/T283727 (Marostegui) Open→Resolved a:Marostegui A cold reset worked: `# ipmitool -I lanplus -H db2107.mgmt.codfw.wmnet -U root -E mc reset cold Unable to read password from environment Password: Sent cold reset command to MC` H...
[15:35:55] DBA, SRE, ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Marostegui)
[16:10:43] going to reboot s3 codfw master (db2107) for onsite maintenance
[16:13:02] ^ that was s2 master
[16:19:55] DBA, SRE, ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Marostegui) After the reboot I can see the disk: ` Raw Size: 1.746 TB [0xdf8fe2b0 Sectors] Non Coerced Size: 1.745 TB [0xdf7fe2b0 Sectors] Coerced Size: 1.745 TB [0xdf7c0000 Sectors] Sector Size: 512 Logical Se...
[16:41:32] DBA, SRE, ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Papaul) a:Papaul→Marostegui
[16:48:07] db2107 has mysql back up
[16:48:17] DBA, SRE, ops-codfw: Degraded RAID on db2107 - https://phabricator.wikimedia.org/T282072 (Marostegui) This keeps progressing well: ` root@db2107:~# megacli -pdrbld -showprog -physdrv\[32:5\] -aALL Rebuild Progress on Device at Enclosure 32, Slot 5 Completed 37% in 31 Minutes. `
[16:48:28] the disk keeps progressing well ^
[16:48:35] I am going offline, it's been a hard day already
[16:48:54] DBA, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Papaul)
[16:49:06] DBA, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Papaul) p:Triage→Medium
[16:50:35] DBA, ops-codfw: codfw: db2079 memory issue on DIMM B8 - https://phabricator.wikimedia.org/T283743 (Marostegui) This is s8 master, so it needs some coordination. Let me know a day/time when you'd like to tackle this and I can have the host ready for you!
[18:09:05] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[18:16:25] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104
[18:24:53] Blocked-on-schema-change, DBA: Schema change for renaming page_timestamp index on revision table to rev_page_timestamp - https://phabricator.wikimedia.org/T283499 (LSobanski) p:Triage→Medium a:Kormat Assigning to Stevie to confirm if this can go into Ready.
[19:11:09] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [19:14:49] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [19:24:39] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [20:08:46] marostegui: o/, yt? q about some recommended mysql settings for airflow [20:08:47] https://airflow.apache.org/docs/apache-airflow/2.1.0/howto/set-up-database.html#setting-up-a-mysql-database [20:09:49] oh, sorry, backscroll says you've signed off. ok laters! 
I will ping you tomorrow :) [22:15:32] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.4 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [22:20:58] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 2.6 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [22:26:24] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.6 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [23:10:20] PROBLEM - MariaDB sustained replica lag on pc2008 is CRITICAL: 3 ge 2 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104 [23:13:58] RECOVERY - MariaDB sustained replica lag on pc2008 is OK: (C)2 ge (W)1 ge 0.4 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Replication_lag https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&var-server=pc2008&var-port=9104