[05:14:14] DBA, SRE, ops-codfw: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (Marostegui) Open→Resolved Closing this as the modules were swapped, we'll see if it happens again, if so, let's reopen. BIOS and firmware were upgraded too
[06:02:16] marostegui: btw https://grafana.wikimedia.org/d/000000278/mysql-aggregated?viewPanel=7&orgId=1&var-site=codfw&var-group=core&var-shard=s5&var-role=All this is flaggedrevs logs being cleaned up in dewiki, maybe we should optimize its logging table before switching back (I'll make a ticket once done)
[06:50:33] Amir1: sounds good yeah, just create a task and assign it to me when ready
[06:50:54] sure
[07:02:12] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui) m1-master.eqiad.wmnet switched over to dbproxy1012 which is on row A. Once this row is done, we need to revert that.
[07:02:42] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[07:26:10] DBA: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[07:26:27] DBA: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) p:Triage→Medium
[07:28:29] s2 next?
[07:28:43] yep
[07:28:46] that works for you?
[07:28:58] yeah, timeframe more or less?
[07:29:09] next week?
[07:29:14] I would want to start working on eqiad next week, but not codfw
[07:29:41] so the only change for this is that I will reimage dbprovX002
[07:30:08] I will add it to the list of tasks
[07:30:33] sounds good!
thanks
[07:30:48] I will replace the "Available backup source" with that, as we have all backup sources available now
[07:31:03] (no confirmation needed :-D)
[07:31:07] Could I work on upgrading the eqiad candidate master next week or should I wait for you first?
[07:31:43] yes as in, you should wait, but you won't need to wait a lot, I can have it done by monday
[07:31:50] ah no worries yeah
[07:32:00] just let me know when I can proceed, no rush
[07:32:04] the only thing is
[07:32:25] once I reimage, we shouldn't take a lot of time between reimage and master upgrade
[07:32:45] so let me know which day you are aiming for and I will sync with you
[07:32:46] yeah, that's fine, I can do candidate one day and switchover eqiad master the following if all goes fine
[07:32:54] I am talking about eqiad for now
[07:33:07] yeah, between dcs there is no issue
[07:33:12] it can be a long time
[07:33:55] what I "need" is not a long time between dbprov reimage for a dc and the upgrade of the master/complete upgrade of the same dc
[07:34:08] DBA: Reboot, upgrade firmware and kernel of db1096-db1106, db2071-db2092 - https://phabricator.wikimedia.org/T216240 (Marostegui)
[07:34:20] yeah, it won't take more than 2 days if there are no issues with the reimages
[07:34:21] e.g. I upgrade it on monday and you do the master stuff the next day
[07:34:36] (but the day doesn't matter)
[07:35:25] so should I do eqiad on monday, or should I wait?
[07:36:04] that's good, I can do the candidate on tuesday and the master on wed
[07:36:10] perfect
[07:36:22] again, no rush- if you need more time, I can delay
[07:36:32] no, that's good
[07:36:35] but I may need a day in case the reimage fails or something goes wrong
[07:36:40] yep
[07:36:50] even if it should be in theory a 5 minute task :-D
[07:37:13] ok, I will add the items and comment there, thank you for your help
[07:37:19] cheers
[07:40:53] DBA: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo)
[07:46:56] DBA: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo) So the "change" for this run is that now, after s2 change, we will have so little number of stretch backups that we should increase the number of buster dbprovs from 1 to 2 (out of 3), so we can distribute backup g...
[08:29:24] marostegui: waait. didn't we say we were going to avoid doing section upgrades while we're in codfw?
[08:29:35] otherwise we'll need to upgrade codfw before we switch back
[08:30:07] kormat: Yeah, we said avoid doing partial section upgrades (as in upgrading eqiad only and then switch back directly to 10.4)
[08:30:21] I am planning to do a full one including a switchover
[08:30:32] ok, masochistegui
[08:30:40] hahaha
[08:34:55] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (jcrespo) ^I have my patches ready, will wait for Monday morning backups to finish and then reimage dbprov1002.
[10:01:04] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) Current status after the work done this week: - wikitech database placed in eqiad s6 hosts - Replication is enabled on eqiad s6 master (db1173), with multi...
[10:24:04] FYI, I had this warning from yesterday: "Last snapshot for s6 at eqiad (db1140.eqiad.wmnet:3316) taken on 2021-07-21 21:50:02 is 587 GB, but previous one was 557 GB, a change of 5.3%"
[11:48:16] DBA: Considering switching innodb_checksum_algorithm=full_crc32 - https://phabricator.wikimedia.org/T287244 (Marostegui) p:Triage→Medium
[11:49:29] DBA: Considering switching innodb_checksum_algorithm=full_crc32 - https://phabricator.wikimedia.org/T287244 (Marostegui) I am going to turn this ON for now on the new parsercache hosts on both eqiad and codfw, and see how they do (ie: no unexpected crashes and such). If that goes well, I will enable it on al...
[21:31:26] Data-Persistence-Backup: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (TJones)
[21:47:31] Data-Persistence-Backup, Data-Persistence, SRE: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (RLazarus) My naive attempt at https://wikitech.wikimedia.org/wiki/Bacula#Restore_(aka_Panic_mode) went fine until the decryption phase, at which point "Error...
[22:20:52] Data-Persistence-Backup, Data-Persistence, SRE: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (RLazarus) (Oh, and the timestamp came from T267607#7208278.)