[05:19:44] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1162.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202107270519_mar...
[05:43:11] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1162.eqiad.wmnet'] ` and were **ALL** successful.
[05:50:22] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) db1162 reimaged (now checking tables). I just realised we also have db1129 with Stretch, which needs to be reimaged too. I will wait for db1162 to finish its check first.
[06:29:08] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (RKemper)
[07:20:02] s6 on eqiad only got reduced a lot
[07:20:14] Last dump for s6 at eqiad (db1140.eqiad.wmnet:3316) taken on 2021-07-27 00:00:01 is 109 GB, but previous one was 92 GB, a change of 18.3%
[07:20:34] oh no, I read it wrong
[07:20:46] it got increased, I am sure because of wikitech
[07:20:53] makes sense
[07:21:02] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) After a few days replicating, the following tables have been checked between s6 master (db1173) and m5 master (db1128), no differences found: ` revision te...
[07:21:09] I will ack the alert
[07:38:20] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) We plan to work on this today. Sadly, for some reason, phabricator didn't send me any email about this issue until the end of my day yesterday, so I had to ge...
[07:45:14] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (elukey)
[07:48:44] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[07:49:31] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[08:05:47] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) Data privacy has been checked: - the scripts didn't find anything - I manually checked all the actions from the triggers to make sure: -- the user table a...
[08:11:24] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) @TJones I've restored your old home folder onto mwmaint1002:/home/tjones/backup-restore-2021-07-13--05-05-51 It should have the same access permissions as th...
[08:22:12] DBA, wikitech.wikimedia.org: Move database for wikitech (labswiki) to a main cluster section - https://phabricator.wikimedia.org/T167973 (Marostegui) Database and grants in place: ` # for i in clouddb1015:3316 clouddb1019:3316 clouddb1021:3316; do mysql.py -h$i labswiki -e "show databases like 'labswiki...
[09:27:49] marostegui, will you have some time to check the revert of query killer on s3?
[09:28:45] yes
[09:31:30] check ops.event_log on, for example, db2074 and let me know what you think
[09:31:58] ok one sec
[09:33:15] I can just put things back the way they were, but maybe it is worth reviewing things (although it doesn't have to be now)
[09:34:51] let me send a patch, not to merge, but to understand what I changed and maybe make things more clear
[09:35:56] jynus: so, show events on db2074 looks good
[09:36:14] I miss for example, the queries that were killed on the sleeps
[09:36:19] that should be an easy fix
[09:36:33] let me show you what I did in a hurry
[09:38:48] this is what I did (not intended for merging): https://gerrit.wikimedia.org/r/c/operations/software/+/708259
[09:40:21] yep, and that is what needs to be reverted on db2074 (which is the only one I have checked for now)
[09:40:42] right?
[09:40:53] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:40:56] yeah, no worries about that
[09:41:06] my question is I felt very clumsy editing that
[09:41:20] maybe the numbers could be in a variable easier to change
[09:41:26] I have felt that way anytime I have had to play with those events
[09:41:30] they are complex :(
[09:41:36] and then other fixes/tunings
[09:41:44] not to discuss now
[09:41:52] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:42:04] but maybe we can open a ticket and we can propose improvements
[09:42:33] unless you have something that needs to be done soon-ish
[09:42:53] I don't have time at the moment to work on those events
[09:43:04] They require lots of testing
[09:43:08] any change I mean
[09:43:08] yes, hence my proposal to open a ticket :-D
[09:43:37] unless you have something that needs to be done soon-ish -> I answered that
[09:43:49] yep
[09:44:05] that is another question :-D
[09:44:26] I am a bit lost already, feel free to create the ticket about that and we'll see when we'll have time for it
[09:44:34] how would you feel about me opening a ticket to add a list of things we don't like about events
[09:44:50] sure
[09:45:12] and you can add things like what you said "I have felt that way anytime I have had to play with those events"
[09:45:27] I will add my own issues with them
[09:45:35] and maybe at some point I can even help
[09:46:21] my thought about the revert is "uf, yeah, let's revert this, but this felt not great"
[09:46:38] (the emergency change didn't feel great)
[09:50:30] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (elukey)
[09:52:16] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[09:54:17] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:54:31] I've deployed the revert, please keep an eye for issues or anything weird
[09:55:35] wilco thanks
[10:01:28] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui)
[10:02:02] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui) p:Triage→Medium
[10:02:37] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui)
[10:21:15] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[10:22:18] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[10:58:04] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (ArielGlenn)
[11:17:44] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[11:18:42] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (MoritzMuehlenhoff)
[11:28:43] Amir1, "Last dump for s4 at codfw (db2099.codfw.wmnet:3314) taken on 2021-07-27 00:00:02 is 255 GB, but previous one was 346 GB, a change of 26.4%"
[11:37:23] DBA: Switchover s2 from db2107 to db2104 - https://phabricator.wikimedia.org/T287454 (Marostegui)
[11:38:25] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[11:44:20] jynus: wohoooooo
[11:44:32] I should put the party hats on
[11:45:22] I think that sped up the backups around the same amount, a 25% faster already
[11:46:01] how long it takes now?
[11:46:52] let me find the task with the query
[11:47:27] jynus: https://phabricator.wikimedia.org/T275268 ?
[11:47:36] yes
[11:49:04] DBA, Commons, MediaWiki-File-management, MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 4 others: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 (jcrespo) ` root@db1159.eqiad.wmnet[dbbackups]> nopager; select star...
[11:49:06] ^ Amir1
[11:49:40] (the 2 lines is one for each datacenter)
[11:49:45] awesome
[11:49:50] indeed
[11:49:51] so basically cut to half already
[11:49:58] I am very happy!
[11:50:41] I wonder if it's going to have long-lasting effects on innodb buffer pool efficiency
[11:51:18] these rows we are cleaning are basically twice the size of the buffer pool for all of commons
[11:51:43] jynus: did es suffer?
[11:52:01] Amir1, as in extra size?
[11:52:09] yeah
[11:52:14] I can check
[11:52:23] I assume since it has a numeric PK, the backup time wouldn't go up
[11:52:43] I think the backup for es hasn't finished yet, we do it with very low concurrency because we don't have dedicated hosts for that
[11:53:12] yeah, still ongoing, I can check tomorrow
[11:54:15] cool
[11:54:40] Awesome. After this, we can do things in commons then, looking forward to shrinking the table there
[12:21:11] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[13:19:58] marostegui: I think you asked for a ping when ircservserv is ready a month or so back, so this is your ping: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/YMIDTLZA3ZPR4DJYRACLU4DVE3PN5O6V/
[13:21:11] majavah: ah cool, I will check later
[13:21:15] thanks
[13:44:16] DBA, Commons, MediaWiki-File-management, MW-1.37-notes (1.37.0-wmf.14; 2021-07-12), and 4 others: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 (Ladsgroup) *puts his party hat on*
[14:24:04] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[14:24:15] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) db1162 check was ok
[14:24:57] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui)
[14:27:55] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (ops-monitoring-bot) Script wmf-auto-reimage was launched by marostegui on cumin1001.eqiad.wmnet for hosts: ` ['db1129.eqiad.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/202107271427_mar...
[14:35:23] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (TJones) Thanks, @jcrespo! It looks like everything I need is there.
[14:37:28] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Vgutierrez)
[14:37:36] DBA, Data-Services, cloud-services-team (Kanban): Prepare and check storage layer for banwikisource - https://phabricator.wikimedia.org/T284390 (Nintendofan885)
[14:37:55] DBA, Data-Services: Prepare and check storage layer for banwikisource - https://phabricator.wikimedia.org/T286684 (Nintendofan885)
[14:43:56] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[14:47:10] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 4 host(s) and their services with reason: Eqiad row B maintenance ` c...
[14:51:18] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (ops-monitoring-bot) Completed auto-reimage of hosts: ` ['db1129.eqiad.wmnet'] ` and were **ALL** successful.
[14:51:32] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row B maintenance ` a...
[14:52:29] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Vgutierrez)
[14:53:00] DBA, Patch-For-Review: Upgrade s2 to Buster + MariaDB 10.4 - https://phabricator.wikimedia.org/T287230 (Marostegui) db1129 reimaged, now checking its tables.
[14:55:28] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin1001 for 1:00:00 1 host(s) and their services with reason: Eqiad row B maintenance ` l...
[14:55:56] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Vgutierrez)
[14:57:23] marostegui: i assume https://phabricator.wikimedia.org/T287481 can be closed because you were reimaging it
[14:57:31] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[14:57:42] RhinosF1: Yep, thanks!
[15:08:57] Data-Persistence-Backup, SRE, bacula: Restore ~tjones/reindex directory from mwmaint1002 - https://phabricator.wikimedia.org/T287304 (jcrespo) Open→Resolved
[15:10:03] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[15:12:32] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui) >>! In T286061#7231980, @Marostegui wrote: > m1-master.eqiad.wmnet switched over to dbproxy1012 which is on row A. Once this row is...
[15:16:00] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[15:18:12] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (Marostegui)
[15:18:35] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm)
[15:19:12] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Bstorm)
[15:20:52] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[17:09:25] DBA, Infrastructure-Foundations, SRE, netops, Patch-For-Review: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney) Open→Resolved
[19:41:34] DBA, Analytics, Event-Platform, WMF-Architecture-Team, Services (later): Consistent MediaWiki state change events | MediaWiki events as source of truth - https://phabricator.wikimedia.org/T120242 (Ottomata) > the idea of a commit log that contains the entire history of all events [...] I gath...
[21:15:10] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Jclark-ctr) received replacement drive
[21:17:32] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Jclark-ctr) @Marostegui can this drive be replaced?
[23:13:34] Do we still have separate replicas for `watchlist` queries?
[23:13:48] It's a little hard for me to tell because it's in dbctl now...