[05:01:28] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) I have switched m3-master from dbproxy1020 to dbproxy1016: https://gerrit.wikimedia.org/r/705789
[05:02:00] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[05:12:28] Blocked-on-schema-change, DBA, MW-1.37-notes (1.37.0-wmf.12; 2021-06-28): Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620 (Marostegui)
[05:13:06] Blocked-on-schema-change, DBA, MW-1.37-notes (1.37.0-wmf.12; 2021-06-28): Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620 (Marostegui) Open→Stalled All of eqiad is done - waiting for the switch back.
[05:17:28] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Marostegui) @Jclark-ctr did this disk arrive?
[05:55:34] DBA, Toolhub, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) >>! In T271480#7225145, @bd808 wrote: > >> * Will you be deploying the application in one or both DCs? > > This is a great question that I do not know the answer to definitive...
[07:05:30] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[08:33:55] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[08:36:08] DBA, Analytics, Infrastructure-Foundations, SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney) Open→Resolved
[08:51:33] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:52:41] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:52:51] kormat: any maintenance ongoing on s6 or expected today?
[08:52:59] I want to start playing with the wikitech migration
[08:53:19] marostegui: nothing you need to worry about 😇
[08:53:29] good! thank you
[08:54:02] Blocked-on-schema-change, DBA, AbuseFilter: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 (Kormat)
[08:54:53] Blocked-on-schema-change, DBA, AbuseFilter: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 (Kormat) Open→Stalled Stalling this until we switch back to eqiad.
[08:55:05] marostegui: all the schema changes we had scheduled are now done in eqiad
[08:55:13] \o/
[08:55:16] great news
[08:55:37] some of them may have to be re-done on the 2-3 failed hosts we have right now, but that's it
[08:56:17] they were not done via replication?
[08:56:33] hard to replicate when the host is dead, etc.
[08:56:48] ah. well, it depends on whether we need to restore the hosts from backups or not
[08:57:02] but backups would have the change too, no?
[08:57:06] e.g. we have 2 hosts with memory corruption. we probably want to at least do a full data check on them when they're fixed
[08:57:12] oh yeah, definitely
[08:57:13] marostegui: depends when the hosts are back/backups were run/etc
[08:57:18] true
[08:57:59] i have scripts for every schema change i ran, so it's trivial to check/re-run them if necessary
[08:58:18] cooool
[08:59:10] i think i'm at a place now where i could write a python program to handle most schema changes
[09:01:09] DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:01:22] one more mariadb server that crashed: db2097@s1
[09:01:28] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:02:13] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:04:50] jynus: oh, ow. another hardware failure
[09:05:02] :-(
[09:05:12] is there a pattern, as in same batch?
[09:05:22] "i think i'm at a place now where i could write a python program to handle most schema changes" -> <3<3
[09:05:31] db2097 has been having hardware failures since the 19th
[09:06:15] what is your recommendation regarding recovery, should I recover from an older backup?
[09:06:27] older than the 19th?
[09:06:42] and to a different server?
[09:07:31] jynus: i don't know if the memory failure can directly cause corruption. i'd run `mysqlcheck -A` and see what that says
[09:08:25] I don't mind being extra safe, even if it is more work for me, given it is the backup source
[09:08:43] last thing I want is corruption to "infect" other hosts
[09:09:13] however, that is a stretch host, and we have had a buster one ready for a long time
[09:09:59] let me handle the hw issue and I will ask for your thoughts on how to proceed, once I see the options
[09:10:25] I am going to start working on s6 eqiad, expect lag there
[09:15:28] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo)
[09:15:41] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) p: Triage→High a: jcrespo
[09:22:36] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) The s2 mariadb log on the same host seems clean. These are some weird hw logs: ` /map1/log1/record352 Targets Properties number=352 s...
[09:29:35] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) From the web interface: ` "ID","Severity","Class","Description","Last Update","Count","Category", "85","Critical","CPU","Uncorrectable Machine...
[09:53:40] DBA, Data-Persistence-Backup, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) a: jcrespo→Papaul As expected, the faulty memory module is only properly detected on reboot. ` free -g tota...
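For context on kormat's point above (a script per schema change, trivially re-runnable per host), such a runner might look as follows. This is a minimal sketch, not WMF's actual tooling: the host name, database, and depool/repool steps are hypothetical placeholders (production would go through dbctl and wait for replication between hosts), and the ALTER is modeled on the logging-table index renames tracked in T270620.

```python
#!/usr/bin/env python3
"""Minimal sketch of a per-host schema-change runner (hypothetical tooling)."""
import subprocess

# Illustrative change, modeled on the logging index renames in T270620.
ALTER = ("ALTER TABLE logging DROP INDEX type_time, "
         "ADD INDEX log_type_time (log_type, log_timestamp)")

def run_sql(host: str, db: str, sql: str) -> str:
    """Run a statement on one host via the mysql CLI; raises on failure."""
    return subprocess.run(
        ["mysql", "-h", host, "-N", db, "-e", sql],
        check=True, capture_output=True, text=True,
    ).stdout

def applied(host: str, db: str) -> bool:
    """True if the new index already exists, making re-runs safe to skip."""
    out = run_sql(host, db,
                  "SHOW INDEX FROM logging WHERE Key_name = 'log_type_time'")
    return bool(out.strip())

def main() -> None:
    hosts = ["db1111.eqiad.wmnet"]  # hypothetical host list for one section
    for host in hosts:
        if applied(host, "enwiki"):
            print(f"{host}: already applied, skipping")
            continue
        # A real runner would depool here (dbctl), wait, then repool after.
        run_sql(host, "enwiki", ALTER)
        print(f"{host}: done")

if __name__ == "__main__":
    main()
```

The idempotency check is what makes it "trivial to check/re-run" on the failed hosts mentioned above: re-running the whole script skips hosts where the change already landed.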
[10:22:25] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (hnowlan)
[10:27:41] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[10:28:27] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[10:37:10] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) We need to fix the craziness of the current partitioning on all servers: ` # cumin 'P:mediabackup::storage' 'lsblk -b /dev/sdc' 8 hosts will be targeted:...
[10:41:11] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) There is an initial grafana dashboard, but it will need a lot of work; it is almost unusable for now (not sure if because of the lack of activity, the m...
[11:15:05] DBA, Wikimedia-General-or-Unknown: Properly delete https://su.wikiquote.org/wiki/MédiaWiki:Enotif_body - https://phabricator.wikimedia.org/T286185 (Aklapper) Not sure who could look into this; adding #DBA and feel free to remove if I'm wrong. In short, https://su.wikiquote.org/wiki/MédiaWiki:Enotif_body?...
[12:50:23] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) @jcrespo I will request that HP send us a new DIMM
[13:01:48] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) Thank you!
[13:05:59] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[13:06:53] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (fgiunchedi)
[13:11:53] marostegui: jynus: the migration of the image table is now 15% done; I expect it to finish in two weeks.
[13:12:29] btw, if there are really large logging tables, let me know and I'll cross-check with flaggedrevs; if it's the same case as ruwiki, I can clean them
[13:13:05] (can you run a check of the size of the table across the cluster? Maybe through backups?)
[13:13:13] sure
[13:13:37] Thanks
[13:14:02] I have observed no backup issue regarding that, but maybe the dbas have specific worries there
[13:14:09] I will get you the stats
[13:16:47] Amir1: what should I search for, a top list of logging tables per wiki?
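The per-wiki list jynus posts next (P16839) was derived from backup data, which is in a "non-ideal format" as he notes; the same ranking could in principle be pulled from a replica's information_schema. A sketch, with a hypothetical host and credentials, using pymysql (note that InnoDB sizes in information_schema are estimates):

```python
#!/usr/bin/env python3
"""Sketch: biggest `logging` tables per wiki on one replica (via pymysql)."""
import pymysql

# One row per wiki database hosting a `logging` table, biggest first.
QUERY = """
SELECT table_schema AS wiki,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS gib
FROM information_schema.TABLES
WHERE table_name = 'logging'
ORDER BY data_length + index_length DESC
LIMIT 20
"""

def main() -> None:
    # Host and credentials are placeholders, not real WMF access details.
    conn = pymysql.connect(host="db1111.eqiad.wmnet",
                           user="watcher", password="...")
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for wiki, gib in cur.fetchall():
            print(f"{wiki}\t{gib} GiB")
    conn.close()

if __name__ == "__main__":
    main()
```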
[13:17:15] biggest logging tables
[13:17:36] and which wikis they belong to
[13:17:56] ok, give me a second, the data is in a non-ideal format
[13:23:20] Thanks
[13:31:07] Amir1: https://phabricator.wikimedia.org/P16839
[13:31:28] nice
[13:31:39] dewiki can be cleaned up, same with plwiki
[13:32:05] I haven't finished ruwiki, I'll do that soon
[13:33:26] I wonder what can be done with commonswiki, probably nothing
[14:00:33] Amir1: I did ruwiki in eqiad
[14:00:49] if we can do some other big big ones, we should do it before we switch back
[14:01:08] marostegui: no, I think it'll be around 10% of it
[14:01:12] not much
[14:01:26] it'll help keep space for growth
[14:01:44] especially since logging, by design, just grows forever
[14:02:41] aaah ok
[14:21:57] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) Case Reference ID: 5357298848 Status: Case is generated and in Progress Subject: HPE ProLiant DL360 Gen10 - DIMM Failed P...
[15:26:56] oh, I have some numbers on the image table cleanup
[15:27:01] Current status:
[15:27:09] https://www.irccloud.com/pastebin/4IlfnyXE/
[15:27:33] 511 is the total size of img_metadata for PDFs, over 2.9M rows
[15:27:39] the rest are percentiles
[15:27:49] This is from when I started the script:
[15:27:52] https://www.irccloud.com/pastebin/rDvSteWW/
[15:28:50] e.g. the median went from 25KB to 13KB. The 99th percentile went from 2.7MB to 2.4MB
[15:30:35] total size went from 636GB to 512GB (uncompressed)
[16:04:27] how much % is done?
[16:04:37] so those numbers are based on the current progress, no?
[16:41:21] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (wiki_willy) a: Cmjohnson→Jclark-ctr
[16:42:54] marostegui: the numbers are the result of querying the db (analytics cluster); with replication and stuff it's a bit outdated, but it's real data, not extrapolation
[16:47:29] when I made the queries it was 15% done
[16:51:00] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) The next step in the productionization of the workers is to set up the account for access to mw content on swift. This is documented at: https://wikitech.wikim...
[18:03:15] DBA, Toolhub, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (bd808) >>! In T271480#7226296, @Marostegui wrote: > multi-dc as being read from both DCs or even written from both DCs? > This database is likely to go to a misc cluster, which isn't ready...
[19:25:31] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) a: Legoktm
[21:53:15] DBA, SRE-tools, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (Legoktm)
[22:44:12] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) Open→Resolved Still needs a new spicerack release, but hopefully finally fixed now :)
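Amir1's percentile figures above (median, p99, and total size of img_metadata for PDFs) could be reproduced with a query over the image table plus client-side math. The actual query he ran against the analytics cluster isn't in the log, so this is only a sketch with placeholder connection details:

```python
#!/usr/bin/env python3
"""Sketch: img_metadata size percentiles for PDFs, computed client-side."""
import statistics

import pymysql

def main() -> None:
    # Connection details are placeholders; Amir1 queried the analytics cluster.
    conn = pymysql.connect(host="db1111.eqiad.wmnet",
                           user="research", password="...",
                           database="commonswiki")
    with conn.cursor() as cur:
        # PDFs are identified by MIME type in MediaWiki's image table.
        cur.execute(
            "SELECT OCTET_LENGTH(img_metadata) FROM image "
            "WHERE img_major_mime = 'application' AND img_minor_mime = 'pdf'"
        )
        sizes = [row[0] for row in cur.fetchall()]  # ~2.9M ints fits in RAM
    conn.close()
    # quantiles(n=100) returns the 99 cut points p1..p99.
    pct = statistics.quantiles(sizes, n=100)
    print(f"rows={len(sizes)} total={sum(sizes) / 1024**3:.0f}GB "
          f"p50={pct[49] / 1024:.0f}KB p99={pct[98] / 1024**2:.1f}MB")

if __name__ == "__main__":
    main()
```

Fetching all lengths and computing percentiles in Python keeps the SQL trivial; at a few million integers that is fine for a one-off report like the one quoted above.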