[05:01:28] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui) I have switched m3-master from dbproxy1020 to dbproxy1016: https://gerrit.wikimedia.org/r/705789
[05:02:00] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Marostegui)
[05:12:28] Blocked-on-schema-change, DBA, MW-1.37-notes (1.37.0-wmf.12; 2021-06-28): Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620 (Marostegui)
[05:13:06] Blocked-on-schema-change, DBA, MW-1.37-notes (1.37.0-wmf.12; 2021-06-28): Schema change for renaming several indexes in logging table - https://phabricator.wikimedia.org/T270620 (Marostegui) Open→Stalled All of eqiad is done - waiting for the switch back.
[05:17:28] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (Marostegui) @Jclark-ctr did this disk arrive?
[05:55:34] DBA, Toolhub, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) >>! In T271480#7225145, @bd808 wrote: > >> * Will you be deploying the application in one or both DCs? > > This is a great question that I do not know the answer to definitive...
[07:05:30] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[08:33:55] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[08:36:08] DBA, Analytics, Infrastructure-Foundations, SRE, and 3 others: Switch buffer re-partition - Eqiad Row D - https://phabricator.wikimedia.org/T286069 (cmooney) Open→Resolved
[08:51:33] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:52:41] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[08:52:51] kormat: any maintenance ongoing on s6 or expected today?
[08:52:59] I want to start playing with the wikitech migration
[08:53:19] marostegui: nothing you need to worry about 😇
[08:53:29] good! thank you
[08:54:02] Blocked-on-schema-change, DBA, AbuseFilter: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 (Kormat)
[08:54:53] Blocked-on-schema-change, DBA, AbuseFilter: Rename AbuseFilter indexes for consistency - https://phabricator.wikimedia.org/T281058 (Kormat) Open→Stalled Stalling this until we switch back to eqiad.
[08:55:05] marostegui: all the schema changes we had scheduled are now done in eqiad
[08:55:13] \o/
[08:55:16] great news
[08:55:37] some of them may have to be re-done on the 2-3 failed hosts we have right now, but that's it
[08:56:17] they were not done via replication?
[08:56:33] hard to replicate when the host is dead, etc.
[08:56:48] ah. well, it depends on whether we need to restore the hosts from backups or not
[08:57:02] but backups would have the change too, no?
[08:57:06] e.g. we have 2 hosts with memory corruption. we probably want to at least do a full data check on them when they're fixed
[08:57:12] oh yeah, definitely
[08:57:13] marostegui: depends when the hosts are back/backups were run/etc
[08:57:18] true
[08:57:59] i have scripts for every schema change i ran, so it's trivial to check/re-run them if necessary
[08:58:18] cooool
[08:59:10] i think i'm at a place now where i could write a python program to handle most schema changes
[09:01:09] DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (cmooney)
[09:01:22] one more mariadb server that crashed: db2097@s1
[09:01:28] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:02:13] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (cmooney)
[09:04:50] jynus: oh, ow. another hardware failure
[09:05:02] :-(
[09:05:12] is there a pattern, as in same batch?
[09:05:22] "i think i'm at a place now where i could write a python program to handle most schema changes" -> <3<3
[09:05:31] db2097 has been having hardware failures since the 19th
[09:06:15] what is your recommendation regarding recovery, should I recover from an older backup?
[09:06:27] older than the 19th?
[09:06:42] and to a different server?
[09:07:31] jynus: i don't know if the memory failure can directly cause corruption. i'd run `mysqlcheck -A` and see what that says
[09:08:25] I don't mind being extra safe, even if it is more work for me, given it is the backup source
[09:08:43] last thing I want is corruption to "infect" other hosts
[09:09:13] however, that is a stretch host, and we have had a buster one ready for a long time
[09:09:59] let me handle the hw issue and I will ask for your thoughts on how to proceed, once I see the options
[09:10:25] I am going to start working on s6 eqiad, expect lag there
[09:15:28] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo)
[09:15:41] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) p: Triage→High a: jcrespo
[09:22:36] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) The s2 mariadb log on the same host seems clean. These are some weird hw logs: ` /map1/log1/record352 Targets Properties number=352 s...
[09:29:35] DBA, Data-Persistence-Backup, database-backups: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) From the web interface: ` "ID","Severity","Class","Description","Last Update","Count","Category", "85","Critical","CPU","Uncorrectable Machine...
[09:53:40] DBA, Data-Persistence-Backup, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) a: jcrespo→Papaul As expected, the faulty memory module is only properly detected on reboot. ` free -g tota...
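For context on kormat's point above (a script per schema change, trivially re-runnable per host), such a runner might look as follows. This is a minimal sketch, not WMF's actual tooling: the host name, database, and depool/repool steps are hypothetical placeholders (production would go through dbctl and wait for replication between hosts), and the ALTER is modeled on the logging-table index renames tracked in T270620.

```python
#!/usr/bin/env python3
"""Minimal sketch of a per-host schema-change runner (hypothetical tooling)."""
import subprocess

# Illustrative change, modeled on the logging index renames in T270620.
ALTER = ("ALTER TABLE logging DROP INDEX type_time, "
         "ADD INDEX log_type_time (log_type, log_timestamp)")

def run_sql(host: str, db: str, sql: str) -> str:
    """Run a statement on one host via the mysql CLI; raises on failure."""
    return subprocess.run(
        ["mysql", "-h", host, "-N", db, "-e", sql],
        check=True, capture_output=True, text=True,
    ).stdout

def applied(host: str, db: str) -> bool:
    """True if the new index already exists, making re-runs safe to skip."""
    out = run_sql(host, db,
                  "SHOW INDEX FROM logging WHERE Key_name = 'log_type_time'")
    return bool(out.strip())

def main() -> None:
    hosts = ["db1111.eqiad.wmnet"]  # hypothetical host list for one section
    for host in hosts:
        if applied(host, "enwiki"):
            print(f"{host}: already applied, skipping")
            continue
        # A real runner would depool here (dbctl), wait, then repool after.
        run_sql(host, "enwiki", ALTER)
        print(f"{host}: done")

if __name__ == "__main__":
    main()
```

The idempotency check is what makes it "trivial to check/re-run" on the failed hosts mentioned above: re-running the whole script skips hosts where the change already landed.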
[10:22:25] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (hnowlan)
[10:27:41] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[10:28:27] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (MoritzMuehlenhoff)
[10:37:10] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) We need to fix the craziness of the current partitioning on all servers: ` # cumin 'P:mediabackup::storage' 'lsblk -b /dev/sdc' 8 hosts will be targeted:...
[10:41:11] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) There is an initial grafana dashboard, but it will need a lot of work; it is almost unusable for now (not sure if because of the lack of activity, the m...
[11:15:05] DBA, Wikimedia-General-or-Unknown: Properly delete https://su.wikiquote.org/wiki/MédiaWiki:Enotif_body - https://phabricator.wikimedia.org/T286185 (Aklapper) Not sure who could look into this; adding #DBA and feel free to remove if I'm wrong. In short, https://su.wikiquote.org/wiki/MédiaWiki:Enotif_body?...
[12:50:23] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) @jcrespo I will request that HP send us a new DIMM
[13:01:48] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) Thank you!
[13:05:59] DBA, Infrastructure-Foundations, SRE, Traffic, and 2 others: Switch buffer re-partition - Eqiad Row A - https://phabricator.wikimedia.org/T286032 (fgiunchedi)
[13:06:53] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (fgiunchedi)
[13:11:53] marostegui: jynus: the migration of the image table is now 15% done; I expect it to finish in two weeks.
[13:12:29] btw, if there are really large logging tables, let me know and I'll cross-check with flaggedrevs; if it's the same case as ruwiki, I can clean them
[13:13:05] (can you run a check of the size of the table across the cluster? Maybe through backups?)
[13:13:13] sure
[13:13:37] Thanks
[13:14:02] I have observed no backup issue regarding that, but maybe the dbas have specific worries there
[13:14:09] I will get you the stats
[13:16:47] Amir1: what should I search for, a top list of logging tables per wiki?
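The per-wiki list jynus posts next (P16839) was derived from backup data, which is in a "non-ideal format" as he notes; the same ranking could in principle be pulled from a replica's information_schema. A sketch, with a hypothetical host and credentials, using pymysql (note that InnoDB sizes in information_schema are estimates):

```python
#!/usr/bin/env python3
"""Sketch: biggest `logging` tables per wiki on one replica (via pymysql)."""
import pymysql

# One row per wiki database hosting a `logging` table, biggest first.
QUERY = """
SELECT table_schema AS wiki,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS gib
FROM information_schema.TABLES
WHERE table_name = 'logging'
ORDER BY data_length + index_length DESC
LIMIT 20
"""

def main() -> None:
    # Host and credentials are placeholders, not real WMF access details.
    conn = pymysql.connect(host="db1111.eqiad.wmnet",
                           user="watcher", password="...")
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for wiki, gib in cur.fetchall():
            print(f"{wiki}\t{gib} GiB")
    conn.close()

if __name__ == "__main__":
    main()
```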
[13:17:15] biggest logging tables
[13:17:36] and which wikis they belong to
[13:17:56] ok, give me a second, the data is in a non-ideal format
[13:23:20] Thanks
[13:31:07] Amir1: https://phabricator.wikimedia.org/P16839
[13:31:28] nice
[13:31:39] dewiki can be cleaned up, same with plwiki
[13:32:05] I haven't finished ruwiki, I'll do that soon
[13:33:26] I wonder what can be done with commonswiki, probably nothing
[14:00:33] Amir1: I did ruwiki in eqiad
[14:00:49] if we can do some other big big ones, we should do it before we switch back
[14:01:08] marostegui: no, I think it'll be around 10% of it
[14:01:12] not much
[14:01:26] it'll help keep space for growth
[14:01:44] especially since logging, by design, just grows forever
[14:02:41] aaah ok
[14:21:57] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) Case Reference ID: 5357298848 Status: Case is generated and in Progress Subject: HPE ProLiant DL360 Gen10 - DIMM Failed P...
[15:26:56] oh, I have some numbers on the image table cleanup
[15:27:01] Current status:
[15:27:09] https://www.irccloud.com/pastebin/4IlfnyXE/
[15:27:33] 511 is the total size of img_metadata for PDFs, over 2.9M rows
[15:27:39] the rest are percentiles
[15:27:49] This is from when I started the script:
[15:27:52] https://www.irccloud.com/pastebin/rDvSteWW/
[15:28:50] e.g. the median went from 25KB to 13KB. The 99th percentile went from 2.7MB to 2.4MB
[15:30:35] total size went from 636GB to 512GB (uncompressed)
[16:04:27] how much % is done?
[16:04:37] so those numbers are based on the current progress, no?
[16:41:21] DBA, SRE, ops-eqiad: Degraded RAID on db1129 - https://phabricator.wikimedia.org/T285715 (wiki_willy) a: Cmjohnson→Jclark-ctr
[16:42:54] marostegui: the numbers are the result of querying the db (analytics cluster); with replication and stuff it's a bit outdated, but it's real data, not extrapolation
[16:47:29] when I made the queries it was 15% done
[16:51:00] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (jcrespo) The next step in the productionization of the workers is to set up the account for access to mw content on swift. This is documented at: https://wikitech.wikim...
[18:03:15] DBA, Toolhub, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (bd808) >>! In T271480#7226296, @Marostegui wrote: > multi-dc as being read from both DCs or even written from both DCs? > This database is likely to go to a misc cluster, which isn't ready...
[19:25:31] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) a: Legoktm
[21:53:15] DBA, SRE-tools, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (Legoktm)
[22:44:12] DBA, SRE, Datacenter-Switchover, Patch-For-Review: Figure out how x2 should be handled in DC switchover - https://phabricator.wikimedia.org/T285519 (Legoktm) Open→Resolved Still needs a new spicerack release, but hopefully finally fixed now :)
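Amir1's percentile figures above (median, p99, and total size of img_metadata for PDFs) could be reproduced with a query over the image table plus client-side math. The actual query he ran against the analytics cluster isn't in the log, so this is only a sketch with placeholder connection details:

```python
#!/usr/bin/env python3
"""Sketch: img_metadata size percentiles for PDFs, computed client-side."""
import statistics

import pymysql

def main() -> None:
    # Connection details are placeholders; Amir1 queried the analytics cluster.
    conn = pymysql.connect(host="db1111.eqiad.wmnet",
                           user="research", password="...",
                           database="commonswiki")
    with conn.cursor() as cur:
        # PDFs are identified by MIME type in MediaWiki's image table.
        cur.execute(
            "SELECT OCTET_LENGTH(img_metadata) FROM image "
            "WHERE img_major_mime = 'application' AND img_minor_mime = 'pdf'"
        )
        sizes = [row[0] for row in cur.fetchall()]  # ~2.9M ints fits in RAM
    conn.close()
    # quantiles(n=100) returns the 99 cut points p1..p99.
    pct = statistics.quantiles(sizes, n=100)
    print(f"rows={len(sizes)} total={sum(sizes) / 1024**3:.0f}GB "
          f"p50={pct[49] / 1024:.0f}KB p99={pct[98] / 1024**2:.1f}MB")

if __name__ == "__main__":
    main()
```

Fetching all lengths and computing percentiles in Python keeps the SQL trivial; at a few million integers that is fine for a one-off report like the one quoted above.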