[02:18:50] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) ` In reference to your Hewlett Packard Enterprise Support Case Number 5357298848, the following Customer Self Repair Par...
[05:02:23] DBA, ops-eqiad: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (Marostegui) p:Triage→Medium Can we get a new disk for this host?
[05:26:59] DBA, Toolhub, User-bd808: Discuss database needs with the DBA team - https://phabricator.wikimedia.org/T271480 (Marostegui) >>! In T271480#7228259, @bd808 wrote: >>>! In T271480#7226296, @Marostegui wrote: >> multi-dc as being read from both DCs or even written from both DCs? >> This database is like...
[07:02:58] DBA, DC-Ops, SRE, ops-eqiad: db1170 mysql process crashed - https://phabricator.wikimedia.org/T286888 (Marostegui) This host was pooled for dumps which is not moved to codfw, so it can potentially cause issues if dumps were about to start. I have depooled it and placed others in s2 and s7 to serv...
[09:01:03] DBA, Infrastructure-Foundations, SRE-tools, Spicerack, Datacenter-Switchover: switchdc should verify active/active DBs are read-write in both datacenters - https://phabricator.wikimedia.org/T287129 (LSobanski) Certainly makes sense. To be sure I understand the expectations, who owns making th...
[09:02:34] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[09:26:14] Data-Persistence-Backup, Infrastructure-Foundations, SRE, bacula, netops: Understand (and mitigate) the backup speed differences between backup1002->backup2002 and backup2002->backup1002 - https://phabricator.wikimedia.org/T274234 (cmooney) Thanks @jcrespo. Yes this makes perfect sense. Due...
[09:26:21] Data-Persistence-Backup, SRE, Goal, Patch-For-Review: Puppetize media backups infrastructure - https://phabricator.wikimedia.org/T276442 (fgiunchedi) >>! In T276442#7228056, @jcrespo wrote: > The next step on productionization of workers is to setup the account for access to mw content on swift....
[11:41:26] DBA, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row B - https://phabricator.wikimedia.org/T286061 (Marostegui)
[12:47:17] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat) >>! In T284825#7154152, @Marostegui wrote: > codfw hosts are now ready to be productionized as the racking and installing task in codfw is done (T282482) > Reminder: set this hosts in...
[12:48:15] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat)
[12:49:07] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat) a:Kormat
[12:54:05] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat) The new pc hosts in codfw are now in service. They're replicating from a blank start, so it will take 3 weeks for them to be populated fully. Once that's done, we can make one or more...
[13:06:21] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat) `/srv` resized on all eqiad hosts: ` (4) pc[1011-1014].eqiad.wmnet ----- OUTPUT of...
[13:50:45] Blocked-on-schema-change, DBA, Dumps-Generation: Schema change for dropping rev_page_id index - https://phabricator.wikimedia.org/T285149 (Marostegui) If we finally want to go ahead with this (if @ArielGlenn find no issues) we should try to do it (at least eqiad) before the switch back, scheduled for...
[14:18:34] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (RobH)
[14:40:51] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin2002 for 1:00:00 4 host(s) and their services with reason: Eqiad row C maintenance ` cp[108...
[14:42:54] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Vgutierrez)
[14:49:27] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (herron)
[14:50:20] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (ops-monitoring-bot) Icinga downtime set by mmandere@cumin2002 for 1:00:00 1 host(s) and their services with reason: Eqiad row C maintenance ` lvs101...
[14:55:40] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (Vgutierrez)
[15:04:04] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (MoritzMuehlenhoff)
[15:04:55] DBA, ops-eqiad: Degraded RAID on db1175 - https://phabricator.wikimedia.org/T287137 (wiki_willy) a:Jclark-ctr
[15:07:50] DBA, Analytics, Infrastructure-Foundations, SRE, and 2 others: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (aborrero)
[15:10:22] DBA, Analytics, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (herron)
[15:15:27] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) Host should be down now @Papaul
[15:17:44] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat) All hosts are now in service.
Including: - sys schema deployed - set to 'active' in netbox
[15:17:58] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat)
[15:18:08] DBA, Patch-For-Review: Productionize pc2011-pc2014 and pc1011-pc1014 - https://phabricator.wikimedia.org/T284825 (Kormat) Open→Resolved
[15:22:10] DBA, Analytics, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (aborrero)
[15:23:06] DBA, Analytics, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney)
[15:24:36] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) Looking good: ` $ free -g total used free shared buff/cache available Mem:...
[15:30:40] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (jcrespo) I doubled confirmed all dimms "Good, In use". Thank you, @Papaul for the quick response! ` PROC 1 DIMM 3 Good, In Use...
[15:30:50] DBA, Data-Persistence-Backup, SRE, database-backups, ops-codfw: db2097@s1 got killed due to hardware memory corruption - https://phabricator.wikimedia.org/T287072 (Papaul) Open→Resolved Return DIMM information {F34560100}
[15:32:32] Heads up- I restored db2097 s1 from backup taken a few hours before it crashed
[15:33:25] if you did any schema change between Jul 20 20:00 and Jul 21 11:34 without using replication, it could have been lost
[15:33:30] (on s1)
[15:36:25] I didn't do any
[15:36:34] kormat: ^?
[15:37:14] I expect nothing was ongoing then, but you know I prefer to at least warn here to prevent mistakes
[15:39:46] i didn't either.
but i can also trivially check
[15:41:56] worst case, it'll show up in the drift reports
[15:42:31] and then we'll have to talk to you, amir1
[15:42:34] yeah, definitely worst-case
[15:42:37] not sure if worth it, but if at some point all drifts are corrected, maybe we could add some tracking like ops.schema_version {id}
[15:43:19] I think that's something to add on the tool kormat promised to have next week to handle schema changes :-D
[15:43:31] next week? wohooo
[15:44:12] jynus: I don't know if you've seen this https://drift-tracker.toolforge.org/report/core/
[15:44:36] I need to rerun it, i didn't do it during the switchover
[15:45:33] Amir1, is that done looking at wikireplicas, or mw hosts?
[15:46:05] only main group, not cloud replicas or analytics
[15:46:57] so db2097 is not in mediawiki config, not sure if it would be checked, let me see
[15:47:35] this is eqiad only
[15:47:54] ok, let me check the eqiad equivalent of db2097
[15:48:57] that would be db1139
[15:49:56] yeah, I think it wouldn't be checked
[15:50:17] as I imagine you get the list from mw (which is completely fine)
[15:51:58] marostegui, if you are still around, did you see the link I sent to K and L a couple of days ago about orchestrator?
[15:52:11] jynus: he has, yes
[15:52:12] jynus: yes, I am aware
[15:52:15] ah, ok
[16:07:52] DBA, ops-codfw: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (Marostegui)
[16:08:10] DBA, ops-codfw: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (Marostegui) p:Triage→Medium
[16:10:25] DBA, ops-codfw: db2091 memory errors - https://phabricator.wikimedia.org/T287182 (Papaul)
[17:02:15] DBA, Analytics, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) All went very well with the change, this time I ran rapid ping from the CR to see if any packet loss was observed, and did detect some loss,...
[17:02:31] DBA, Analytics, Infrastructure-Foundations, SRE, netops: Switch buffer re-partition - Eqiad Row C - https://phabricator.wikimedia.org/T286065 (cmooney) Open→Resolved