[07:38:09] good morning marostegui
[07:38:14] o/
[07:38:53] let me know if you need me on something
[07:39:22] you have enough things on your plate!
[08:27:04] Pah. In theory I can get my COVID booster moved sooner, but there's no availability locally at all.
[08:30:20] marostegui: 🥺 https://gerrit.wikimedia.org/r/c/operations/puppet/+/742675/
[08:41:02] i commented btw
[08:48:44] marostegui: so given that the localhost user is already missing in some replicas and nothing has been reported broken (for years), can I just move forward with dropping the user in a full section (and next week dropping it from everywhere)?
[08:52:01] +1
[08:59:41] Amir1: I would choose s6
[08:59:49] sure
[08:59:58] as there are only 4 wikis there so we can narrow things down (frwiki, jawiki, ruwiki and wikitech)
[09:27:35] I bet you're already aware, but just in case: https://mariadb.org/fosdem-filming-party/
[09:30:21] hahaha yeah
[09:34:07] Gosh I hate this weather
[09:34:12] grabbing coffee now
[09:38:42] Amir1: same here, we are only having 16C today :( pretty cold
[09:41:06] "only 16C"
[09:41:14] me: -10C
[09:41:31] marostegui: 🤬
[09:42:01] I HAVE TO WEAR TWO WINTER JACKETS
[09:42:40] it was -4 here on Sunday night, but it's about +10 today, which is fine (especially given central heating :)
[09:44:57] majavah: I've never been to Finland but I've been to Estonia in December, that was... interesting
[09:45:23] I also remember a massive river in Riga (Latvia) was completely frozen at that time
[09:46:08] majavah: hahahahaha
[09:47:09] * Emperor went to Finland (Inari) in February a few years ago. That was fun :) [if not 100% toasty warm at all times]
[09:57:37] here most houses do not have central heating
[10:02:40] Amir1: the chemical enum ticket requires first merging the patch you sent and then the new schema change patch?
[10:02:54] to include the default unknown?
[10:16:19] back
[10:16:58] marostegui: it's a bit complicated, tldr is that we don't need that patch for production.
That patch is for when update.php was being run automatically (like beta cluster) and removed the default
[10:17:15] new systems should still have the default
[10:17:21] let me double check to confirm
[10:18:13] yep, they do, just checked amiwiki which is a recently created wiki
[10:18:24] the schema change patch is wrong though
[10:18:33] yeah, update.php code is a different path than the create-wiki code
[10:18:55] yeah, but the ALTER itself: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/135756/13/maintenance/archives/patch-img_major_mime-chemical.sql
[10:18:59] is that ok to leave as it is?
[10:19:04] create wiki just grabs the content of tables.sql and builds that
[10:19:26] so the alter has been removed because it's too old (we don't support upgrades from 1.29 or older)
[10:20:28] the patch to fix mistakes caused by running the alter is waiting to be merged (but it's also buggy because of T296615)
[10:20:28] T296615: MySQLField::defaultValue() returns empty string all the time - https://phabricator.wikimedia.org/T296615
[10:20:53] Ah I see
[10:21:02] So the above patch is basically not useful
[10:21:04] if I confused you more, more than happy to jump on a call and explain in depth
[10:21:11] no no, I get it
[10:21:21] it's useful, the problem is that it's too useful :D
[10:21:42] it just doesn't care, it tries to set the default every damn time update.php is run
[10:21:44] I was trying to understand whether I should wait for a new patch to be merged (like we do with any schema change) or if I could proceed with the schema change (but adding the default manually myself)
[10:22:07] from the mw viewpoint the latter is okay
[10:22:51] especially given that it doesn't change the status quo
[10:23:37] so the alters I added at https://phabricator.wikimedia.org/T277354 look good then?
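The agreed approach above is to run the schema change but add the DEFAULT manually rather than wait for the buggy MediaWiki patch. As an illustration only, here is a sketch of building such an ALTER; the enum member list and column names below are assumptions about MediaWiki's image table, not taken from the linked gerrit change:

```python
# Hypothetical sketch: generate the ALTER for adding 'chemical' to the
# img_major_mime enum while keeping an explicit DEFAULT, so update.php
# finds nothing to "fix". The member list is assumed, not authoritative.

MIME_MAJOR_TYPES = [
    "unknown", "application", "audio", "image", "text",
    "video", "message", "model", "multipart", "chemical",
]

def build_alter(table: str = "image", column: str = "img_major_mime") -> str:
    """Return an ALTER TABLE statement with the default set explicitly."""
    members = ", ".join(f"'{m}'" for m in MIME_MAJOR_TYPES)
    return (
        f"ALTER TABLE {table} MODIFY {column} "
        f"ENUM({members}) NOT NULL DEFAULT 'unknown';"
    )

print(build_alter())
```

Keeping the DEFAULT in the statement is what makes the change a no-op from update.php's point of view, which is why it "doesn't change the status quo".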
[10:23:44] or this: https://phabricator.wikimedia.org/T277354#7536457
[10:26:55] it looks good to me
[10:27:00] goooood
[10:27:16] I will proceed then
[13:50:53] ugh, hardware RAID is so hateful
[13:51:31] I thought that too until I got to work with mdadm
[13:51:44] Then HW raid and myself became bff
[13:53:30] At least with mdadm I feel like I have a fighting chance of knowing which disk is kaput
[13:53:40] haha
[13:54:04] why? usually it's fairly explicit
[13:54:48] but you know... it's "hard"ware
[13:54:52] I have sometimes seen disks not being detected as bad, but with very high media errors, and as soon as you mark them as bad manually the performance goes back to normal
[13:55:22] or a disk and slot disappear from the array
[13:55:30] you can recognize it just by the absence of it
[13:55:54] but usually that does show up in the controller's log
[13:56:25] yes but when you run our get_* scripts they are not marked there as failed, because they're missing
[13:56:37] ah yeah
[13:59:01] volans: the mapping of "/dev/sdr" to a physical device is non-obvious - I can find a device in megacli output with an increased Media Error Count (so I'm happy this is the sad device), and can then use the Device ID to ask smartctl to find me the Serial number (which is present in Inquiry Data, but unhelpfully concatenated with the model number).
But if I want to double-check this is right I then have to compare the DiskGroup with /dev/disk/by-path entries, and this is all a total faff (and feels very error-prone)
[14:01:11] JBOD and mdadm (or Ceph, which just uses the disk directly) make this much more straightforward
[14:02:09] got it
[14:03:08] you're referring specifically to the swift case
[14:03:49] where we do basically JBOD, but in a way that creates a virtual drive per drive, with 1 physical disk each
[14:03:58] Mmm
[14:05:22] in all cases where there is some redundancy you usually don't care how the virtual disk is mapped on the OS, as you just hot swap the failed disk and you just need to know which one it is at the physical layer, either by position or by making it blink for DCOps
[14:05:51] "stare at the disks. whichever blinks first is the culprit"
[14:06:46] "fond" memories of $VENDOR kit at the last place, where failed disks would typically vanish from the OS (i.e. the /dev/sdX entry would go), meaning you couldn't then light their failure LED
[14:07:28] ...so we had a shell rune for "light up every other LED, ask the DC team to pull the sole unlit drive"
[14:08:28] rotfl
[14:08:31] that's sad!
[14:14:31] jynus: i was just looking at that, but you got there first :) re: https://phabricator.wikimedia.org/T254646#7537467
[14:14:43] :-)
[15:51:53] db1139:s1 caught up - I will depool db1163 again to make db1139:s1 a direct replica in a safe way. I wonder if I should trust my code or do it manually? @marostegui, did you run into issues with move-replica (I don't use it frequently)?
[15:52:34] should be fine jynus
[15:52:58] cool. I will still do it with everything depooled just to be extra safe
[15:53:01] you mean our own move-replica? not orchestrator's one, right?
[15:53:17] db-move-replica on cumin - I think that is the one?
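The "find the sad device" chore Emperor describes above (scan megacli output for the drive with media errors, then pull its Device Id and the serial buried in Inquiry Data) can be sketched roughly like this. The sample text is invented for illustration and only mimics a few fields of `megacli -PDList` style output; real output has many more fields and the parsing would need to be more defensive:

```python
# Toy parser for megacli-PDList-like output: collect per-drive records and
# flag any drive with a non-zero Media Error Count. The sample below is
# fabricated (slot numbers, Device Ids, and serials are made up).

SAMPLE = """\
Enclosure Device ID: 32
Slot Number: 17
Device Id: 41
Media Error Count: 0
Inquiry Data: SEAGATE ST4000NM0023 Z1Z8XYZ1
Enclosure Device ID: 32
Slot Number: 18
Device Id: 42
Media Error Count: 513
Inquiry Data: SEAGATE ST4000NM0023 Z1Z8ABC2
"""

def suspect_drives(text):
    """Return the per-drive field dicts for drives reporting media errors."""
    drives, current = [], {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
        if key.strip() == "Inquiry Data":  # last field of each record here
            drives.append(current)
            current = {}
    return [d for d in drives if int(d.get("Media Error Count", "0")) > 0]

for d in suspect_drives(SAMPLE):
    # Inquiry Data concatenates vendor, model, and serial, as lamented above
    print(d["Device Id"], d["Slot Number"], d["Inquiry Data"])
```

This only automates the first half of the faff; cross-checking the DiskGroup against /dev/disk/by-path would still be manual.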
[15:53:25] yeah should be fine
[15:53:29] no issues
[15:53:46] thanks, I think it's been years since I used that so I am super rusty
[15:55:31] BTW it took >2 days to recover a dump now - it was done remotely and with less memory, but it has certainly grown from previous runs
[15:58:42] jynus: check the logs because replication broke due to grants coming from above, so that might have extended the replication catch-up. I am not sure how long it was broken for, but double check it
[15:59:03] I fixed it but I'm not sure how long it was broken for
[15:59:08] ah, I saw mentions of that, but I was counting the myloader part only
[15:59:23] the lag catch-up also took a lot of time
[15:59:35] ah ok then :)
[16:01:57] I will be doing a quite thorough comparison later anyway, as the plan is to make this the canonical place to back up enwiki
[16:02:07] sounds good
[16:19:23] going to reboot ms-be2059 as the replacement drive came back as sdaa not sdr
[16:19:38] https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_rings says to reboot in this case.
[16:29:45] * Emperor lols at Applying configuration version '(d8fbd3c38f) Jbond - Revert "Revert "Revert "Revert "mx2001: disable ldap validation""""'
[16:29:56] This Puppet is Reverting
[16:30:27] he he, someone held the record for reverts at some point, can't remember who
[16:32:38] and did I tell you the story of when I got the award of "most active open source contributor" in my country or region, by doing pools and repools of dbs (0 net changes), back when it was managed in git?
[16:34:56] I had not heard that one :D nice
[16:36:26] people tried to thank me and spin it as being a good community member, and I, knowing it was not a great system, said please don't, I hate doing it this way (rather than with a dynamic config) and it has 0 positive impact on open source!
[16:50:37] new drive filling OK.
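The "thorough comparison" mentioned above is typically done by checksumming row ranges on both servers and comparing digests, rather than diffing whole tables. A minimal sketch of that idea, with in-memory lists standing in for the live tables (real tooling such as pt-table-checksum works against running servers and handles chunk boundaries, collation, and NULLs properly):

```python
# Toy chunked-checksum comparison: hash rows in fixed-size primary-key
# chunks on both sides; only mismatched chunks need row-by-row inspection.

import hashlib

def chunk_digests(rows, chunk_size=2):
    """One digest per chunk, assuming rows are ordered by primary key."""
    digests = []
    for i in range(0, len(rows), chunk_size):
        h = hashlib.sha256()
        for row in rows[i:i + chunk_size]:
            h.update(repr(row).encode())
        digests.append(h.hexdigest())
    return digests

# Fabricated example data: the backup diverges in one row.
source = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]
backup = [(1, "a"), (2, "b"), (3, "x"), (4, "d")]

mismatched = [
    i for i, (s, b) in enumerate(zip(chunk_digests(source),
                                     chunk_digests(backup)))
    if s != b
]
print(mismatched)  # chunk 1 (rows 3-4) differs
```

The payoff is that for a table the size of enwiki's you transfer and compare a handful of digests per chunk instead of the rows themselves.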
[17:14:26] everything looking fine on db1163/db1139, repooling the host to go back to the normal state
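The repool step above only happens once the replica has fully caught up. As an illustrative safety check (not the actual production tooling, which drives this through config changes), a gate on a few fields of MySQL's SHOW SLAVE STATUS output might look like:

```python
# Hypothetical repool gate: only return True when both replication threads
# are running and lag is at or below the threshold. The status dicts below
# are fabricated samples mimicking SHOW SLAVE STATUS fields.

def safe_to_repool(status, max_lag=0):
    lag = status.get("Seconds_Behind_Master")
    return (
        status.get("Slave_IO_Running") == "Yes"
        and status.get("Slave_SQL_Running") == "Yes"
        and lag is not None          # NULL lag means replication is broken
        and lag <= max_lag
    )

caught_up = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes",
             "Seconds_Behind_Master": 0}
broken = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "No",
          "Seconds_Behind_Master": None}

print(safe_to_repool(caught_up), safe_to_repool(broken))  # True False
```

The `Seconds_Behind_Master is None` case matters here: as noted earlier in the log, replication had actually broken for a while, and a naive "lag == 0" check can misread a stopped replica as healthy.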