[07:01:53] as a fyi, s1 switchover is postponed due to T350656
[07:01:54] T350656: dbconfig bug - "2 instances found for query ..." - https://phabricator.wikimedia.org/T350656
[08:05:52] arnaudb: have you tried to set the scope?
[08:06:04] yes with and without
[08:06:22] also Amir1 checked and found nothing obvious
[08:13:38] arnaudb: it works for me on both cumin1001 and cumin2002...
[08:17:04] commented on task
[08:20:38] wtf x) thanks for checking out volans
[08:20:58] I did not check on etcd at the time
[08:21:56] I wonder if there was any object mangling by another script that might have created a potentially matching object... not sure though if that's at all possible using dbctl (it probably checks itself before creating it)
[08:22:54] let us know if this happens again or if you can repro it running some higher level script
[08:23:05] if you have some leads to suggest in case it happens again thursday, I'd be glad to check them out
[08:24:56] also conftool searches on etcd with "match regex ^db2103$" so I doubt it could match anything else...
[08:25:54] i'm puzzled!
[08:36:14] it's not something to do with the weight setting? [wild wild guess, but that's the difference in the runes in the bug report that do and don't work]
[08:36:48] i.e. a.rnaudb was setting to 0 and v.olans was setting to 295/300
[08:37:44] Emperor: it could be but it's weird as I used the same command just last week with marostegui when we did s4 🤔
[08:38:08] it's the get() that's failing because it finds 2 objects
[08:38:13] has nothing to do with the weight
[08:39:01] if you tried with --scope too, that also excludes that for some reason an object for this db was present also in the eqiad tree
[08:39:41] my first guess was a "naming mistake" yeah
[08:39:55] the only thing I did not check and afaik neither did Amir1 is etcd
[08:43:53] arnaudb: how many times have you tried the set-weight? and when have you tried the 'get' in relation to the set-weight?
[08:44:09] because IMHO the 'get' should have failed too...
[08:44:31] get worked, we both tried a few times
[08:44:41] edit worked as well
[08:46:33] after the get worked the set-weight was still failing?
[08:46:41] which cumin host did you run the commands from?
[08:51:11] commands were run from cumin1001 in sudo from my user account, I tried the set-weight before and after indeed
[08:59:28] weird... right now I can't come up with a logical explanation... let's see if j.oe has some additional ideas on the task
[09:00:25] is it possible that etcd was in the red zone at this moment and returned an exception that was interpreted as 2 results for dbctl?
[09:01:08] it's the only caveat I'm seeing, I did not perform any write with dbctl edit, only reads
[09:01:54] that's what failed: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/conftool/+/refs/heads/master/conftool/extensions/dbconfig/entities.py#87
[09:01:59] line 90 there
[09:02:30] and if you follow the code
[09:02:31] query["name"] = re.compile("^{}$".format(name))
[09:02:59] so even without scope that should be unique and you tried with scope too
[09:05:42] could we try to output `results` to see what's returned in that exception?
[09:06:14] if we manage to repro it... :D
[09:07:52] I'm counting on my usual luck on this to cross paths with this issue again :D
[09:08:01] rotfl
[09:08:11] that's the spirit!
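[editor's note] The failure mode discussed above can be sketched roughly as follows. This is only an illustration, not conftool's actual code: the `INSTANCES` list, the `find`/`get` helpers and the stray eqiad entry are all hypothetical stand-ins, and only the anchored-regex line is taken from the chat. It shows how an exact-name regex can still return two objects if the same name exists in two datacenter trees, and how including `results` in the exception would make that visible.

```python
import re

# Hypothetical in-memory stand-in for the etcd-backed objects.
# The real conftool/dbctl data model is NOT reproduced here.
INSTANCES = [
    {"name": "db2103", "datacenter": "codfw"},
    {"name": "db2103", "datacenter": "eqiad"},  # assumed stray duplicate
    {"name": "db2104", "datacenter": "codfw"},
]


def find(name, scope=None):
    """Return every object whose name matches exactly, optionally limited to one DC."""
    # Same anchored pattern as quoted in the chat: it matches the name exactly,
    # but it says nothing about which datacenter tree the object lives in.
    pattern = re.compile("^{}$".format(name))
    return [
        obj
        for obj in INSTANCES
        if pattern.match(obj["name"])
        and (scope is None or obj["datacenter"] == scope)
    ]


def get(name, scope=None):
    """Return the single matching object, None if absent, or raise if ambiguous."""
    results = find(name, scope)
    if len(results) > 1:
        # Putting `results` in the message shows exactly which objects collided.
        raise ValueError(
            "{} instances found for query name={!r}: {}".format(
                len(results), name, results
            )
        )
    return results[0] if results else None
```

With this toy data, `get("db2103")` raises and lists both colliding objects, while `get("db2103", scope="codfw")` returns a single one.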
[09:08:35] will send you a patch :) thank you volans
[11:50:10] We have this warning: Last dump for m3 at codfw (db2160) taken on 2023-11-07 02:37:33 is 83 GiB, but the previous one was 92 GiB, a change of -9.4 %
[11:54:16] it seems to be mostly phabricator_file.file_storageblob
[12:38:09] volans: for me, everything failed without scope but only edit-weight failed with scope too
[12:55:20] The thing is volans, that is the future master of enwiki, we really need to figure out what's going on otherwise things can break really badly
[13:16:41] Amir1: so even a simple "dbctl instance $name get" was failing?
[13:16:50] yup
[13:17:40] I'm running a get on all instances to see if I can repro..
[13:18:20] at this point you probably want to get conftool maintainers involved ;)
[14:08:20] I have raised the priority of that task to high
[14:11:02] I've added a possible explanation
[14:12:01] just saw it volans, that's indeed a possible explanation
[14:12:31] what's the status of dbctl now? is it only broken for that instance?
[14:26:32] no AFAICT it works with all
[14:26:35] including that one
[14:27:10] I think because arnau.db might have deleted the eqiad instance at some point during the debugging
[14:27:39] so AFAICT the current status is all is working and as long as the instances are unique across all DCs all works fine
[14:28:06] to make -s/--scope work on most instance methods we need to fix dbctl as it has a bug that doesn't pass the scope for writing actions
[14:28:17] (and probably doesn't test for it either)
[14:36:37] right
[14:36:55] what should be the next steps?
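[editor's note] A rough sketch of the bug described above, where the scope is honoured on reads but not passed through on write actions. Everything here is hypothetical (the `STORE` dict, `lookup`, and the two `set_weight_*` functions are invented for illustration and are not dbctl's real API); it only demonstrates why a scoped read can succeed while a scoped write still hits the "2 instances" error.

```python
# Hypothetical per-datacenter store keyed by (datacenter, name).
# This is an assumption for illustration, not dbctl's real data model.
STORE = {
    ("codfw", "db2103"): {"weight": 300},
    ("eqiad", "db2103"): {"weight": 0},  # assumed stray duplicate
}


def lookup(name, scope=None):
    """Resolve a name to exactly one (datacenter, name) key or raise."""
    matches = [
        (dc, n) for (dc, n) in STORE if n == name and scope in (None, dc)
    ]
    if len(matches) != 1:
        raise LookupError(
            "{} instances found for {!r} (scope={})".format(
                len(matches), name, scope
            )
        )
    return matches[0]


def get(name, scope=None):
    """Read path: forwards the scope, so a scoped read is unambiguous."""
    return STORE[lookup(name, scope)]


def set_weight_buggy(name, weight, scope=None):
    """Write path with the bug: scope is accepted but never forwarded."""
    key = lookup(name)  # scope silently dropped here
    STORE[key]["weight"] = weight


def set_weight_fixed(name, weight, scope=None):
    """Write path that forwards the scope to the lookup."""
    key = lookup(name, scope)
    STORE[key]["weight"] = weight
```

In this toy setup `get("db2103", scope="codfw")` works, `set_weight_buggy("db2103", 0, scope="codfw")` raises, and `set_weight_fixed` updates only the codfw entry. A test that exercises a write with a scope against duplicated names would catch the difference.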
[14:37:28] we probably want to implement something to prevent those situations
[14:39:55] just fixing the -s/--scope so that it actually works would already prevent it from being a blocker in any situation
[14:41:05] the mapping between dc and hostnames is very wmf-specific so not sure if we want to add it there or not (I don't recall how much dbctl is wmf-specific or not)
[14:41:27] but that's all IMHO, I'll leave it to the conftool maintainer for the final decision ;)
[14:41:39] *maintainers
[14:42:23] Right, let's coordinate on that task
[14:42:27] Thanks for the help volans
[14:42:31] de nada
[15:11:16] Amir1: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/972393
[15:11:19] you okay with that?
[15:11:33] let me check
[15:12:21] it is just to reboot pc1012
[15:13:20] From what I'm seeing it's getting replication from the original pc2 so keys shouldn't be displaced
[15:13:24] that's good
[15:13:52] yeah
[15:13:54] I moved it last week
[15:13:56] to keep it warmed
[15:14:03] once done, I will move it to pc3
[15:14:06] To do the same thing
[15:14:41] awesome
[15:14:44] thanks <3
[15:17:05] Amir1: https://gerrit.wikimedia.org/r/c/operations/puppet/+/877205 thoughts?
[15:17:48] marostegui: yeah, I want to do it but I want the misplaced rows to be purged first
[15:17:59] otherwise it'll grow way too much
[15:18:02] as long as you have it on your radar, I am good
[15:18:12] i.e. I want to merge it in exactly three weeks
[15:18:30] yeah, definitely
[15:18:45] thanks :)
[15:19:41] :*****
[15:19:53] <3
[16:11:07] thanks for all the reviews marostegui, I'll aim to deploy them tomorrow
[16:26:37] cool
[16:31:50] Some of them will require a test with a mariadb restart. I want to make sure that if something isn't working we know at that moment, instead of finding out weeks later when mariadb is restarted for any other reason
[16:31:52] jbond: ^
[16:33:43] marostegui: yes I saw the comments. I'll coordinate in here with restarts and confirmations etc
[16:33:49] cheers
[17:48:17] I wonder who I could add from data persistence about a change in tranfer.py?
[17:48:24] *for review
[17:48:29] Amir, maybe?
[17:55:58] If I can understand it, sure
[17:56:05] right now I need to be afk for a bit
[20:33:13] pc1014 moved to pc3
[22:54:33] Amir1: I want to close https://phabricator.wikimedia.org/T291767, especially after your last comment there, any objections?
[22:58:29] marostegui: sounds good to me