[01:55:48] FIRING: PuppetFailure: Puppet has failed on ms-be2096:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[02:05:48] FIRING: [2x] PuppetFailure: Puppet has failed on ms-be2095:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[06:06:03] FIRING: [2x] PuppetFailure: Puppet has failed on ms-be2095:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:12:20] those are nodes still with DC-Ops for setup.
[08:14:08] 👍 now that I am aware it doesn't worry me
[08:16:45] looks like Dell sending us non-blank disks again, but I'll have more of a look after some caffeine
[08:37:21] non-blank...?!
[08:53:09] I'm restarting instances on db2239, it is unexpectedly consuming too much memory
[08:55:09] yeah, I presume it's some artifact of internal testing, but we keep finding that one of the spinning disks in Dell Config-J systems has a vfat filesystem on it (looks like an EFI partition from Windows)
[08:56:32] Our (re-)imaging of swift backends doesn't wipe the spinning disks (because we want to preserve them during upgrades), so this means puppet can't run to completion: it wants to make an xfs filesystem on each spinning disk, but won't overwrite what's already there.
[08:58:26] wait, so you use xfs for swift?
[08:58:51] (unrelated question)
[08:59:20] yes, hence my comments about the y2038 bugs, poor repair tooling, and slight concerns about performance when nearly full
[08:59:44] ok, that changes everything
[08:59:49] ?
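The decision Emperor describes (puppet will create xfs on a blank disk but refuses to overwrite an existing filesystem such as a stray vfat/EFI one) can be sketched as a small helper. This is a hypothetical illustration, not WMF tooling; it assumes the filesystem type was obtained with something like `blkid -o value -s TYPE /dev/sdX`, where an empty result means no signature was found:

```python
def needs_wipe(blkid_type: str) -> bool:
    """Given the TYPE reported by blkid for a spinning disk, decide
    whether it must be wiped (e.g. with wipefs) before puppet can
    create its xfs filesystem. Hypothetical helper for illustration."""
    blkid_type = blkid_type.strip().lower()
    if blkid_type == "":
        return False   # blank disk: puppet can proceed
    if blkid_type == "xfs":
        return False   # existing swift filesystem we want to preserve
    return True        # vfat (stray Windows EFI partition) etc.: wipe it
```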
[09:00:39] for me, sorry, this is an unrelated train of thought - sorry, I don't have any helpful tip re: partman
[09:01:27] I may use ext4 instead
[09:02:03] as I was seeing degraded performance exactly as I was getting close to a full drive
[09:05:18] Meh, fixing these two servers is fine - wipe the offending disk properly, re-image. I'll talk to dc-ops about asking Dell again to actually send us empty disks.
[13:41:02] Could I get an approval / 👍 on https://gitlab.wikimedia.org/repos/data_persistence/swift-ring/-/merge_requests/17 please? teaching the ring manager about some more codfw racks
[13:46:30] also on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250591 to add the hosts to the rings
[13:48:25] Emperor: looking
[13:58:52] thanks :)
[14:25:51] Raine: so what was the question you had for the dbas?
[14:26:05] jynus: would you be interested in reviewing https://gitlab.wikimedia.org/ladsgroup/db-password-rotation/-/merge_requests/4 perhaps, if you have time? The relevant change is in user_grant_handler.py - the rest is testing etc. and I can explain the changes in detail
[14:27:50] federico3: sorry, I don't think I have time for that - I have to attend an ongoing outage, do some interviews, attend service ops, keep working on the new media project and keep filling gaps for people on vacation
[14:28:52] no worries!
[14:44:16] jynus: so, for the ICU upgrade (https://phabricator.wikimedia.org/T419049), there is a new process that, unlike the usual approach, would avoid breaking sorting for a week
[14:44:38] ok
[14:44:45] the idea is to duplicate the categorylinks table, write the new sort keys into it asynchronously, and then swap the tables
[14:45:15] the questions are: (1) can we do that, and if yes, how? and (2) will cold buffer caches due to swapping to a new table murder performance?
[14:45:22] who developed that, and was it okayed by the dbas?
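The duplicate-and-swap idea above can be sketched as the pair of SQL statements it boils down to. This is only an illustration of the technique under discussion (the real implementation is the MediaWiki change); the `_new`/`_old` table names are hypothetical. The key property is that MySQL/MariaDB `RENAME TABLE` performs multiple renames as one atomic statement, so readers never observe a missing `categorylinks`:

```python
def swap_statements(table: str = "categorylinks") -> list[str]:
    """Return the create/swap DDL for the duplicate-and-swap plan.
    Hypothetical sketch; the async sortkey backfill between the two
    statements is done by a maintenance script, not shown here."""
    new, old = f"{table}_new", f"{table}_old"
    return [
        f"CREATE TABLE {new} LIKE {table};",
        # ...async maintenance writes the new ICU sortkeys into the new table...
        f"RENAME TABLE {table} TO {old}, {new} TO {table};",
    ]
```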
[14:46:13] the affected wikis are listed in https://phabricator.wikimedia.org/T419242 and notably do not include commons, but do include enwiki
[14:46:30] it was developed by Tim Starling, and I'll talk to him about the MW parts
[14:46:52] Amir okayed it from the "we can afford to have two enwiki-sized tables" perspective (though not two commons-sized tables)
[14:47:12] but I don't think any DB person okayed it from the cold-buffer-cache perspective
[14:47:22] I have no idea about (1), but for (2) you shouldn't care about the cache: if you write to the table, it will be hot. What you should be worried about is metadata locking, as any table swap will require a short period of exclusive lock, and if there are select pileups that would create an overload and outage
[14:47:35] right, okay
[14:48:11] that's useful to know :D thank you
[14:48:17] it doesn't have to be a long select; as long as selects overlap, new queries could pile up
[14:48:40] "at which point do we need to go read-only and restore read-write, from the DB perspective"
[14:49:16] you don't need to go read-only, the locks can be set per table, plus the swap should take care of that; the problem is the metadata locking when doing DDL on a very active table
[14:49:28] ok, that's really helpful, thank you
[14:49:37] the good news is that if you do that on the master, it may not have a lot of reads
[14:49:46] right
[14:49:54] the bad news is that the metadata locking will happen on the replicas
[14:49:59] actually, how would propagating that to the replicas work?
[14:50:03] so you may want to figure out something like
[14:50:44] depooling half of the replicas, applying the DDL, letting it replicate; if the pooled replicas overload, pool the unpooled ones, as they will have it applied already without locking
[14:50:49] (maybe?)
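The depool-half strategy jynus sketches above can be written down as an ordered plan. A minimal sketch, assuming a generic "depool / apply DDL / repool" vocabulary; the action names and the plain halving of the pool are hypothetical, not WMF's dbctl semantics:

```python
def depool_plan(replicas: list[str]) -> list[tuple[str, list[str]]]:
    """Split the replica pool in half and return ordered rollout steps:
    the depooled half takes the DDL (and its metadata lock) with no
    query traffic, then gets repooled so it can absorb reads if the
    other half overloads while it takes its turn."""
    half = len(replicas) // 2
    first, second = replicas[:half], replicas[half:]
    return [
        ("depool", first),
        ("apply_ddl", first),   # no pooled selects to pile up behind the lock
        ("repool", first),      # these hosts now have the change applied
        ("depool", second),
        ("apply_ddl", second),
        ("repool", second),
    ]
```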
[14:51:13] my suggestion is to treat that as a DDL request, following the procedure
[14:51:22] and I think by Monday you will have an assessment
[14:51:28] let me link it
[14:51:43] hmm, right
[14:52:36] just add on the additional ticket that Jaime told you to follow the schema change procedure, even if you know technically it is not one (but it has the same risks and procedures)
[14:52:55] okay, thanks, awesome
[14:52:58] I will read up on that
[14:53:01] https://wikitech.wikimedia.org/wiki/Schema_changes#Workflow_of_a_schema_change
[14:53:24] I believe that is where the template is
[14:53:35] it can be the same ticket, doesn't have to be separate
[14:53:47] yeah, I see it
[14:53:49] but it has all the info needed by the dbas: list of dbs, etc.
[14:54:32] you don't need to repeat info that is already on another ticket, just link it
[14:55:27] yeah, will do
[14:55:30] I am not sure if the automation will have support for that, but it may work to minimize metadata locking, and extra risk minimization may be needed for the metadata locking
[14:56:01] I can comment what I just told you after you fill in the ticket asking for help, but you will get served faster if the typical format is followed
[14:56:10] yep, thank you
[14:56:25] I assume I should do two of them?
[14:56:33] basically, the most important part is the context and the list of wikis
[14:56:39] one for creating the new table and one for the actual swap?
[14:56:45] em, why 2?
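One common way to bound the metadata-locking risk discussed above: MySQL/MariaDB's `lock_wait_timeout` (in seconds) also covers metadata locks, so a rename that cannot get its exclusive lock quickly fails instead of queueing every incoming SELECT behind it. A hedged sketch with hypothetical table names; whether this fits the actual rollout is for the DBAs to decide:

```python
def guarded_rename(table: str, timeout_s: int = 5) -> list[str]:
    """Session-level statements that make the swap fail fast (with a
    lock wait timeout error) rather than pile up queries if the
    metadata lock is contended. Retry later instead of blocking."""
    return [
        f"SET SESSION lock_wait_timeout = {timeout_s};",
        f"RENAME TABLE {table} TO {table}_old, {table}_new TO {table};",
    ]
```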
[14:57:46] ah, no worries, just say it has 2 parts
[14:58:02] and make them extremely clear, just what the dba needs to know
[14:58:08] the creation in fact is super safe
[14:58:31] and can be done at any time - I am guessing the swap has to be done shortly after it, so you may want to ask the dbas to do both
[14:58:44] but creating tables in production is open to any deployer
[14:58:51] and so is writing there / running maintenance
[14:59:06] running the alter (/rename) is the part that dbas have to do, and what's dangerous
[15:00:19] Can I get a +1 on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1250609 please, adding 4 new codfw ms frontends?
[15:00:48] actually, it has to be done on the master, it cannot be done with the usual replication method, because the change is not forward-compatible
[15:01:27] so yeah: creation - running maintenance - depooling - running on master - pooling, double-checking there's no overload, repeating
[15:01:37] for small wikis that's usually ok
[15:01:52] for large wikis there will likely be lag and it has to be actively monitored
[15:02:14] so you did right to contact the dbas; it shouldn't be a complex operation, but it is definitely a risky one
[15:02:42] Raine: please create the task and I will summarize everything I said here as a comment
[15:04:22] Raine: disclaimer, also: please understand I have not done a schema change in 7 years, so while I wrote the page I sent you to, I am not the canonical person decision-wise and they could decide differently
[15:05:26] going back now to firefighting
[15:07:23] actually, I just realized: because a new temporary table will be created, it may require some extra filters for cloud
[15:07:42] so that would be the first blocker
[15:08:17] or maybe not, I'm not sure, but there will be some strange interaction there
[15:23:47] Raine: in terms of timeline, how urgent is this?
I would recommend having Manuel involved in the conversation when he's back from vacation
[15:26:01] federico3: I opined the same, but our manager told me to answer them now
[15:26:30] him or am*r
[15:40:19] so, it's supposed to be finished this quarter because it blocks next quarter's work like Debian upgrades
[15:41:25] I very much want to wait for M.anuel or A.mir for the final go/no-go
[15:44:38] but it's super helpful to have the context from you, jynus, as that means I can start the prep (collecting requirements, planning the sequence on my side, etc.) now
[15:45:37] so thanks a ton <3 I will create the tasks and clarify things in the main task's runbook
[15:57:44] please add that Q requirement to the ticket for the dbas
[16:16:14] will do, thank you
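Pulling the whole conversation together, jynus's per-wiki sequence (creation - running maintenance - depooling - running on master - pooling - double-checking no overload - repeating) can be sketched as a loop. The `run` and `lag` callables and all action names are hypothetical interfaces invented for illustration, injected so the sequencing itself can be tested:

```python
import time

def per_wiki_rollout(wikis, run, lag, max_lag=5):
    """Apply the categorylinks duplicate-and-swap to each wiki in turn,
    waiting for replication lag to settle before moving on (large wikis
    have to be actively monitored)."""
    for wiki in wikis:
        run(wiki, "create_new_table")   # safe; open to any deployer
        run(wiki, "run_maintenance")    # async sortkey backfill
        run(wiki, "depool_replicas")
        run(wiki, "swap_on_master")     # the risky rename, done on the master
        run(wiki, "repool_replicas")
        while lag(wiki) > max_lag:      # don't proceed while replicas lag
            time.sleep(1)
```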