[07:17:55] restbase2018 is botched, there are hung kernel processes (possibly related to mdadm) and all nodetool commands (and other commands doing I/O) hang, I'm going to powercycle it via mgmt
[08:20:42] Amir1: about switching VE on enwiki: can you check that the panels I made for checking X2 health on the VE dashboard are actually measuring the right thing? I'm a bit unsure about the disk space one. It looks suspiciously flat... https://grafana.wikimedia.org/d/OxxOv5K4k/ve-backend-dashboard
[08:22:59] thanos query: sum(node_filesystem_avail_bytes{site=~"codfw", instance=~"db2142.*", mountpoint="/srv"})
[08:23:49] ...and the same for db1152
[08:24:34] Based on the current graphs, x2 seems completely unaffected by small and medium wikis now using it to stash VE state.
[08:26:03] Given that it shows 8TiB free, even the full estimated 140GB requirement would be a 2% blip
[08:28:15] The estimated additional 100 writes per second would also barely be noticeable compared to the sustained load of around 7000.
[08:40:44] It all sounds safe to me, what do you think?
[10:33:35] duesen: why do you think it would add 140GB to x2?
[10:50:56] I honestly think free disk space is not the right metric for x2; if we dump large stuff into x2, we will have way bigger problems long before it fills up
[11:35:24] Amir1: x2 holds the main stash, right?
[11:35:46] sure
[11:36:04] mainstash for edits shouldn't hold large blobs for a long time
[11:36:05] A couple of months ago I asked Eric to check the utilization of the stash store on the restbase cluster
[11:36:28] The result is documented in https://phabricator.wikimedia.org/T320536:
[11:36:34] "Based on the metrics about the stash backend used by RESTBase, we estimate a need for about 140GB of storage capacity for the VE edit stash. This is based on the measurement of 100 writes per second and a 24 hours TTL, with an average of 20KB per entry."
[11:36:55] This is not permanent storage, it's a 24h stash
[11:37:08] We discussed this at length half a year ago, I believe
[11:37:17] when we deployed mainstash on x2 it was half a gigabyte
[11:37:41] yes, i know...
[11:37:49] 24h is way too long tbh
[11:38:04] Then we need to find an alternative solution quickly
[11:38:15] Can you put your concerns and ideas on https://phabricator.wikimedia.org/T320536 ?
[11:38:44] sure
[11:39:27] We can easily configure VE/parsoid to use any other BagOStuff. Even a different one per wiki.
[11:40:21] I think we discussed the option to set up a separate db for this on the cluster that holds ParserCache.
[11:41:10] Does that sound like a good idea? How hard would that be, how long would it take?
[11:41:46] (switching over is annoying... VE sessions would lose context, edits would fail...)
[11:42:29] Amir1: apart from all that - what would be a better health metric than free disk space?
[11:44:11] max replag in x2
[11:45:41] I'm not saying we shouldn't be using x2, I'm saying it needs a bit of tuning before switching over
[11:45:55] write ops in x2 is also a good metric
[11:46:50] Yea, I'm already tracking write ops. Can you point me to an example for tracking replag?
[11:48:28] Ah, found it, I think
[11:58:48] Amir1: do you think we can try to switch enwiki on Monday and see how it impacts the stash? Or would you want to block this on setting up a dedicated stash db?
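(A rough sanity check of the figures above, as a sketch only: multiplying out the T320536 numbers gives a worst-case size somewhat above the quoted ~140GB, presumably because the estimate nets out overwrites and early expiry, but it is the same order of magnitude and consistent with the "2% blip" remark.)

```python
# Back-of-the-envelope check of the stash sizing quoted from T320536.
# Assumes the worst case: every stashed entry survives its full 24h TTL.

writes_per_second = 100          # estimated additional VE stash writes
ttl_seconds = 24 * 3600          # 24 hour TTL
avg_entry_kb = 20                # average entry size

live_entries = writes_per_second * ttl_seconds        # 8,640,000 entries in flight
size_gb = live_entries * avg_entry_kb / 1e6           # ~173 GB

free_gb = 8 * 1099.5                                  # ~8 TiB free on /srv, in GB
print(f"~{size_gb:.0f} GB stashed, ~{size_gb / free_gb:.1%} of free space")
# -> ~173 GB stashed, ~2.0% of free space
```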
[11:59:18] I'm happy with letting enwiki go but we might need to revert it right after if metrics don't look good
[12:00:19] this is good to keep an eye on https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=db1152&var-datasource=thanos&var-cluster=wmcs&from=now-1h&to=now
[12:01:31] I think all of x2 is not backed up on disk and reads and writes are memory-only, so your limit is around 350GB
[12:01:51] which should be fine
[12:03:58] Ok, let's try it. We designed the component in VE to allow us to safely switch back without losing data.
[12:14:33] Amir1: So, what's plan B? Making a separate stash DB somewhere else?
[12:17:22] I need to think about it tbh
[12:25:24] duesen: my 100% honest opinion is that x2 will work but 1- needs optimization like compression and later reducing ttl based on data 2- I'm skeptical of 140GB
[12:29:54] s/but/because/
[13:20:05] Amir1: I have asked Peter Pelberg for data, let's see what he comes up with. But what if we can't reduce the TTL?
[13:21:37] Then I'll spend some time to figure out whether it's still sustainable to use x2 and if not, what would be the replacement
[13:22:00] and if it's removed once the edit is saved, I think it's fine to store it for 24 hours
[13:37:30] We currently don't have a mechanism to "unstash" something. Knowing exactly when which stashed data can be deleted is a non-trivial problem, and the deletion would have to be plumbed through several layers of abstraction. Doable, but quite a bit of work.
[13:38:08] Also, probably not worthwhile
[13:38:34] 90% of edits are abandoned. So the removal would only be possible for 10% of the stashes anyway.
[13:39:02] ...because we only get a signal when people actually save the page. When they close the tab or hit the back button, we don't.
[13:55:30] duesen: there is also session recovery when you re-open the tab, and localStorage support within VE to recover your transaction after a crash.
[13:57:44] Amir1: I think the main thing that's different here is that this has zero relation to MW core's "edit stash" feature. This is not about stashing "edits". This is storing the Parsoid output as-is of the page revision that is current before you start editing. In wikitext, the equivalent to this is simply the timestamp and revid in the form field, because the text is trivial to retrieve from externalstore upon saving wikitext edits. For VisualEditor, the equivalent to a stable merge base is the Parsoid HTML, which Parsoid then safely processes into a clean diff (avoiding dirty diffs). Made further difficult by the fact that oldids cannot be deterministically parsed once non-current, and by the fact that ParserCache is optimised to store the last revision only.
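(Picking up Amir1's 12:25 point about compression: a minimal sketch of what compressing stash entries before writing them to x2 could look like, assuming the values are Parsoid HTML strings. The function names and the sample blob are illustrative only, not actual VE/MediaWiki code.)

```python
import zlib

def stash_compress(html: str) -> bytes:
    # Parsoid HTML is very repetitive markup, so DEFLATE typically shrinks
    # a ~20KB entry substantially before it is written to the stash.
    return zlib.compress(html.encode("utf-8"), 6)

def stash_decompress(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")

# Stand-in for Parsoid output; the real entries average ~20KB per T320536.
sample = '<p about="#mwt1" typeof="mw:Transclusion">lorem ipsum</p>' * 350
packed = stash_compress(sample)
print(len(sample.encode("utf-8")), "->", len(packed), "bytes")
assert stash_decompress(packed) == sample
```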
[15:40:45] sure
[15:41:01] cc topranks
[15:41:37] ok
[15:41:46] basically, from the point of view of remote recursive caches on the internet, this may cause your domains to sporadically not work for a while until various TTLs time out, etc
[15:42:12] (or may not, the results are unpredictable, but there's a safer way)
[15:42:36] (revert merged)
[15:42:49] bblack: the existing NS IPs are still working
[15:43:01] so queries will get answers on either
[15:43:06] I'll try to be succinct and not go into the 10 pages I normally do
[15:43:14] https://phabricator.wikimedia.org/P49389
[15:43:14] but basically
[15:43:30] 10 pages on dns works for me :P
[15:43:47] :-)
[15:43:59] the NS records for these zones (foo NS some.server), and the corresponding A-records elsewhere (some.server A 1.2.3.4), those are separate things in the remote caches
[15:44:08] they can be picked up at different times and expire at different times
[15:44:26] yeah indeed, we probably should not remove A/AAAA records for ns0/ns1 immediately
[15:44:36] I see
[15:44:39] so if you simultaneously both swap the hostnames the NS points at *and* delete the address records....
[15:44:49] some will have the old NS, but fail lookup of the old A, and other such scenarios
[15:45:02] yeah, that certainly makes a lot of sense
[15:45:03] add the new A records for "ns" and "ns-recursor" at the same time we change the NS records to point to those
[15:45:07] basically you need to make just the NS-swap first, without deleting the old A-records, then wait out some TTL periods, then delete the old A-records
[15:45:15] then wait and remove the A/AAAA records for ns0/ns1 after a while
[15:45:24] yep yep makes sense
[15:45:27] thanks!
[15:45:57] (and in the case of NS stuff, I would wait quite a while. You also need to update the registrar, and some bad caches out there might take longer than TTL dictates, etc)
[15:47:39] and while digging at the registrar part, I got confused, too
[15:47:54] I'm not sure we need to update any registrar? Wikimedia NS are still authoritative for the registered domains, nothing needs to change in the TLD zones I think?
[15:48:04] nevermind, in my haste I thought these were top-level NS
[15:48:27] so yeah, the whole "delete the A-records later after some multiple of TTL" still applies, but ignore all that stuff about registrars :)
[15:48:28] no worries, yeah these are just delegations from WMF DNS for sub-domains to the WMCS name servers
[15:49:40] it's one of those rules of DNS that's mostly undocumented that everyone discovers eventually.
[15:49:48] * arturo nods
[15:50:31] cool thanks. FYI codfw has no live users, so the damage isn't catastrophic if something breaks. Want to avoid that still, but it's also probably led us to be a bit less diligent prepping the changes.
[15:50:35] in cases where the DNS lookup is basically a chain of separate records (NS->A, or CNAME->anything, etc) - be careful with change/delete of both at the same time. You have to assume the world will contain mismatched copies of the linked records for a while.
[15:51:14] bblack: I've noticed that in the past yep. Any insight into why? Like common DNS software that doesn't do what it should?
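(A small illustration of bblack's point that a zone's NS record and the address records of the nameservers it points at are separate cache entries with independent TTLs. This assumes the dnspython library is available; example.org stands in for the real zones and hosts being changed here.)

```python
import dns.resolver

zone = "example.org"
ns_answer = dns.resolver.resolve(zone, "NS")
print(f"{zone} NS ttl={ns_answer.rrset.ttl}")

for ns in ns_answer:
    target = str(ns.target)
    # Each nameserver's address record is looked up (and cached) on its own,
    # so it can expire before or after the NS record that referenced it.
    a_answer = dns.resolver.resolve(target, "A")
    print(f"{target} A ttl={a_answer.rrset.ttl} -> {[r.address for r in a_answer]}")
```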
[15:51:29] no, it's really how the protocol works, and just something you have to be operationally careful of
[15:51:32] I guess you cannot simply force a third party into forgetting some records
[15:51:33] specifically I've noticed NS entries cached beyond TTL
[15:52:00] even *after* learning this lesson more than once, I once briefly broke wikipedia with a CNAME change of this sort :)
[15:52:43] yeah, the issue of caches not always honoring TTLs correctly in various ways is a whole other issue on its own that exacerbates this
[15:52:59] but even if you assume the whole world strictly obeys the TTLs you hand out, you still have the dependent-records problem.
[15:53:14] even if the two records have the same TTL
[15:53:49] basically, even if you give two records the same TTL, and you think they're commonly always queried together, you can't assume caches will always have them stored with the same TTL, because:
[15:53:54] yeah so if the A record cache expires before the NS one you're in trouble
[15:54:33] 1) They can be looked up independently of each other for other reasons at different times. Random query of the A-record by someone for something before this cache happened to query the NS record that led to it, or whatever.
[15:55:01] 2) Cache eviction: even if they entered a cache together with the same TTL, one might later be evicted earlier than the other to make room for some unrelated data, thus breaking the sync of their TTLs
[15:55:23] add in that some caches are layered on top of querying sets of other caches, and it's a mess
[15:55:28] so you just can't ever assume
[15:55:48] ok
[15:55:59] I guess the approach should be ok if we can:
[15:56:11] 1) Ensure both the old and new IPs respond to queries simultaneously
[15:56:43] 2) Add the new ns.xxx A records and change the NS delegation to those new names, but don't touch the old ns0/ns1 A records
[15:57:07] 3) Wait a silly amount of time and check what we see coming in to the old IPs in terms of queries, before finally removing the ns0/ns1 A records
[15:57:21] right
[15:57:38] for a less-critical case, a multiple of the largest of the related TTLs should be fine without monitoring
[15:57:46] but yeah, if you're really paranoid, monitor traffic
[15:58:19] a small multiple I mean, like 2x or 3x just to be safe-ish
[15:58:32] this isn't too critical, but these network changes will be replicated in eqiad at some point so we want to use the exercise to learn how best to do it there
[15:58:51] right, makes sense!
[16:01:55] thanks for the assistance bblack, really appreciated. We will also make notes for this when we do the same in the eqiad1 deployment
[16:02:09] which will definitely have way more customer impact
[16:03:08] np!
[16:06:59] bblack: if you had a moment could you look at this patch?
[16:07:00] https://gerrit.wikimedia.org/r/c/operations/dns/+/928620
[16:07:28] basically the same but we are leaving the A/AAAA records for the old NS server hostnames in place, to cover the case where cached NS records still point at those names
[16:09:14] [done, +1]
[16:09:51] great, thanks for catching that one!
[16:22:37] bblack: this is more out of curiosity but I think there are some (pre-existing) broken things in this setup
[16:22:55] nsX.wikimedia.org servers are auth for wikimediacloud.org for instance
[16:23:16] They delegate codfw1dev.wikimediacloud.org to the WMCS servers
[16:26:09] actually sorry, I'm getting mixed up between wikimedia.cloud (delegated) and wikimediacloud.org (not delegated)
[16:26:14] ignore me!
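(A sketch of the kind of check step 3 above describes: before the old ns0/ns1 A/AAAA records are removed, query the old and new server addresses directly and confirm they all still answer for the delegated zone. Again assumes dnspython; the IP addresses below are documentation placeholders, not the real WMCS servers.)

```python
import dns.flags
import dns.message
import dns.query
import dns.rcode
import dns.rdatatype

ZONE = "codfw1dev.wikimediacloud.org"      # delegated sub-zone mentioned above
SERVERS = {
    "old-ns0": "192.0.2.10",               # placeholder addresses only
    "old-ns1": "192.0.2.11",
    "new-ns": "198.51.100.10",
}

query = dns.message.make_query(ZONE, dns.rdatatype.SOA)
for name, ip in SERVERS.items():
    try:
        reply = dns.query.udp(query, ip, timeout=3)
        aa = bool(reply.flags & dns.flags.AA)
        print(f"{name} ({ip}): rcode={dns.rcode.to_text(reply.rcode())} authoritative={aa}")
    except Exception as exc:
        print(f"{name} ({ip}): no answer ({exc})")
```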
[16:35:16] * topranks finds the answer in RFC7816 appendix A
[16:44:22] :)
[16:44:48] I timed my muffuletta-building process perfectly to miss the whole thing :)
[16:46:35] haha :)
[20:04:20] hi! I need some help
[20:04:32] I had to revert a puppet patch and I need someone to help merge it and run a cookbook
[20:04:39] https://gerrit.wikimedia.org/r/c/operations/puppet/+/928927
[20:05:08] https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/AQS#Deploy_new_History_snapshot_for_Wikistats_Backend
[20:18:14] milimetric: I can do that.
[20:24:51] milimetric: Restarting aqs daemons now with the rolled-back version of the snapshot https://sal.toolforge.org/log/bn7UoYgBhuQtenzvA4QT
[20:57:36] thanks very much, all's well now with the revert