[00:46:22] it's deliberate self-harm to degrade our capacity factor by switching off multi-DC for two weeks per year
[00:46:24] "The reasoning behind this is to make sure that the data center receiving all the traffic can survive an entire week's traffic patterns."
[00:46:45] in other words, throw it in the deep end and see if it drowns
[00:47:03] maybe it will survive, maybe not, but at least we will know, right?
[00:47:52] and of course we do not have any additional monitoring for that one-week period, we are just relying on something to page if it fails
[00:48:19] caches will be cold in the secondary DC, so we are potentially setting ourselves up for hours of downtime in the event of a traffic spike
[00:56:14] surely there are better ways to measure our capacity factor than just depooling servers until something breaks?
[00:57:24] Tim, I found your comments surprising; for as long as we have done these tests, I remember finding out about a new service that was broken in terms of multi-dc/depooling/different-dc logic
[00:57:59] sometimes major, sometimes minor
[00:59:03] I would be fine with it if you depooled it for a short period, had a checklist to go through, and repooled it when the checklist was finished
[00:59:05] and in terms of performance penalty, I think there were a few milliseconds of regression, probably due to cloud lack of redundancy and geographic shift, but nothing major
[00:59:28] I am just complaining about leaving it depooled for 7 days while nobody is doing anything, just waiting for something to break
[01:00:33] so what I am trying to understand is what major harm is happening that is not worth the check/way in which we can do better?
[01:00:42] *ways
[01:01:18] I'm not concerned about the measurable performance penalty
[01:01:50] I'm concerned that somebody important will die while one DC is depooled, and we'll run out of FPM workers or something, and by the time we fix it, it'll be old news
[01:02:54] you only get a window of a few hours if you want to participate in delivering breaking news
[01:03:10] so I have one counter-question and one follow-up: nothing is permanent - precisely, this helps us get more familiar with change and a dynamic status
[01:03:35] and what if that happens right now, while codfw is depooled, or esams is depooled?
[01:03:50] we have to be ready precisely for that!
[01:04:41] in other words - the alternative is to not check, and find out then, not in advance! :-D
[01:05:23] but again, I would like to understand better the existing pain points to improve - I am not invalidating what you say
[01:06:52] e.g. I am sure I am lacking a lot of your perspective
[01:07:19] normally a disaster does involve an intersection of unlikely events
[01:07:34] for example, accidentally degraded redundancy combined with a traffic spike
[01:07:38] yeah
[01:07:54] so I think you are afraid of that being more likely in a degraded state?
[01:08:09] "degraded" (meaning with an inactive dc)
[01:08:38] two weeks a year of degraded redundancy means we have full redundancy for 96% of the time
[01:08:49] my (and I believe many of my workmates') point is that we should self-degrade more often, not less
[01:09:19] if we had two days a year instead, we would have full redundancy for 99.5% of the year
[01:09:20] as the issue almost every time is human-triggered
[01:09:38] and this particular intersection of events would be 7 times less likely
[01:10:07] I think that is an unfair view - although you don't have to agree with me
[01:10:30] redundancy != uptime
[01:10:55] and it assumes we are otherwise 100% available
[01:11:26] when in reality, for example, the dc switch avoids lots of db downtime thanks to simplified procedures
[01:11:45] and allows for a more reliable network thanks to network upgrades
[01:13:06] I'm fine with network upgrades too -- any kind of active process which benefits from depooling a DC
[01:13:22] just not sitting around doing nothing with a depooled DC, waiting for failure
[01:13:29] maintenance and human error are more common causes of downtime than hardware failure or other external factors, but I don't have the figures in front of me to show it
[01:14:10] that's where I think you are being a bit unfair :-D
[01:14:32] s/waiting for failure/doing important upgrades to infrastructure/
[01:15:02] the rest of the time codfw can be primary and eqiad secondary, as that is our normal topology
[01:15:22] you are saying that network upgrades will take exactly 14 days per year?
[01:15:37] actually I think more than that
[01:15:42] let me check
[01:16:56] https://phabricator.wikimedia.org/T327248 8 days in total last time
[01:17:12] but of course we had to account for failed days and give a margin
[01:17:24] and we usually don't make people work on weekends
[01:17:31] (sometimes we do, sadly)
[01:18:07] so that's for network
[01:18:20] schema changes on the master take a long time, too
[01:18:53] maybe you weren't aware of how complex those operations can be?
[01:19:30] I was watching closely last time, I made the same criticism in #mediawiki_security
[01:20:01] I meant the network upgrade, not the dc switch :-D
[01:20:11] if you are saying that the wikitech page is incorrect, please just update it
[01:20:33] let me double-check what it says
[01:20:40] I quoted it before, it doesn't say we leave the DC depooled for network upgrades and schema changes
[01:21:18] "The reasoning behind this is to make sure that the data center receiving all the traffic can survive an entire week's traffic patterns."
[01:21:21] I am guessing it is simplified
[01:21:29] and not intended to give all the details
[01:22:02] I'll restate my position
[01:22:26] I am saying that it is fine to depool a datacentre to allow necessary work to take place, for the duration of that work
[01:22:46] I am not OK with depooling it just to see if the other one holds up under the traffic
[01:23:34] because some day the answer will be no, and I will be sad about that
[01:24:22] So I think I justified the first part; as far as I remember, there is always maintenance scheduled for it - I will try to clarify that in the docs
[01:25:27] my question about the second is (and please note I am not trying to be combative, just trying to understand the pain better): what should we do instead?
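A quick sanity check of the availability figures quoted above; this is plain arithmetic on the 14-day and 2-day windows under discussion, nothing else is assumed:

  # fraction of the year with both core DCs pooled, for the two windows discussed
  echo "scale=3; (1 - 14/365) * 100" | bc   # ~96.2 -> "96% of the time"
  echo "scale=3; (1 - 2/365) * 100" | bc    # ~99.5 -> "99.5% of the year"
  # assuming spikes land uniformly over the year, the exposure shrinks by the same ratio
  echo "14/2" | bc                          # 7 -> "7 times less likely"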
[01:25:50] I agree there are risks involved
[01:26:21] plan for a period of single-DC operation, but make it flexible instead of a fixed 7 days
[01:26:47] once maintenance work is complete, repool the secondary DC
[01:26:48] ok, see - now that is useful and actionable - I wasn't understanding before
[01:27:00] sorry about my misunderstanding
[01:28:32] I believe in this case it will be quite flexible, as it will be left like that for 6 months (not single-dc, but primary-codfw)
[01:29:31] regarding the actual single-dc time, I guess shrinking it is doable, but it may stress a lot of people, as it takes broadly 3 days to do all the switches
[01:30:54] would you want to see services, dns and mw being done on a single day?
[01:31:22] and then repool the secondary the following day, for example?
[01:31:38] sure, something like that
[01:32:57] it's less concerning to have a DC depooled if there is a way to repool it at short notice
[01:33:11] that is useful feedback - whether it is actionable will depend on resources, team availability and advice, but I can pass that desire along with no problem (obviously with no promises, as there will be reasons not to do it so fast)
[01:33:48] TimStarling: I believe that is actually the reason why it is staged - so it is easier to revert
[01:34:39] if the network is unplugged, you probably can't revert at a moment's notice, so best to minimise that kind of downtime
[01:34:41] But it will have to be someone else who tells you why it was designed like that, and the feasibility of changing it
[01:35:30] as for schema changes -- I guess you can kill an ALTER TABLE if there is an emergency
[01:35:33] that last sentence, I didn't get it
[01:35:49] netops don't kill the network for longer than needed! :-D
[01:36:30] in fact, in many cases (both for db maintenance and network) it is made longer so it can be either reverted or re-enabled faster
[01:36:51] in other words, a longer period makes it easier to re-enable it in the middle
[01:37:21] and I believe that is what is happening for the dc switch too - it is not fully committed in the middle, and can be partially reverted
[01:37:59] response time is shorter when there is active work, with people actually online and watching things
[01:38:23] when you leave it depooled over a weekend, response time will be longer if you have to page or wake people up
[01:40:04] I don't disagree with your statements, but I don't see how that is relevant - failures can (and actually do) happen on weekends anyway
[01:40:25] most of the change happens from Tuesday to Thursday
[01:41:04] but if the point is "let's not leave X depooled during the weekend"
[01:41:15] I think that is, again, very reasonable feedback
[01:41:33] I don't want to make things too onerous -- after all, we are just returning to a situation which was normal a year ago
[01:41:45] and I can also note it and bring it to the coordinators
[01:42:14] I just want to make sure we have good reasons to reduce redundancy, and that we take reasonable steps to minimise the single-DC period
[01:42:23] if I may, and sorry if I am being too personal, I think you are having too much fear, yes
[01:42:47] as you say, it was the normal case a few years ago
[01:43:01] and we depool entire dcs relatively often
[01:43:15] and we have to live with that :-D
[01:43:36] the personal reason I have for raising this is that I spent a lot of time making multi-DC work, and supposedly the main benefit of that work was redundancy
[01:43:51] So I am more in the mindset of "let's embrace failure and chaos"
[01:43:58] and be ready for it
[01:44:03] so to say that redundancy doesn't matter seems to undermine the perceived utility of that work
[01:44:09] rather than "let's not touch things for fear of breaking them"
[01:45:03] I have my own opinions about that - but that is a longer conversation that I don't want to open so late (for me) :-D
[01:45:25] in other words, what was the point of doing multi-DC if we are not going to value it sufficiently to make sure it stays enabled for 99% of the time?
[01:45:26] (of course I see multi-dc as positive)
[01:49:15] I think we have different views of availability, but we should continue this conversation over a soft drink next time we meet :-D
[11:09:59] TimStarling: FWIW, it's not really just a test. Just for this switchover, I have three different tasks I'm planning to do in September since eqiad will be depooled (multiply that by the number of SREs); there will be a lot of maintenance happening as well. It just wasn't communicated
[12:04:39] TimStarling: Let me start by saying thanks for providing this feedback. Now to try and answer some of your concerns. The paragraph you are quoting for the reasoning was indeed incomplete. Risky maintenance work is clearly spelled out as a motivation in the Background section (and other sections) of the doc, but I mistakenly omitted it in the reasoning. Thanks for catching that (it would have been awesome to catch it back when the doc was being circulated, but better late than never). I've submitted https://wikitech.wikimedia.org/w/index.php?title=Switch_Datacenter/Recurring,_Equinox-based,_Data_Center_Switchovers&diff=prev&oldid=2108097 to fix it and point out that otherwise-risky maintenance work will be happening too during that time frame. More than one team in SRE has pointed out they intend to use it and have already scheduled work. I think it's important to point out that we can always adapt to changing circumstances. If maintenance work stops happening/slows down/takes considerably less time during those time frames (for whatever reason), we will adapt the Switchover process, e.g. we stop having 7 fixed days and move to something more suitable (whatever that may be)
[12:05:16] As to the cold caches thing, we were pretty worried about that too last time (like every time). We always had issues with cold caches when switching data centers, as you probably very well remember. This time around, we didn't need to run the pre-warming part of the process and it went fine. We believe this is mostly due to the Multi-DC work, which we are pretty thankful to you for.
[12:05:51] For the 96% redundancy part, I am not sure this is a helpful metric. It only gives me an estimate of the opportunity for something bad to happen, not any certainty. Error rates to end users across days/weeks/quarters/years matter more to me.
[12:06:15] But in any case, we know pretty well that in case of an emergency (e.g. a celebrity death, as you describe) we can pool the drained DC pretty quickly (and we aim to become even faster and more confident at that; that's part of the point of making this happen more often). By quickly, I mean minutes, not hours. So even if that opportunity materializes, we will react, and react fast.
[13:01:22] Worth running a rebalance on the ganeti cluster or nah?
[13:01:33] I was about to ask
[13:01:40] 12:59:40 +icinga-wm │ PROBLEM - Ganeti memory on ganeti1019 is CRITICAL: CRIT Memory 95% used
[13:01:48] Too bad moritz.m just left lol
[13:02:07] akosiaris: opinion ^ ?
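A minimal sketch of how one might inspect the alerting node before deciding on a rebalance, assuming the standard Ganeti CLI run on the cluster master; the node name is the one from the alert above, and exact output fields may differ per Ganeti version:

  # per-node memory plus primary/secondary instance counts for the alerting node
  sudo gnt-node list ganeti1019.eqiad.wmnet
  # which instances have ganeti1019 as their primary node
  sudo gnt-instance list -o name,pnode,status | grep ganeti1019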
[13:02:09] does it mean that one or more VMs on it are causing the increase in memory pressure?
[13:02:29] elukey: I think it may have something to do with the reboots
[13:02:43] I'll check if hbal has a dry-run option
[13:02:53] the list of VMs on 1019: https://phabricator.wikimedia.org/P52239
[13:05:04] catching up on the multi-dc convo a bit :)
[13:05:20] elukey: Major consumer reported by the alert is search-loader
[13:05:24] the top 3 (by memory usage) qemu instances are an-test-presto1001.eqiad.wmnet, an-test-druid1001.eqiad.wmnet and search-loader
[13:05:35] claime: :)
[13:05:41] lol
[13:05:43] 1) I agree sitting around for a week or more with a core DC depooled for "no reason" isn't great, but I suspect even at every 6 months, in practice, we'll have enough maintenance to justify ~a week anyways during that period.
[13:06:46] 2) But: what I do like to push back on is the idea that the secondary DC is open to random per-cluster offline maintenance/upgrade/whatever for extended periods. That's what really eats up our supposed redundancy budget.
[13:07:00] claime: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-7d&to=now&var-server=ganeti1019&var-datasource=thanos&var-cluster=ganeti&viewPanel=4
[13:07:17] after the reboot 1019 doubled its usage
[13:07:35] not sure if Moritz did some rebalancing while at it
[13:09:04] other than the short time when one's offline for truly-necessary maint that can't be done any other way (e.g. network upgrades), I feel like it's crossing a line when our attitude is more along the lines of "oh eqiad is secondary these 6 months, so let's take the eqiad restbase offline for a month or two for upgrades and offline experiments". Clusters should maintain normal online operations, and should aim to go through changes and upgrades in an online fashion on both sides, regardless of the equinox cycle.
[13:09:34] elukey: I think it needs a rebalance; allocated memory has actually decreased, so I think it's just a matter of spreading the load better
[13:09:37] otherwise all we end up reliably having is a maintenance-failover cycle, rather than true redundancy when we need it.
[13:09:48] elukey: https://grafana.wikimedia.org/goto/R20m4FzIz?orgId=1
[13:10:38] could be, yes
[13:10:52] I can't tell if it got more VMs after the reboot or not
[13:11:00] otherwise I can't explain the "Used" memory bump
[13:16:21] elukey: hbal isn't planning on removing anything from 1019
[13:16:25] so err
[13:17:13] idk
[13:17:14] claime: never done it before, but if there is a way to move specific VMs away from 1019 we could relocate search-loader
[13:17:39] claime: elukey: if the reboots are over for the cluster, an hbal call in a tmux (or via --submit-jobs) will slowly rebalance it
[13:17:40] or better, one of the an-test nodes
[13:17:52] this happens after every ganeti cluster reboot
[13:17:57] ahhh TIL
[13:18:17] I think it's in the docs, let me check. If it is not, I'll add it
[13:18:31] It is
[13:18:40] akosiaris: how do we explain https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-7d&to=now&var-server=ganeti1019&var-datasource=thanos&var-cluster=ganeti&viewPanel=4 ?
[13:18:51] The question is, hbal isn't moving any VM from the one host that is alerting for memory pressure
[13:18:51] more VMs or just something else?
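A sketch of the two options being weighed here, assuming the usual Ganeti tooling on the cluster master; the node group name is a hypothetical placeholder and the instance is the one named above:

  # dry run: without -X, hbal only prints the moves it would make for the group
  sudo hbal -L -G row_A          # "row_A" is a placeholder group name
  # execute the rebalance through the job queue (the conversation above also
  # mentions --submit-jobs as a way to hand the moves off and let them run slowly)
  sudo hbal -L -G row_A -X
  # or just live-migrate one heavy instance to its secondary (DRBD) node to clear the alert
  sudo gnt-instance migrate an-test-druid1001.eqiad.wmnet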
[13:18:53] x)
[13:18:58] https://wikitech.wikimedia.org/wiki/Ganeti#Cluster_rebalancing
[13:20:25] elukey: more VMs probably
[13:20:34] I see 13, more than any other node in the cluster
[13:20:43] I only checked for the nodegroup, maybe rebalancing the whole cluster would make it better
[13:20:50] I can go through jobs to correlate timeframes
[13:21:23] nah, if it is time-consuming we can definitely skip it
[13:21:35] not sure if Moritz is done with the reboots though
[13:21:45] "On an multi-group cluster, select this group for processing. Otherwise hbal will abort, since it cannot balance multiple groups at the same time."
[13:21:57] One host left to reboot in eqiad, because puppetdb can't be evicted
[13:21:59] (from the task)
[13:22:25] node: ganeti1028.eqiad.wmnet
[13:22:25] 13:22:08 up 52 days, 23:24, 0 users, load average: 5.54, 6.11, 6.51
[13:22:33] yeah, so he isn't done yet
[13:22:34] yep, that's the one
[13:23:03] do we have some immediate problem? Or just the alert?
[13:23:22] Just the alert
[13:23:25] IIUC just the alert
[13:23:42] btw, that node is primary for all its instances and not a secondary for any.
[13:23:52] It's got 4GB of RAM left, which should be enough to run the OS and hypervisor overhead
[13:27:11] ok, then migrating e.g. an-test-druid1001.eqiad.wmnet to the secondary should fix the alert and not harm anything
[13:27:16] doing so now
[13:27:21] Fri Sep 1 13:27:15 2023 * memory transfer progress: 25.43 %
[13:27:50] Thanks :)
[13:31:44] done, and indeed https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&from=now-15m&to=now&var-server=ganeti1019&var-datasource=thanos&var-cluster=ganeti&viewPanel=4 says we are ok
[14:33:12] XioNoX: are you working on CODFW network stuff ATM? I'm using the makevm cookbook and I see a netbox diff for "Loopbacks EVPN Underlay Codfw"
[14:35:18] topranks: ^ related to the new switches?
[14:38:31] inflatador: em yep, that's a new range I created a short time ago
[14:38:51] 99% certain it's good to proceed, probably that makevm cookbook calls the DNS one?
[14:38:53] is there a diff?
[14:38:54] topranks: cool, will confirm then
[14:39:01] y
[14:39:14] will paste, but it looks OK
[14:40:10] https://phabricator.wikimedia.org/P52242
[14:40:49] inflatador: ah ok, thanks for that
[14:40:52] yeah it's good to go
[14:41:23] v-olans is aware of the issue with netbox/DNS/etc mixed msgs like that, I think he's working on a solution
[14:41:33] I should have realised it would cause that automatic hiera addition, next time I will make sure to run the cookbook myself after the netbox addition
[14:42:03] yeah, and it's the edge cases - like this addition, which would be rare - that I forget exactly what will trigger
[14:42:23] I swear I'm a magnet for it. Almost every time I run the cookbook I get this ;P
[14:43:25] but yeah, hopefully soon we'll be able to provision VMs without crosstalk
[15:21:47] Lately when I do 'puppet cert clean' on puppetmasters I sometimes get an 'Error: header too long' message. Does that ring a bell for anyone?