[07:30:20] <_joe_> Krinkle: I think it's more probable that aqs is not working properly
[07:30:35] <_joe_> also neither aqs nor mw is mainly on k8s, so not sure what the relationship is
[08:20:12] _joe_: I assumed aqs was nodejs and all node was in k8s by now. Do we still have scb hosts or something?
[08:21:07] I haven't checked recently but I recall seeing a few weeks ago that it used a pretty old version of most npm deps, including in-house packages
[08:22:01] Krinkle: o/ there are some dedicated (bare metal) aqs nodes, where the nodejs app + cassandra run
[08:22:25] IIRC there was some work on a new version to be deployed on k8s, but not sure what the status is
[08:22:34] I'd ask the Data Platform's SREs (cc: btullis)
[08:24:47] in addition to aqs, we also use Node outside of k8s for maps, restbase, aphlict (the Phab notification server), and testreduce1001 also has some node-based things
[08:25:24] elukey: ah, they run Cassandra on the host? Interesting. I think I came across aqs hosts at some point as a Cassandra cluster. Did not think it ran node as well
[08:28:33] Krinkle: I think it is a pre-k8s thing, nowadays we wouldn't have a cluster like that
[08:34:52] I think the medium-term plan is to have nodes that run Cassandra run nothing else
[08:36:49] and aqs 2.x on k8s right?
[08:42:15] pass (I come at this from the data-persistence angle, so I know more about simplifying cassandra management than other-bits-of-aqs), sorry
[08:42:21] (but sounds right)
[08:48:29] sure sure :)
[08:48:44] I didn't mean that you had to know it, only whether you'd heard about it :)
[11:48:04] Krinkle: I've added my first-pass thoughts to the ticket, with a description of the plumbing. I agree with .joe. that it's probably AQS misbehaving, but I haven't found a smoking gun yet.
[11:52:16] <_joe_> btullis: mine was just a guess, based on historical tendencies of what might break :)
[11:55:02] Yup. If I were a betting man, I would not bet against it :-)
[12:37:19] Krinkle: regarding that hackathon discussion, it didn't look like packet loss after all IIRC, but rather that the producer (envoy on the mw side) would close the connection in some rare cases due to having reached the connect_timeout limit (250ms). The error rate was rather low, and the other side of the conversation was in k8s. But IIRC we just added 1 retry to the envoy config, which solved the issue as it apparently was some load-related heisenbug (we could replicate it by depooling the host from traffic).
[12:38:18] in this case aqs is outside of k8s, and the lack of attention and care for it throughout the years makes it more probable that it has something to do with AQS itself instead.
[12:38:47] btw, it's just 1 more restbase installation, so it suffers from the exact same problems as restbase.
[12:39:04] nodejs app + cassandra app coexisting and talking to each other on the node.
[14:19:42] morning folks, in the process of upgrading our rec servers this week, progressively rolling out
[14:19:49] if you see any DNS issues, you know who to blame! (me)
[14:19:55] just please let me know as well :)
[14:20:48] akosiaris: ah, I thought your theory was that the timeout was reached as a result of packet loss, because we have no record of the connection ever making it into the actual service being called, and 250ms seems like a long time to spend "just" proxying to and from a k8s pod.
[14:21:29] you probably said something different from packet loss, but I recall that term as it's pretty much the only way I know something can get lost, but maybe there's something else instead.
[14:23:02] akosiaris: the issue with the retry then is that it becomes a very expensive and slow failure mode. 250ms is basically >100% of the latency budget for most mw requests, or maybe 25% for edits. That's a lot to spend in a single parser function for syntax highlighting.
[14:23:22] plus another 1-2ms for the retry.
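A quick back-of-the-envelope sketch of the arithmetic behind that concern. The connect_timeout and retry figures come from the discussion above; the latency budgets (~200 ms for a typical MediaWiki request, ~1 s for an edit) are assumed values chosen to match the rough proportions described, not measured numbers:

```python
# Illustrative only: worst-case cost of a 250 ms envoy connect_timeout plus one retry,
# compared against assumed MediaWiki latency budgets.
CONNECT_TIMEOUT_MS = 250   # envoy connect_timeout mentioned above
RETRY_COST_MS = 2          # assumed cost of the (usually successful) second attempt

def worst_case_overhead_ms(timed_out_attempts: int = 1) -> int:
    """Time burned before the request finally reaches the upstream service."""
    return timed_out_attempts * CONNECT_TIMEOUT_MS + RETRY_COST_MS

# Assumed budgets, picked to mirror the ">100% of a typical request, ~25% of an edit" point.
budgets_ms = {"typical MW request": 200, "MW edit": 1000}
for label, budget in budgets_ms.items():
    overhead = worst_case_overhead_ms()
    print(f"{label}: {overhead} ms overhead = {overhead / budget:.0%} of the budget")
```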
[15:15:09] Just put in a Ganeti VM request. I'm going to start building the VMs now, but please let me know if there will be a problem https://phabricator.wikimedia.org/T341705
[15:22:56] thanks for the approval jbond!
[15:26:11] no probs :)
[15:32:22] so I am mostly curious, but are we doing approvals for VMs now?
[15:32:34] the reason I am asking is because in the past I have just created a task and the VMs as well
[15:33:09] I might have missed the memo on this and hence the question!
[15:36:14] I tend to seek approval for deploying VMs like inflatad.or did, but mainly as a courtesy and to keep a standardised record of the request. Sometimes I receive guidance as to where to locate them or similar.
[15:37:39] I think the relevant doc is https://wikitech.wikimedia.org/wiki/SRE/SRE_Team_requests#Virtual_machine_requests_(Production)
[15:43:21] What b.tullis said. I'm new and trying not to breach protocol (more than I already do by being myself ;P )
[15:49:48] volans: I might have something interesting for you
[15:49:57] last time I asked, the rule was "you can self-serve VMs but please do keep making those request tickets, even if you resolve them yourself, we want them for the paper trail and capacity planning"
[15:50:05] running the reimage cookbook on a Ganeti VM and it keeps on rebooting into the installer
[15:50:08] seen this before?
[15:50:16] 301 slyngs
[15:50:20] so use the template, but then resolve it yourself, if you are just replacing an existing VM for the same role
[15:50:28] if it's a completely new thing, maybe wait for approval
[15:51:15] haha
[15:51:21] slyngs: ^ if you are around, not urgent
[15:51:39] I haven't seen that so far, and the cookbook does set it back to disk
[15:51:52] does it happen just on one or on multiple VMs?
[15:52:02] so far just one
[15:52:22] and you have done other VMs
[15:52:29] previously, yes, no issues with them
[15:52:35] this is the first one I am doing with bookworm though
[15:52:50] checking that
[15:53:24] have you already reimaged other hosts with the same role to bookworm?
[15:53:33] no, this is the first one
[15:53:55] recommends putting them into the 'insetup' role, using the cookbook with bookworm, then changing the role to the prod role
[15:54:13] did it reach the "Host up (Debian installer)" message?
[15:54:26] have you looked at the console?
[15:54:34] yeah
[15:54:38] I am attached to the console
[15:54:42] Host up (Debian installer)
[15:54:42] ----- OUTPUT of 'gnt-instance mod...6001.drmrs.wmnet' -----
[15:54:42] Modified instance durum6001.drmrs.wmnet - hv/boot_order -> disk
[15:55:03] let's see this time
[15:55:43] mutante: interesting, but this worked for bullseye before, just like I am doing for bookworm
[15:55:49] ok
[15:55:58] this time it moved on
[15:56:01] weird
[15:56:01] let's see
[15:57:12] fwiw: checked netboot.cfg, confirmed it has "durum*" which should match them all
[15:57:50] yep, that was the first thing I checked as well
[15:57:54] habit :P
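For context on the boot_order dance above: the reimage cookbook flips the VM back to booting from disk after the installer has run, which is what the gnt-instance output shows. A hedged sketch of checking and resetting that by hand with the standard Ganeti CLI (this is not the cookbook's actual code; it assumes it runs on the cluster's Ganeti master, and the exact flags are worth double-checking against `gnt-instance --help`):

```python
# Sketch, not the sre.hosts.reimage implementation: check a Ganeti VM's boot order
# and set it back to disk if it is still pointing at the network installer.
import subprocess

INSTANCE = "durum6001.drmrs.wmnet"  # example instance taken from the log above

def current_boot_order(instance: str) -> str:
    # Ask Ganeti for the hypervisor boot_order parameter of this one instance.
    out = subprocess.run(
        ["gnt-instance", "list", "--no-headers", "-o", "hv/boot_order", instance],
        check=True, capture_output=True, text=True,
    )
    return out.stdout.strip()

def reset_boot_order_to_disk(instance: str) -> None:
    # Equivalent of the "hv/boot_order -> disk" modification shown in the output above.
    subprocess.run(["gnt-instance", "modify", "-H", "boot_order=disk", instance], check=True)

if current_boot_order(INSTANCE) != "disk":
    reset_boot_order_to_disk(INSTANCE)
```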
[15:57:55] sukhe: maybe the role needs a change to add bookworm support? some package version?
[15:58:15] testable on cloud VPS possibly
[15:58:28] mutante: it seems to have moved on after doing the installation twice for some reason
[15:58:32] let's see how it goes
[15:58:38] and also how the next one goes
[15:58:45] ok, ack, good
[16:01:31] 11:07:52 <+logmsgbot> !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host durum6001.drmrs.wmnet with OS bookworm
[16:01:37] 11:57:59 <+logmsgbot> !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on durum6001.drmrs.wmnet with reason: host
[16:01:49] it also usually doesn't take this long, so yeah
[16:01:51] let's see
[16:02:29] right now sitting doing this:
[16:02:35] > sit back, relax, and enjoy the wait
[16:02:36] :)
[16:02:45] (first Puppet run)
[16:05:48] so the noop run worked, that means that the catalog compiles
[16:22:30] finished successfully, albeit with the package issue, which was expected
[16:22:38] let's see if it happens on the next one
[16:22:44] ack
[16:43:50] okay, I'm a bit confused by this warning here
[16:43:51] https://phabricator.wikimedia.org/T341711#9009652
[16:43:55] Management interface not found on Icinga, unable to downtime it
[16:44:13] does it matter? maybe it got removed
[16:44:40] dbproxy1013 seems to exist in icinga and is quite grumpy at me as well
[16:45:27] I get these as well and ignore them, my reasoning being that it's the management interface and it will be taken care of when the hardware is decommissioned
[16:45:29] you can ignore it, but now that you point it out I want to check one thing
[16:45:49] as the checks might have been migrated to AM, I don't recall if the migration was completed
[16:47:04] thanks!
[16:51:51] the only mgmt on icinga seems to be switches at this point
[16:52:23] https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?search_string=mgmt
[16:52:41] yep, I think the migration was completed and it's all in AM now
[16:53:48] I'll send patches...
[16:54:12] :D
[17:03:54] speaking of patches, I'm working on deploying a net-new service (zookeeper) for the first time. If anyone has any suggestions/example Puppet patches LMK
[17:04:34] (there's already a zookeeper, but we're standing up our own unique cluster for the search update pipeline)
[17:05:40] in a perfect world the module/zookeeper can be used by both and all the config is in profile classes, so you just need to use that module in a new profile
[17:06:36] Amir1: sukhe: I've sent a quick patch because of an error I've seen in the cookbook, but it doesn't 'fix' the issue. For that I'll need to make a patch to spicerack, as the problem is that it's trying to downtime on both icinga and AM, and the assumption was that hosts must exist in icinga and can have checks in AM
[17:07:07] But with the $host.mgmt ones completely migrated to AM, there are no more hosts defined for $host.mgmt in icinga, and hence it fails first, before being able to downtime it in AM
[17:07:17] seeing that there is modules/zookeeper/templates, I am thinking one move at the beginning might be to move it to profile/templates/, as a noop for the existing prod service
[17:07:35] but at the same time I can't fully ignore the icinga error, as it's the only place that protects us from typos and such, since in AM we're blind as to whether we're matching anything or not
[17:07:49] so I'll need to do some refactoring there
[17:08:05] for now I'll send a fix specific to the cookbook so that it does the AM part only
[17:08:47] mutante thanks for the advice, I'll start with that
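To make volans's description above concrete, here is a rough sketch of the intended control flow: downtime the mgmt target in Alertmanager unconditionally (that is where its checks live now) and only talk to Icinga if the target is still defined there. All helper functions are stand-ins, not the real spicerack or cookbook API, and the target name is just an example:

```python
# Hypothetical sketch of the control flow described above; every helper is a stub.
from datetime import timedelta

def alertmanager_downtime(target: str, duration: timedelta, reason: str) -> None:
    print(f"AM: silenced {target} for {duration} ({reason})")        # stub

def icinga_knows_host(target: str) -> bool:
    return False  # stub: mgmt hosts are no longer defined in Icinga after the migration

def icinga_downtime(target: str, duration: timedelta, reason: str) -> None:
    print(f"Icinga: downtimed {target} for {duration} ({reason})")   # stub

def downtime_mgmt(target: str, duration: timedelta, reason: str) -> None:
    # Downtime in Alertmanager first and unconditionally: the mgmt checks live there now.
    alertmanager_downtime(target, duration, reason)
    # Only touch Icinga when the target is actually defined there, instead of the old
    # assumption that every host exists in Icinga (which made the mgmt case fail early).
    if icinga_knows_host(target):
        icinga_downtime(target, duration, reason)

downtime_mgmt("dbproxy1013.mgmt.eqiad.wmnet", timedelta(hours=2), "host maintenance")
```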
[17:09:18] volans: thanks!
[17:09:18] I think, given that it was just this specific warning, I was mostly ignoring it since I had seen no other issues
[17:09:21] but thanks
[17:14:03] yw :) both patches sent https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/937509 and the previous one in the chain. FYI I'll be off the rest of the week, so if someone is able to merge and test them that would be great
[17:16:03] * volans logging off
[20:29:34] hi all, anyone know what profile::ipmi::mgmt is used for? the only thing it really does is install /usr/local/sbin/ipmi_mgmt (modules/ipmi/files/ipmi_mgmt.sh). it's used by the puppetmaster and cumin host.
[20:29:53] however I don't see the command being used in the puppet repo, spicerack or any cookbooks
[20:30:03] suspect it may be something legacy?
[20:31:37] robh: papaul: wonder if this is perhaps used by dc-ops?
[20:36:03] I don't offhand recall its use anywhere
[20:36:40] ack thanks robh
[20:52:20] unrelated: I just noticed a pcc feature I forgot about. if you use the ./utils/pcc interface you can pass `-f | --fail-fast` and the pcc job will bail out on the first failure. useful to weed out any simple issues
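For illustration of what a flag like that does, a toy sketch of a fail-fast loop in a pcc-like wrapper; this is made-up argparse code under assumed names, not the actual ./utils/pcc implementation:

```python
# Toy sketch of a --fail-fast option in a pcc-like wrapper (not the real utils/pcc).
import argparse
import sys

def compile_catalog(host: str) -> bool:
    """Stand-in for submitting a host to the puppet compiler and checking the result."""
    return not host.startswith("broken")   # stub result for the example

def main() -> int:
    parser = argparse.ArgumentParser(description="toy pcc-like wrapper")
    parser.add_argument("hosts", nargs="+")
    parser.add_argument("-f", "--fail-fast", action="store_true",
                        help="bail out on the first host whose catalog fails to compile")
    args = parser.parse_args()

    failed = 0
    for host in args.hosts:
        if compile_catalog(host):
            print(f"{host}: OK")
            continue
        print(f"{host}: FAIL")
        failed += 1
        if args.fail_fast:
            break   # stop early instead of compiling the remaining hosts
    return 1 if failed else 0

if __name__ == "__main__":
    sys.exit(main())
```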