[07:47:13] I've replied on task...
[07:59:36] hi folks! As part of my attempt to break kafka, I am slowly migrating kafka-main nodes to PKI certs
[08:00:00] as FYI for the on-callers --^
[08:00:43] if you see a client that complains (hopefully not but it may happen), the 99% probable reason will be that it uses a CA bundle containing only the puppet cert
[08:15:34] folks I was trying to reimage kafka-test1006, and the first attempt didn't work, afaics the host booted again in buster (it is a ganeti VM) but with no net interface working, and I discovered that /etc/network/interfaces mentioned "ens5" while ifconfig showed "ens13"
[08:16:07] fixed the interfaces file, ifup, and the next reimage seems to be working
[08:16:25] has it ever happened to you that a ganeti vm interface was renamed like this?
[08:17:17] elukey: is that T273026 ?
[08:17:17] T273026: Errors for ifup@ens5.service after rebooting Ganeti VMs - https://phabricator.wikimedia.org/T273026
[08:17:18] or maybe systemd is making fun of me
[08:17:54] volans: mm not sure, in that case the interface was up after booting, it was just a race condition causing a failed unit
[08:18:00] IIRC
[08:18:51] at which stage did the reimage fail?
[08:19:17] it's a different issue from T273026
[08:19:24] it didn't boot into d-i, it kept waiting for it, so I went into the ganeti console and found the issue
[08:19:39] we've seen that before a few times, from the point of view of the Linux kernel
[08:19:55] the "PCI bus slot" in the VM changed
[08:20:11] the root cause is unknown
[08:20:31] and it's not really a consistent phenomenon either
[08:20:38] it might be related to the fact
[08:20:55] that for some old VMs we changed the KVM machine type after their creation
[08:21:12] ahh okok
[08:21:19] so I am not crazy
[08:21:39] on paper we still use pc-i440fx
[08:21:57] which is a quite old mainboard well supported in practically all OSes
[08:22:08] but there's a version tag attached to it by QEMU
[08:22:22] so e.g. pc-i440fx-5.2 for QEMU 5.2
[08:22:38] even though there are no material changes in QEMU to this machine type
[08:22:52] and to ensure an upgrade path between QEMU releases
[08:23:18] since otherwise we could not migrate a VM created with QEMU 6 to a node running QEMU 5
[08:23:34] we configure specifically 5.2 for all VMs
[08:24:07] and my hunch is that this issue triggers on old VMs which were once created with plain "pc-i440fx"
[08:24:18] perfect yes.. would this change be triggered by a simple reboot of the VM?
[08:24:22] and then we changed the default to "pc-i440fx-5.2"
[08:24:29] they haven't been rebooted in more than 100 days
[08:24:35] and with the reimage this then appears as a different OS
[08:24:50] it only happens on reimages when you
[08:25:04] install a new OS on an existing QEMU environment
[08:25:21] so these will phase out slowly over time as we decom and re-create
[08:25:55] Fix is essentially https://phabricator.wikimedia.org/T272555#6768676
[08:25:56] ok but what fails then?
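
For reference, the manual fix elukey describes above is roughly the following, run from the VM's Ganeti console; the ens5/ens13 names are the ones from this particular VM and will differ elsewhere, so treat this as a sketch rather than a recipe:

    # see what name the kernel actually gave the NIC after the machine-type change
    $ ip -br link
    # compare with what /etc/network/interfaces still expects
    $ grep ens /etc/network/interfaces
    # rename the interface in the config (ens5 -> ens13 in this case) and bring it up
    $ sed -i 's/ens5/ens13/g' /etc/network/interfaces
    $ ifup ens13
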
IIUC the vm didn't even go into d-i, something else failed and (I guess) the VM tried to boot into buster
[08:26:04] but with ens13
[08:27:04] elukey: for kafka-test1002 maybe try the following:
[08:27:20] - reboot on the QEMU level using sre.ganeti.reboot-vm
[08:27:39] - if this shows the interface rename, fix up e-n-i
[08:27:53] - following that the reimage should work
[08:28:10] perfect yes makes sense
[08:28:16] if not, happy to have a closer look
[09:33:28] moritzm: tried the reboot with kafka-main1007, renamed s/ens5/ens13/, and it booted fine
[09:33:37] so this validates your theory
[09:33:38] :)
[09:34:36] moritzm: let me know when you have 1 minute to coordinate followup to https://phabricator.wikimedia.org/T306602
[09:35:37] sure, we can do it now
[09:35:47] elukey: great :-)
[11:23:45] slyngs: ok to merge your puppet change?
[11:23:55] Yes, sorry
[11:24:02] ok merging
[11:24:06] Thanks
[14:34:13] SRE on-calls: o/ I've completed the rollout of PKI certs to kafka main eqiad, the codfw cluster is still running the old Puppet certs. Lemme know if you see anything weird :)
[14:34:45] elukey: thanks for the heads up
[15:03:32] hey folks while checking metrics I noticed something weird about eventgate-main (that uses kafka main eqiad)
[15:03:35] https://grafana.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&var-service=eventgate-main&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=thanos&var-site=eqiad&from=now-2d&to=now
[15:03:52] I checked the config and it uses the wmf-ca-certificates bundle
[15:04:12] but the drop looks a little weird
[15:05:35] ottomata: --^
[15:06:15] in the pod logs I see "Encountered rdkafka error event: broker transport failure" but nothing more
[15:08:39] I am going to roll restart the pods, let's see if it is a temporary weirdness of the rdkafka consumer
[15:14:42] ah wait ok, https://sal.toolforge.org/log/2a1fTIcBxE1_1c7siaie
[15:15:01] akosiaris: IIUC active/active services are still depooled from eqiad right?
[15:15:22] elukey: no
[15:15:27] they are pooled
[15:16:51] ah I see your user was not the one repooling them, okok I misread the sal
[15:17:00] but the drop matches very well the depool
[15:17:08] at least afaics
[15:32:04] ah snap it seems changeprop
[15:32:05] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&from=now-2d&to=now&var-dc=eqiad%20prometheus%2Fk8s
[15:33:45] hnowlan: o/
[15:33:56] do you have a minute to verify with me if I am crazy or no?
[15:33:58] *not?
[15:34:13] yoyo
[15:34:31] :)
[15:35:00] so on the 4th at around 12:30/13:00 UTC I see a big drop in processing events in changeprop eqiad
[15:35:13] that matches when we depooled services for the row C maintenance
[15:35:55] (need to step afk for a few, will be back in a bit)
[15:37:13] elukey: yeah looks like there's a backlog building up
[15:37:20] changeprop codfw took over some jobs but not all
[15:38:53] maybe, not clear :/
[15:45:19] Dunno, looks like the resource_change topic switched over as expected actually
[15:45:29] the drop makes sense
[16:00:09] here I am sorry
[16:01:25] hnowlan: worth it to roll restart changeprop in eqiad?
[16:03:02] elukey: can do
[16:03:56] but does it not make sense that the resource_change has switched over to codfw?
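
As a side note on the PKI migration announced earlier: if a client does turn out to have the stale-CA-bundle problem elukey warned about, one quick manual check is to verify a broker's certificate chain against the new bundle. The broker hostname, TLS port (9093) and bundle path below are illustrative assumptions, not taken from the conversation:

    # verify that the broker's new PKI certificate chains against the wmf CA bundle
    $ openssl s_client -connect kafka-main1007.eqiad.wmnet:9093 \
        -CAfile /etc/ssl/certs/wmf-ca-certificates.crt </dev/null 2>/dev/null \
      | grep -E 'issuer=|Verify return code'
    # "Verify return code: 0 (ok)" means the chain validates against that bundle
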
[16:20:03] sorry last meeting :)
[16:20:31] hnowlan: yeah in theory yes, but afaics it seems that it switched when we depooled eqiad via cookbook, that doesn't make any sense
[16:20:54] (depooled for network maintenance)
[16:28:44] hnowlan: for example, eqiad.resource-purge dropped as well
[16:30:19] going to roll restart the pods just in case, so we don't think about it anymore
[16:30:42] oh I did that earlier
[16:30:54] ah snap didn't get it
[16:31:04] no big deal
[16:31:20] ah yes now I see the metrics popping up
[16:31:58] yeah, that stupid bug :/
[16:32:08] I assume the disparity relates to the mw rw api being pooled in codfw but not eqiad
[16:32:13] trying to unravel the jobs though
[16:37:41] hnowlan: we can probably restart tomorrow, nothing seems really on fire, but it looks weird
[16:38:00] eqiad.resource-purge is still showing some events
[16:38:06] but way less than before
[16:38:25] so it seems as if, for some reason, changeprop was processing a ton of things in eqiad before the depool
[16:39:58] yeah something is a bit weird with that
[16:40:00] It wasn't from 28/02 until 14/03, then it was again
[16:40:03] 0 makes sense
[16:40:07] but *some* doesn't
[16:40:17] This is weird
[16:40:36] also check this
[16:40:37] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1680600460342&to=1680624554781&var-dc=codfw%20prometheus%2Fk8s&viewPanel=8
[16:40:57] some rules in changeprop-codfw got way busier at the same time
[16:41:57] mostly resource-change related
[16:42:36] the dip in on-transclusion-update-change-prop-transcludes-resource-change is worrying
[16:43:21] most other rules look ok
[16:44:24] should restbase-async be active/active? I don't recall
[16:44:38] restbase-async Active/Active pooled pooled
[16:45:10] yes yes, I recalled that it ran on one DC only in the past, this is why I asked
[16:45:10] According to the service catalog, yes
[16:45:15] oooh
[16:45:17] yeah
[16:45:19] good catch elukey
[16:45:32] it should have been depooled from the primary dc
[16:45:42] It's ok that it's pooled in both iirc
[16:45:56] But we switch it to the secondary so it doesn't slow down the primary
[16:46:07] Let me do that real quick
[16:47:16] done
[16:47:44] waiting for ttl now, as usual
[16:48:35] super thanks!
[16:49:00] oooh
[16:49:02] there's a difference
[16:49:06] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=now-15m&to=now&var-dc=eqiad%20prometheus%2Fk8s
[16:49:19] yeah I was about to say
[16:49:25] That's a found rabbit :p
[16:49:40] because I noticed that restbase traffic dropped in eventgate at the same time
[16:49:43] so I wondered
[16:50:02] okok makes more sense now
[16:50:14] hnowlan: --^
[16:50:14] good catch both of you! <3
[16:50:54] so nothing really happened, it was mostly running in both dcs etc..
[16:51:02] but the graphs looked weird :)
[16:51:23] SRE: We make the graphs tidy
[16:51:28] :P
[16:51:34] :D
[16:51:57] thanks all for the support! I think that it is the time to stop typing, see you tomorrow :)
[16:52:33] o/
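
For context, the "depool restbase-async from the primary DC" step done at 16:46 is normally a one-liner with conftool; the syntax below is from memory and the DC name is only illustrative (codfw was the primary at the time), so double-check before copy-pasting:

    # show the current pooled state of the discovery record
    $ confctl --object-type discovery select 'dnsdisc=restbase-async' get
    # depool it from the primary DC so the async work only runs in the secondary
    $ confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false
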
[20:26:29] If no one objects, I will depool sessionstore in eqiad in the next 10 minutes or so to conduct some experiments (see: T327954)
[20:26:29] T327954: session storage: 'cannot achieve consistency level' errors - https://phabricator.wikimedia.org/T327954
[20:27:30] cjming: fyi as you’re deploying ^
[20:28:04] also cc denisse, mutante ^
[20:28:08] scap is hanging for me
[20:28:33] should i just hang on or is there something i can do?
[20:29:17] err...
[20:29:17] that depool would be in codfw, not eqiad :/
[20:29:47] i'm deploying on deploy2002.codfw.wmnet
[20:30:01] cjming: hanging on which step?
[20:30:37] taavi: the initial scap cmd "scap backport xxxxxx"
[20:30:41] cdanis: ACK, looking.
[20:31:25] cjming: you sure? looking at the processes on deploy2002 I see a scap backport process for your user that's in the docker image build stage
[20:32:04] oh - sorry - it's hanging after i entered y at the prompt "Backport the changes?"
[20:32:53] what is the last output you see from it?
[20:33:13] so this is the last output: "20:24:13 K8s images build/push output redirected to /home/cjming/scap-image-build-and-push-log"
[20:33:23] ok, yes, you are at the image build stage
[20:33:29] that taking a while is expected
[20:33:46] well, not expected, but it sometimes does it and it's normal at least for now
[20:33:57] oh - i haven't deployed in a while - my memory of it is that it shouldn't take this long?
[20:41:33] does anyone (that's around) know: should sessionstore.discovery.wmnet be pointing to eqiad right now?
[20:43:12] I assume that is to blame for the rather uniform looking traffic split (https://grafana-rw.wikimedia.org/d/000001590/sessionstore), as opposed to the typical split between primary/secondary data centers
[21:03:51] urandom: as an active/active service it should follow the pooled/depooled state, so normally when both are pooled it should resolve to eqiad in eqiad and resolve to codfw in codfw
[21:04:08] oh.
[21:04:17] which tracks what I see atm:
[21:04:17] rzl@cumin1001:~$ dig +noall +answer sessionstore.discovery.wmnet
[21:04:17] sessionstore.discovery.wmnet. 266 IN A 10.2.2.29
[21:04:17] rzl@cumin2002:~$ dig +noall +answer sessionstore.discovery.wmnet
[21:04:17] sessionstore.discovery.wmnet. 272 IN A 10.2.1.29
[21:04:41] that makes sense, I should have looked at this, but then why is it currently so evenly split? that's not...typical
[21:05:31] normally the "inactive" datacenter only sends a small amount of traffic
[21:07:03] I don't know sessionstore patterns super well -- is that still expected to be true now that mediawiki is active/active for reads?
[21:07:27] it's been active/active for a while, hasn't it?
[21:07:29] yeah
[21:07:33] well before the switchover
[21:07:38] definitely
[21:07:47] this is recent, since the 14th
[21:08:22] naively I'd expect reads from both DCs to be talking to sessionstore, but-- oh! that explains it, the 14th is when we repooled MW eqiad for reads
[21:08:26] when we switched to codfw, everything went there (expected), when we made eqiad active for sessionstore, we ended up with ... well, a larger share than codfw actually
[21:08:43] so "since the 14th" is the normal active-active pattern
[21:09:08] so it's the result of being active/active while codfw is the "primary"?
[21:11:55] yep, although I think you'd see about an even split regardless of which DC was primary
[21:12:17] the primary DC only matters for writes, and there aren't enough of them to put a dent in the stats
[21:16:31] hmmmm, it sure looks like you're right and I'm wrong though! https://w.wiki/6YEN
[21:17:03] so something changed, either intentionally or otherwise, in the course of the switchover -- it *might* just be which DC is primary or it might be something else about traffic routing to sessionstore
[21:17:37] (here's that same graph) https://usercontent.irccloud-cdn.com/file/hktwh2ra/image.png
[21:25:24] rzl: auh, yes, that illustrates it nicely
[21:25:38] heh, so yes, THAT is what I mean :)
[21:26:01] yeah, sorry to take a minute but I'm with you! I'm not sure where that's coming from
[21:26:11] I'm less gung-ho to depool a data-center and experiment when so many more clients are affected
[21:27:03] well, if you depool sessionstore in eqiad and wait out the discovery DNS TTL, all that traffic should get routed to codfw and you can experiment away
[21:27:15] and, if it *doesn't* all get routed there, that will be extremely good to know
[21:28:03] we'd pick up the inter-DC latency for sessionstore obviously, but this ought to be a supported operation
[21:29:29] as long as you eyeball that graph again and it does all shift after the depool, IMO you shouldn't need to worry about how much traffic you moved
[21:53:32] rzl: yeah, it's the increased latency that lowers my enthusiasm
[21:53:48] nod, fair enough
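
Assuming the same conftool pattern sketched earlier, the depool-and-verify sequence urandom and rzl are discussing would look roughly like this; the command syntax is from memory and the IP addresses are simply the ones from rzl's dig output above:

    # depool sessionstore in eqiad; codfw stays pooled
    $ confctl --object-type discovery select 'dnsdisc=sessionstore,name=eqiad' set/pooled=false
    # once the discovery DNS TTL has expired, an eqiad resolver should return the
    # codfw address (10.2.1.29 above) instead of the eqiad one (10.2.2.29)
    $ dig +noall +answer sessionstore.discovery.wmnet
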