[08:44:29] morning
[08:45:22] o/
[09:12:14] dcaro: thx for the ceph HA doc, useful info ! I left a few comments
[09:12:36] XioNoX: awesome :), thanks!
[10:33:07] 👀
[11:59:09] topranks: could you please take a look here when you have a moment? https://gitlab.wikimedia.org/repos/cloud/cloud-vps/tofu-infra/-/merge_requests/132
[12:01:42] arturo: sure I'll go through it later today
[12:01:49] thanks!
[14:27:11] * arturo food break
[14:38:25] blancadesal: I'm still struggling with the parts of my brain that were erased during sabbatical. iirc I had talked to someone (I think you?) about getting a striker dev environment set up in order to take over some striker updates, that was you wasn't it? And if yes, are you still available to work on that?
[14:48:16] andrewbogott: is there any reason why we cannot or should not migrate DNS zones in "noauth-project" to "cloudinfra"?
[14:48:21] I'm thinking in particular of the zone "db.svc.eqiad.wmflabs."
[14:49:00] no, if you want to do the work of migrating them that would be great.
[14:49:28] it contains only a handful of records, and they are legacy ones that should eventually be deleted
[14:49:46] I want to make them cnames to the canonical ones instead of vice versa
[14:49:52] sure
[14:50:03] Are names for new VMs still landing there?
[14:50:20] (if you have the patience you can document while you work on https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/DNS)
[14:54:13] andrewbogott: I'm pretty sure no new VMs are landing there, but there are a few records that are still actively used
[14:54:50] dhinus: ok, that's good :) Are you thinking about migrating /everything/ out of there (including VM entries) or just the svc domains?
[14:55:00] it's a bit messy, I think I will make an umbrella task for all of "noauth-projects", and then subtasks for each zone
[14:55:31] I would like to migrate the "db.svc" zone now (which is separate from the "svc" one), but the other ones can wait
[14:57:49] bd808: I'm looking at T369308 and noting that the codfw1dev striker database is also on that server. Of course codfw1dev striker hasn't worked in ages... do you have any intuition about whether there's valuable data there? (And I guess the larger question is, 'should we just give up on ever having codfw1dev striker?')
[14:57:50] T369308: Decommission clouddb2002-dev.codfw.wmnet - https://phabricator.wikimedia.org/T369308
[14:58:43] andrewbogott: You're brain is not wrong, that was me :) How time-sensitive is it? There are just a couple weeks left before the offsite to wrap up the Q, then I'm essentially out until the second week of January.
[14:58:49] your*
[15:01:01] andrewbogott: PTR records for new VMs are still landing in the zone "16.172.in-addr.arpa." :( but the other zones should be "simple" to migrate
[15:01:26] that could definitely use some cleanup :(
[15:02:22] the easiest cleanup would be to declare "wmflabs" is no longer supported and nuke everything :D but I'm sure too many people/tools are still using it
[15:02:40] I'm making a task
[15:02:42] blancadesal: It's not super urgent although now we have at least one user-facing issue waiting on a striker update T380384
[15:02:45] T380384: [toolsadmin] Striker cannot create Developer accounts with names matching existing SUL accounts - https://phabricator.wikimedia.org/T380384
[15:03:10] dhinus: I was kind of hoping that the buster deprecation killed off the last of the wmflabs VMs but probably the dates didn't quite line up
[15:05:52] there are not that many left, about 10. plus the "svc" records
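A minimal sketch of the kind of pre-flight check the migration above implies: before the legacy names in db.svc.eqiad.wmflabs. are flipped to CNAMEs pointing at the canonical records, confirm that each legacy name and its canonical counterpart currently resolve to the same addresses. The hostname pair below is illustrative only (the real list lives in the designate zone), and the sketch assumes the dnspython library is available.

```python
import dns.resolver

# Illustrative legacy -> canonical pairs; the real record list lives in the
# db.svc.eqiad.wmflabs. zone in designate.
PAIRS = {
    "tools.db.svc.eqiad.wmflabs.": "tools.db.svc.wikimedia.cloud.",
}

for legacy, canonical in PAIRS.items():
    legacy_ips = sorted(r.to_text() for r in dns.resolver.resolve(legacy, "A"))
    canonical_ips = sorted(r.to_text() for r in dns.resolver.resolve(canonical, "A"))
    status = "OK" if legacy_ips == canonical_ips else "MISMATCH"
    print(f"{status} {legacy} {legacy_ips} vs {canonical} {canonical_ips}")
```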
plus the "svc" records [15:06:56] oh so close [15:08:20] andrewbogott: I see :/ at some point we will have to deal with striker... would you be okay with this waiting until after xmas et al. ? [15:15:03] yeah, I think that's Ok. it'll lkely take a long ramp up in any case [15:20:19] 👍 [15:35:05] Anyone else unable to connect to VMs on ssh? [15:36:42] working for me -- what VM specifically? [15:36:44] Rook: i can login but it's very slow [15:37:02] Quarry and paws seem mostly inaccessible, web and ssh, though hypervisors and horizon seem alright. [15:37:04] wikistats-bookworm.wikistats.eqiad1.wikimedia.cloud is playing up for me and toolschecker went off [15:37:35] I may be seeing the same as RhinosF1, login.toolforge.org is trying to connect but hasn't yet [15:38:01] Oh things may be starting back up... [15:38:04] i eventually got in but it was a lot slower than normal [15:38:20] Rook: https://phabricator.wikimedia.org/T380489 ? [15:40:49] Yeah it is seeming like things are very slow, or intermittently accessible [15:41:12] Quarry loads now, PAWS still is having trouble starting servers, though things are a little more responsive on ssh now [15:41:36] oh, that is not nice, looking into ceph [15:42:28] ceph looks to me like it's happy now but was briefly hung up [15:43:01] it's still complaning about losing ping [15:43:03] *pings [15:43:33] lots of errors on switches [15:44:00] Ceph cluster in eqiad has 53 slow ops <-- alert [15:44:07] hmm... got kicked out of irc for a minute [15:44:09] https://usercontent.irccloud-cdn.com/file/CVGC62SL/image.png [15:44:19] d5-e4 connectivity [15:44:23] topranks: ^ [15:44:47] there was a network outage [15:44:48] https://usercontent.irccloud-cdn.com/file/h0kSSELy/image.png [15:44:58] (as in ceph nodes lost contact with each other network-wise at least) [15:45:05] looks more like faulty equipment than "outage" but yes [15:45:40] yep, I mean like a network going down for a few min [15:46:26] topranks: can you check the switch logs? it seems it's back online [15:46:52] there's only 1 slow op left on the cluster [15:46:55] (it's recovering) [15:46:57] seems to be recovering [15:47:52] checking it's showing on the 'inbound fcs errors' on this dashboard: [15:47:52] https://grafana-rw.wikimedia.org/d/f61a7d56-e132-44dc-b9da-d722b11566cf/network-totals-by-site?orgId=1&refresh=30s&var-site=eqiad%20prometheus%2Fops [15:48:01] cloudsw1-d5-eqiad et-0/0/52 [15:48:49] https://grafana.wikimedia.org/goto/iO3lhOnHR?orgId=1 [15:48:54] seems to have stopped which is weirder [15:49:08] usually these things start and that's that - faulty optic [15:49:12] I wonder if there is anything on the switch logs [15:49:29] like a crash of the data plane or anything [15:49:39] nah [15:49:48] it's either the optics or the fibre [15:49:57] maybe someone working there bending fiber and it bent back [15:50:51] it's a physical layer issue anyway, either transmitting or receiving optic or the fibre itself (though ususally fibre problem just means it goes dead) [15:51:16] I see these logs in librenms [15:51:18] https://usercontent.irccloud-cdn.com/file/Ro6nwELG/image.png [15:52:47] andrewbogott: I'm not sure the Striker in codfw1dev ever worked, but maybe t.aavi did get it running at some point. I don't have any particular attachment to it. [15:53:24] bd808: ok, I think I will archive and switch off the db and let striker die. 
[15:53:30] we had some slow operations on ceph also last week, though it was way lower, and it had come back up by the time I looked so I did not follow up
[15:53:32] but https://grafana.wikimedia.org/d/5p97dAASz/network-interface-queue-and-error-stats?orgId=1&var-site=eqiad%20prometheus%2Fops&var-device=cloudsw1-d5-eqiad&var-interface=et-0%2F0%2F52&viewPanel=43&from=1731974400000&to=1732060799000
[15:53:46] https://usercontent.irccloud-cdn.com/file/fh8Zb9iG/image.png
[15:53:53] something also happened on the switch then
[15:55:22] (it was actually earlier this week, on the 19th)
[15:56:01] same link
[15:56:07] it failed today as well as showing the errors
[15:56:12] pattern was ramp up of errors
[15:56:16] then link fully down
[15:56:20] link restoration
[15:56:22] some more errors
[15:56:25] then errors clearing
[15:56:37] that time it was not long enough to cause issues though
[15:56:42] Nov 21 15:33:36 cloudsw1-d5-eqiad fpc0 Local fault detected on port 65 (et-0/0/52)
[15:56:42] Nov 21 15:36:01 cloudsw1-d5-eqiad l2cpd[17035]: LLDP_NEIGHBOR_UP: A neighbor has come up for interface et-0/0/52. Now, this interface has 1 neighbor/s .
[15:57:06] 6-8 mins last time yeah
[15:57:08] topranks: what would you suggest we do to correct/prevent? replace the optics?
[15:57:25] yeah that's all you can really do, I'm just checking to see if either side shows particular signs of an issue
[15:57:27] over-heating perhaps
[15:57:29] and the fpgs maybe? (if the optics were broken, they would not come up?)
[15:58:24] toolforge is still alerting even though ceph seems to have recovered
[15:59:14] andrewbogott: I don't see the alert on alert.w.o, where do you see it?
[15:59:15] andrewbogott: I don't see any toolforge alert at the moment
[15:59:42] nothing jumping out in the stats for them
[15:59:43] https://librenms.wikimedia.org/device/device=241/tab=port/port=25169/view=transceiver/
[15:59:50] https://librenms.wikimedia.org/device/device=185/tab=port/port=23290/view=transceiver/
[16:00:28] dcaro: no, faulty optics regularly cause these "brown out" issues :(
[16:00:46] ok, I might just be seeing an old page
[16:01:10] in theory it could be the SERDES and other parts they connect to in the switch, but ultimately that's all internal backplane of the switch, so if it's not them it's a matter of replacing the device
[16:01:15] topranks: what if the connector overheats? I remember having some beefy ones that needed a fan of sorts (back in the day)
[16:01:20] 99.9% of the time this kind of thing is the modules
[16:01:36] yeah often when this happens you'll see correlated temp rise over time
[16:01:41] ack
[16:01:45] which is why I checked here, might point to one or the other side
[16:01:47] but temps are fine
[16:01:58] so we should replace both ends
[16:02:58] ack, can it be done without downtime? (draining the full rack is going to take some time)
[16:03:57] yeah it can be done anytime
[16:04:17] for now I'll shift the instances traffic on to the C8<->E4 link to make that the active path for that
[16:04:31] awesome :)
[16:09:17] the question is should we disable that link completely or not for now?
[16:09:44] it's probably safest, these things only tend to get worse not better
[16:11:15] let me take a quick look actually
[16:11:29] sounds good to me, there's no impact on the traffic right? (well, only the lack of HA until fixed)
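The Grafana dashboards linked above are backed by Prometheus, so the same interface error signal can be pulled programmatically when watching whether the faulty link acts up again. The endpoint, metric name, and label names below are assumptions to be adapted to the actual ops Prometheus setup; only the device and port come from the log.

```python
import requests

# Hypothetical Prometheus endpoint and metric/label names; adjust to the real setup.
PROMETHEUS = "http://prometheus.example.org/api/v1/query"
QUERY = 'rate(ifInErrors{instance="cloudsw1-d5-eqiad", ifName="et-0/0/52"}[5m])'

resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    timestamp, value = series["value"]
    print(series["metric"].get("ifName", "?"), value, "errors/s")
```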
[16:11:59] hey, I noticed netbox complaining of cloudcephosd1025 being active there but not on puppet FYI
[16:12:16] no worries if it is a temporary thing, just in case you hadn't noticed
[16:12:52] or maybe it has been decommed/is being serviced and can be marked as failed/decommed, etc.
[16:13:27] jynus: thanks, it's temporarily out, we sent the hard drives to dell for some investigation
[16:13:46] can I mark it as failed?
[16:14:17] jynus: what does that entail? (it's not really failing, but if that helps sure)
[16:14:35] it is just inventory, it helps clear the dashboard of unexpected things
[16:14:48] aka makes the dashboard go green for SREs
[16:15:01] go for it then, I'll add a note on the task to change the state when it comes back
[16:15:12] so it is just editing it on netbox following the server lifecycle policy
[16:15:32] https://netbox.wikimedia.org/dcim/devices/3980/
[16:15:46] it helps us not forget about servers :-D
[16:16:16] thanks!
[16:16:20] dcaro: actually instead of shutting it down I just changed the BGP policy across that link
[16:16:23] these are the docs: https://wikitech.wikimedia.org/wiki/Server_Lifecycle#Active_-%3E_Failed
[16:16:46] so it won't be used - but it'll stay up, and should we get really unlucky and the C8 -> E4 link has an issue, it'll get used again
[16:16:54] don't worry about forgetting, if it goes back to puppet and it is failed, it should alert again
[16:17:22] thanks for your help!
[16:19:15] topranks: awesome, thanks :)
[16:25:46] ok traffic is completely off that link
[16:25:49] https://grafana.wikimedia.org/goto/YiF3AdnHR
[16:26:33] spikes of 250kpps over it, so the 8kpps is over 3% loss
[16:26:44] which is next to unusable tbh
[16:32:31] I created https://phabricator.wikimedia.org/T380503
[16:34:47] the rest of the links seem ok, the links to C8 are all busier since the change, but peak is ~10Gbps and they are all 40G links
[16:40:56] thanks for checking all that
[16:41:15] topranks: ack thanks!
[16:57:47] * arturo offline
[17:42:01] andrewbogott: not sure if it's the right process: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1093968
[17:43:42] assuming that tag exists and is built, that seems fine although typically I would deploy to codfw1dev first
[17:44:03] it's a two-patch stack, one for codfw, the last for all
[17:44:43] oh, so I see! Looks right then
[17:45:11] I'll have to go in 15 min, so if anything breaks you might have to pick up the pieces :/, is that ok?
[17:45:27] (I can leave it to you if you prefer, so you are not rushed in case it fails)
[17:45:46] I'll do the deploys so I'm watching when it breaks :)
[17:46:16] ack, okok, I'll leave it for you :), let me know how it goes (added notes in the task on how I got there)
[17:48:22] * dcaro off cya!
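On the NetBox state change discussed above (marking cloudcephosd1025 as failed while its disks are with Dell): jynus made the edit by hand per the Server Lifecycle doc, but for reference the same change can also be made through the NetBox REST API. The token is a placeholder, the device ID is taken from the link in the log, and the exact payload should be treated as an assumption to verify against the NetBox version in use.

```python
import requests

# Placeholder token; device ID 3980 comes from the netbox link in the log above.
NETBOX_URL = "https://netbox.wikimedia.org/api/dcim/devices/3980/"
HEADERS = {"Authorization": "Token <redacted>", "Accept": "application/json"}

# "failed" is one of NetBox's standard device status values.
resp = requests.patch(NETBOX_URL, headers=HEADERS, json={"status": "failed"}, timeout=10)
resp.raise_for_status()
print(resp.json()["status"])
```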