[06:24:58] I have restarted sirenbot as it was disconnected
[06:25:55] !oncall-now
[06:25:56] Oncall now for team SRE, rotation batphone:
[06:25:56] A.mir1, c.white, s.lyngs, l.mata, x.ionox, r.zl, c.danis, m.arostegui, s.ukhe, q.uestion_mark, v.gutierrez, m.oritzm, a.pergos, v.olans, a.okoth, r.obh, e.ffie, _.joe_, a.kosiaris
[07:34:58] I just realised phabricator master is on row A
[07:35:09] I am going to try to see if I can switch it before the maintenance
[08:18:46] I am going to switch over the m3 (phabricator) database master; phabricator will be read-only for around 1 minute
[08:21:25] All done
[09:20:20] wikikube eqiad upgrade has started btw, already reimaging etcd nodes
[09:35:43] we got paged
[09:51:07] Was it eventgate-analytics on wikikube that p.aged?
[09:51:07] btullis: yes
[09:51:07] it just paged again
[09:51:07] marostegui: ack it please?
[09:51:07] it is acked
[09:51:07] I am pretty sure I scheduled 24h downtime for the thing...
[09:51:07] yup, I did
[09:54:30] <_joe_> !incidents
[09:54:30] 3450 (RESOLVED) ProbeDown (ip4 probes/service eqiad)
[09:54:30] 3449 (RESOLVED) [7x] ProbeDown (ip4 probes/service eqiad)
[09:54:31] 3446 (RESOLVED) PHPFPMTooBusy parsoid (php7.4-fpm.service codfw)
[09:54:43] <_joe_> akosiaris: btw you can just use sirenbot to ack stuff
[10:12:21] Docs for that are here: https://wikitech.wikimedia.org/wiki/Vopsbot
[10:15:51] !incidents
[10:15:51] 3451 (RESOLVED) [2x] ProbeDown (ip4 probes/service eqiad)
[10:15:51] 3450 (RESOLVED) ProbeDown (ip4 probes/service eqiad)
[10:15:52] 3449 (RESOLVED) [7x] ProbeDown (ip4 probes/service eqiad)
[10:15:52] 3446 (RESOLVED) PHPFPMTooBusy parsoid (php7.4-fpm.service codfw)
[10:35:06] !sing
[10:35:06] Never gonna give you up
[10:35:07] Never gonna let you down
[10:35:08] Never gonna run around and desert you
[10:35:09] Never gonna make you cry
[10:35:09] Never gonna say goodbye
[10:35:10] Never gonna tell a lie and hurt you
[10:35:14] :P
[10:54:32] * _joe_ whistles
[11:02:06] TIL !sing
[11:06:58] it was hinted at in the docs, IIRC
[11:07:56] ...though now I'm remembering the Phantom of the Opera ;p
[11:09:28] tbh, I am disappointed it answers to !sing and not !rickroll
[11:09:33] hmm
[11:09:35] !rickroll
[11:09:48] let's fire up git and do an MR
[11:17:51] we could teach it a library of suitable songs :)
[11:18:00] "suitable"
[11:33:06] fuzzy-match the alerts and sing suitable songs. E.g. "DC overheats" -> "We didn't start the fire."
[11:33:40] or "the roof is on fire"
[11:58:24] how bad a failure results in "Daisy, Daisy"?
[12:00:54] Several rows lost in all DCs
[12:01:31] :)
[12:09:16] who owns the API maxlag calculation? I'm trying to work out which team should get T331405
[12:09:20] T331405: Depooled servers may still be taken into account for query service maxlag - https://phabricator.wikimedia.org/T331405
[12:33:26] WMDE I guess
[12:40:15] Every time I see WMDE, my mind first parses it as "WindowMaker Desktop Environment"
[13:00:39] <_joe_> Emperor: uh that was solved AFAIK
[13:01:00] <_joe_> ah no I see
[13:05:13] <_joe_> Emperor: you can tag it #User-joe
[13:18:05] * Emperor still a bit confused, maybe it just has to stay sat in the clinic queue
[13:34:52] Heads up - eqiad row A network maintenance starting in ~25 mins (T329073)
[13:34:53] T329073: eqiad row A switches upgrade - https://phabricator.wikimedia.org/T329073
[13:59:09] Emperor: let me know when you are happy for network maintenance to proceed
[14:00:00] topranks: sorry, just finished doing my depools and updated the phab task. You're good from my pov
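(Aside on the fuzzy-matching idea from 11:33: a toy sketch of how an alert name could be matched against a small song table with Python's difflib. This is not how sirenbot/Vopsbot is implemented; the alert names and songs below are made up for illustration.)

```python
# Toy sketch only: fuzzy-match an alert name against a made-up song table.
# Not how sirenbot/Vopsbot works; every name below is illustrative.
import difflib

SONGS = {
    "DC overheats": "We didn't start the fire",
    "the roof is on fire": "The roof is on fire",
    "several rows lost in all DCs": "Daisy, Daisy",
}

def pick_song(alert_name: str) -> str:
    """Return the song for the closest-matching known alert, else the default."""
    match = difflib.get_close_matches(alert_name, list(SONGS), n=1, cutoff=0.4)
    return SONGS[match[0]] if match else "Never gonna give you up"

print(pick_song("Datacenter overheating"))  # expected: "We didn't start the fire"
```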
[14:00:12] Emperor: ok great thanks :)
[14:00:43] (what do you mean that was cutting it fine before 14:00 UTC? :) )
[14:00:59] topranks: All clear from the Data Engineering perspective.
[14:01:19] btullis: great thanks :)
[14:06:04] hnowlan: let me know when the restbase depool is done thanks :)
[14:06:18] topranks: done!
[14:08:40] hnowlan: thank you!
[14:13:42] Just waiting on some iDRAC firmware upgrades (that got started inconveniently) to complete before rebooting switches
[14:15:51] pages arriving
[14:16:03] do we have a known cause?
[14:16:17] I guess network maintenance?
[14:18:50] the mathoid page was just an expired downtime
[14:18:51] marostegui: I've not started the disruptive part of the maintenance
[14:18:52] expired downtime
[14:19:08] from the k8s rebuild
[14:19:08] yep akosiaris mentioned it in -operations
[14:19:22] ok yep if it was those then I think we're ok
[14:20:28] repeating what I said in -operations, you can safely ignore the helmfile deploys, I've cordoned off the row A nodes
[14:21:03] thanks
[14:21:22] Everything is ready for the reboot. I am rebooting the row A switches now; hard downtime starting now
[14:21:33] ack
[14:21:48] let's !log it ;)
[14:21:58] volans: yep done
[14:22:28] sorry missed that
[14:22:42] better than I missed the 'log' :P
[14:32:33] switches are starting to come back online
[14:33:01] still waiting on EX4300 / 1G units
[14:34:00] master/backup roles assumed by asw2-a7-eqiad and asw2-a2-eqiad
[14:34:54] MAC forwarding tables being populated on 10G devices
[14:35:26] EX4300 / 1G units showing online and present in VC now
[14:35:56] Netbox is working for me, at first glance things look ok :)
[14:37:38] {◕ ◡ ◕}
[14:39:52] tested two puppet runs in eqiad, re-enabling in general now
[14:40:03] moritzm: ack, thanks
[14:41:19] done
[14:41:57] ftr pki can stay in codfw
[14:42:10] I am deploying things fine in kubernetes row A fwiw
[14:42:33] akosiaris: great :)
[14:44:49] topranks just confirming, switch reboots are done?
[14:46:00] inflatador: yes, all complete and I've not found any sign of problems
[14:46:19] think we're good, I'll give the official all clear in another 5 mins, just double-checking a few things
[14:46:34] topranks np, thanks for the update
[14:46:34] cool. can we proceed with host reimages?
[14:46:51] akosiaris: yeah no reason not to
[14:46:59] thanks!
[14:47:57] I want to increase the systemd timeout for smartmontools from the default of 90s to 300s (since on the latest swift backends it takes about 2 minutes to start). Is it OK to do this in the smart class, or should I make the change swift-backend-specific? ISTM that increasing the timeout should be generally benign
[14:48:35] startup/stop timeout, that is
[14:48:44] Emperor: is that run also via NRPE?
[14:49:57] volans: not sure; the issue I'm having is that nagios is unhappy because systemd is marking the unit as failed because it gets bored waiting for it to start up
[14:50:15] I wouldn't think nagios would start up the entire daemon, rather query it for status
[14:50:35] I'm happy to declare the switch maintenance successful / over
[14:50:47] Anything that's not already re-pooled we can go ahead and do now
[14:51:02] what I'm saying is that if we call smartmontools from nrpe and it takes longer than the nrpe timeout it will fail anyway
[14:51:07] \o/ topranks
[14:51:58] thanks all for the assistance / help! I know these row-wide outages are a big headache :)
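(For context on the smartmontools question at 14:47: a minimal sketch of the kind of systemd startup-timeout override being discussed, written as a hand-edited drop-in. The unit name and path are assumptions; the actual change was done via Puppet, in the Gerrit change linked later at 15:35.)

```ini
# /etc/systemd/system/smartmontools.service.d/timeout.conf -- sketch only;
# unit name and path are assumptions, the real change went through Puppet.
[Service]
# smartd takes ~2 minutes to start on the latest swift backends, so allow
# more than the 90s default before systemd marks the unit as failed.
TimeoutStartSec=300
TimeoutStopSec=300
```

A drop-in like this only takes effect after a `systemctl daemon-reload`.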
[14:52:13] ganeti/eqiad is all fine as well
[14:52:25] volans: I think this is just the time taken for smartd to initially start up
[14:52:37] ack
[14:52:58] (some light testing on ms-be2070 certainly suggests as much)
[14:53:00] topranks: nicely done!
[14:54:17] +1, nicely done topranks et al indeed
[14:55:54] <3
[15:35:10] https://gerrit.wikimedia.org/r/c/operations/puppet/+/895309 <-- change to increase the timeout for smartd startup. +1 anyone? :)
[15:38:44] looking
[16:48:02] akosiaris: thank you for an uneventful redeploy of Toolhub in the upgraded cluster.
[16:51:23] The main sign I have been able to find of the whole process in the Toolhub site itself is just a gap in the crawler run log. There were no runs between 09:07 and 15:07 (as expected).
[20:30:53] Hey all, having some problems deploying with scap: ebernhardson (SWE) can't seem to deploy https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags-scap-search but I can, is there a keyholder or other perms issue we need to look at?
[20:31:53] what error are you getting?
[20:33:02] taavi: top message here https://phabricator.wikimedia.org/P45275
[20:35:08] scap.cfg in that repo specifies `keyholder_key: deploy_airflow`, so you need to look here https://gerrit.wikimedia.org/g/operations/puppet/+/2c8c679e083ad7067850d595b0aa6b578ef1a60e/hieradata/role/common/deployment_server/kubernetes.yaml#112
[20:35:24] you need to be a member of any of those four groups to deploy
[20:37:02] thanks taavi! Per https://gerrit.wikimedia.org/g/operations/puppet/+/production/modules/admin/data/data.yaml#900 it does look like he's in the correct group, is there something we need to sync maybe?
[20:38:38] it's missing from `profile::admin::groups` on the deployment server
[20:40:43] taavi awesome, so I just need to add that group to https://gerrit.wikimedia.org/g/operations/puppet/+/2c8c679e083ad7067850d595b0aa6b578ef1a60e/hieradata/role/common/deployment_server/kubernetes.yaml#8 ?
[20:42:30] as long as that group does not add any unwanted sudo rules or other side effects to deploy* hosts, then yes
[20:43:22] cool. I don't think it does, but you can see the perms it grants in the data.yaml link above
[20:46:11] topranks: if you're around: do you remember if the lvs interface vlan trunking stuff on the asw's is configured by netbox, or needs initial config manually?
[20:46:54] we're looking at a case (eqsin) where the switch side of the trunking doesn't seem to be configured at all, but netbox does know about the trunked interface and its IPs. So I'm starting to guess maybe it was meant to be manual?
[20:47:15] bblack: hey
[20:47:44] basically I know what's wrong, but I'm not sure which way to fix it :)
[20:47:53] The switch-side config is driven from netbox, but the Netbox helper scripts that dc-ops use when deploying servers don't automatically add the right config
[20:48:04] So we need to manually do it in Netbox, then push to the device with Homer
[20:48:19] I can look now, what server is it?
[20:48:42] all of lvs500[567], but I think we need to take some care and do failovers, they're live.
[20:48:50] err sorry wrong numbers
[20:48:56] lvs500[456]
[20:49:20] I suspect when we installed all the new hardware in eqsin a while back, we just never fixed up the vlan trunking on these.
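(For context on the deployment-server fix discussed at 20:38-20:43: a sketch of the hieradata edit in question. The log never names the POSIX group that should gain access to the deploy_airflow key, so the group below is a placeholder.)

```yaml
# hieradata/role/common/deployment_server/kubernetes.yaml -- sketch only;
# existing entries are left out, and the group name is a placeholder.
profile::admin::groups:
  # ...existing groups unchanged...
  - airflow-search-admins   # hypothetical group for the airflow-dags-scap-search deployers
```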
[20:49:27] let me have a look, depending on the existing config we can add the vlans without interrupting currently working connections
[20:49:34] the hosts are configured for it, but not the switch side
[20:50:31] the reason it has gone unnoticed is that we really only configure those public trunk connections for consistency and in case of future needs, but no actual production service relies on it working at the edge sites (only at the core)
[20:50:33] Yeah - looks like Vlan 510 (public1-eqsin) is missing on the switch side
[20:50:57] we only figured this out because it also screws up ssh-ing into these hosts from the local bastion, because the host wants to route the traffic through the broken trunked interface :)
[20:51:19] ah makes sense
[20:51:58] So in terms of the switch config we can make the change without affecting the link
[20:52:19] ok, maybe try 5006 first, it's the secondary/backup one
[20:52:24] it's only a logical change to the mode adding the tagged vlan, it shouldn't affect any untagged traffic
[20:52:25] ok
[20:52:35] best to be safe anyway
[20:52:36] well you have to change the port's mode to trunked too, and set native-vlan-id
[20:52:41] but yeah I guess you know all that :)
[20:53:08] I would have assumed it would blip the interface, but I'm terribly pessimistic about trusting juniper
[20:56:16] where in netbox needs fixing, anyways? I was comparing a working example from ulsfo and couldn't find where it would be anyways.
[20:57:01] brain fart my side too - I was changing our "netbox next" test server
[20:57:10] You need to edit the switch interface directly
[20:57:14] (they both show an IP address in the correct vlan assigned to the tagged sub-interface, but I don't see anything related-but-missing)
[20:57:34] oh, got it. I was assuming it was derived from metadata on the host side of this link
[20:57:39] Yeah, not the host side, it's the switch side you need to change
[20:58:16] So if you are looking at the ints for lvs5006, click the "xe-0/0/15" that should be under 'connection'
[20:58:28] It'll bring up the switch-side interface: https://netbox.wikimedia.org/dcim/interfaces/27732/
[20:58:41] Click EDIT on that page, then right at the bottom is the vlan setup
[20:59:01] got it
[20:59:02] We change mode to 'tagged', leave 'untagged vlan' alone, and add 510 to the 'tagged vlans'
[20:59:28] I have an objective to improve our scripts for DC-ops to allow this to be done at host provision time fwiw
[20:59:38] is LVS the only case like this?
[21:00:02] we will eventually get rid of it, which maybe would be the end of this insanity :)
[21:00:02] Our Ganeti hosts are the same, and now WMCS have moved to using tagged handoff
[21:00:08] ah ok
[21:00:10] So it's becoming more of a pain
[21:00:38] Ok I'll run Homer to change lvs5006, that ok?
[21:01:26] topranks: yeah
[21:02:37] Ok done
[21:02:42] I was running a ping from the serial console
[21:02:48] It did lose 1 ping
[21:03:11] I can confirm it does fix the actual problem though, and there was no link-loss at the host level
[21:03:24] I can just temporarily depool the other two just in case
[21:04:13] Yeah I had "ip monitor" running too, definitely no physical link drop
[21:04:29] we're at the low point in eqsin traffic for the day anyways, so I can just fail both lvs500[45] over to the now-fixed 6 for a bit
[21:04:29] taavi one more question if you're able, do we need to reload keyholder after we add the perms?
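(For reference on the Netbox edit described at 20:57-20:59: the actual fix was made in the web UI and pushed to the switch with Homer. A rough equivalent via the Netbox API could look like the pynetbox sketch below; the interface ID and VLAN details come from the log, the token is a placeholder, and this is illustrative rather than the procedure that was used.)

```python
# Illustrative sketch: the same switch-port VLAN edit via the Netbox API.
import pynetbox

nb = pynetbox.api("https://netbox.wikimedia.org", token="REDACTED")  # token is a placeholder

# The switch-side interface connected to lvs5006 (xe-0/0/15 per the log).
iface = nb.dcim.interfaces.get(27732)

# The VLAN that was missing from the switch port (public1-eqsin, VID 510).
vlan = nb.ipam.vlans.get(vid=510, name="public1-eqsin")

# Change mode to 'tagged', leave the untagged VLAN alone, and add 510 to
# the tagged VLANs -- mirroring the manual UI edit described above.
iface.mode = "tagged"
iface.tagged_vlans = [v.id for v in iface.tagged_vlans] + [vlan.id]
iface.save()
```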
[21:04:34] But there was an interruption to traffic, presumably while the switch changed its internal state
[21:04:53] bblack: ok cool, let me know when that's done, I'll prep the change for those now
[21:04:58] It's still bombing out with the same error ;(
[21:06:53] topranks: you're good to go!
[21:07:42] incidentally, I had never used "ip monitor" before today. I found out about it while trying to debug the same issue that led me here heh :)
[21:07:49] bblack: running
[21:08:05] heh yeah it's very useful, I only found out about it last year or so.
[21:08:06] pretty handy thing!
[21:08:21] very useful for seeing what the kernel is doing under the hood
[21:08:39] yeah, it was showing me basically that ARP was failing, which was my critical clue
[21:09:13] gotcha, I was running it here to see if there was a momentary UP/DOWN transition for the interface
[21:09:25] Ok that's changed switch-side now so hopefully we are ok
[21:10:07] ok, repooling
[21:11:28] I don't see why the connection from the bastion was failing, the DNS records point to the private IP
[21:11:35] and the default on the hosts is via that interface
[21:11:56] topranks: the problem is the bastion's on the same public1 /28 network as that trunked vlan interface.
[21:12:06] ah yeah of course
[21:12:08] so the ssh packets from bastion->lvs get there ok, but lvs wants to reply over the trunk
[21:12:11] and it fails :)
[21:12:38] since nobody regularly uses the eqsin bastion, nobody really noticed
[21:12:57] gotcha
[21:12:59] but apparently vg reconfigured his ssh config to use the site-local bastion for various .foo.wmnet and found it today
[21:13:45] topranks: in any case, all fixed and back to normal state now, thank you!
[21:13:48] well like I say, 2 things on my agenda over the next few weeks are to add an option to our 'server provision' script to trunk additional vlans
[21:13:53] so it can be part of the normal workflow
[21:14:29] The other is to slightly adjust how we represent the host interfaces on the Netbox side, which will enable us to add a report to catch any discrepancies like this that might occur
[21:14:39] so the situation should improve
[21:14:42] anyway glad it's ok now!
[21:14:43] sounds like progress!
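(For reference on the "ip monitor" usage mentioned above: a minimal example of watching link-state and neighbour/ARP events from the kernel, which is the kind of output that exposed the failing ARP here.)

```sh
# Watch link-state and neighbour (ARP) events in real time:
ip monitor link neigh

# Or watch everything the kernel reports (links, addresses, routes, rules, ...):
ip monitor all
```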