[05:52:07] 10serviceops, 10Performance-Team: Switch ChronologyProtector from redis to memcached - https://phabricator.wikimedia.org/T314453 (10Krinkle)
[05:52:24] 10serviceops, 10Performance-Team: Switch ChronologyProtector from redis to memcached - https://phabricator.wikimedia.org/T314453 (10Krinkle)
[05:52:48] 10serviceops, 10Performance-Team: Switch ChronologyProtector from redis to memcached - https://phabricator.wikimedia.org/T314453 (10Krinkle)
[06:34:13] 10serviceops, 10Performance-Team: Switch ChronologyProtector from redis to memcached - https://phabricator.wikimedia.org/T314453 (10Di3sel1975)
[06:41:40] 10serviceops, 10Performance-Team: Switch ChronologyProtector from redis to memcached - https://phabricator.wikimedia.org/T314453 (10Aklapper)
[07:03:46] I'm switching kubestagetcd2002 to "plain" disk storage temporarily
[07:46:03] <_joe_> hnowlan: looks like concurrency improved, but also latency of the single job went down
[08:24:41] <_joe_> jayme, jelto I need you to work with me on turning off servers for today's maintenance
[08:24:55] o/
[08:25:06] <_joe_> it was completely unhandled and I can't do this work alone as I have 4 hours of meetings in the afternoon
[08:26:44] <_joe_> so, first we need to verify which servers owned by us will need to be turned off
[08:27:55] _joe_ do you have some more context/phab ids? I'm just aware of some racking and decomm
[08:28:11] https://phabricator.wikimedia.org/T309956 is the master task AIUI
[08:28:28] racks B3 and C2 due today
[08:28:35] <_joe_> no
[08:28:37] <_joe_> more
[08:28:38] https://phabricator.wikimedia.org/T310070 https://phabricator.wikimedia.org/T310145
[08:28:39] <_joe_> https://phabricator.wikimedia.org/T310070
[08:28:59] ah, indeed
[08:29:04] <_joe_> also b6-b8
[08:29:17] <_joe_> I guess first order of business is to completely depool codfw from anything
[08:29:30] <_joe_> including restbase-async, that we need to move to eqiad
[08:29:47] <_joe_> there's way too much stuff going down not to provide a degraded service
[08:31:07] <_joe_> the thorny server is conf2004 tbh
[08:31:15] <_joe_> that I'd turn off only when strictly needed
[08:31:25] <_joe_> valentin is reconfiguring pybal not to use it
[08:31:39] <_joe_> I'm commenting it out for DNS SRV records now
[08:32:39] <_joe_> hnowlan: can you take care of the restbase servers? it seems to me that quite a bit too many of them are going down today
[08:32:41] from https://phabricator.wikimedia.org/T310070 it seems as if B7, B8 are empty, which is not true
[08:32:53] <_joe_> jayme: yeah that tricked me too
[08:33:16] <_joe_> jayme: our mc server is newly racked, not in production
[08:34:25] mc2046? that's nice - means we're fine for B7
[08:35:17] <_joe_> jayme: yeah we have other issues
[08:35:27] <_joe_> we need to go around the racks that were maintained yesterday
[08:35:36] <_joe_> and bring back the servers that were powered down
[08:35:37] in B8 we have parse2008-2010 and mc2025,2026
[08:35:55] <_joe_> the mc2025 and 26 are annoying but we can live with that
[08:36:20] should we do a quick meet to gather/plan/split everything?
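A minimal sketch of the "is this host actually in production?" check discussed above, using conftool from a cluster-management (cumin) host; the host names are just the ones mentioned in the conversation, and the output shape mirrors the thumbor example quoted later in this log:

    # one JSON line per service the host is pooled for, e.g.
    # {"thumbor2004.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=thumbor,service=thumbor"}
    sudo confctl select 'name=parse2008.codfw.wmnet' get
    sudo confctl select 'name=mw2325.codfw.wmnet' get

Note that hosts which are not behind LVS/pybal (such as the mc* memcached shards) won't show up in conftool at all; for those, the puppet/mcrouter configuration has to be checked instead, as happens later for mc2037.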
[08:36:33] <_joe_> please I have enough meetings today :/
[08:36:53] fine with me - just thought it might be quicker
[08:37:15] <_joe_> yeah maybe :P
[08:39:33] <_joe_> so let's go in order
[08:39:41] <_joe_> first let's look at the racks done yesterday
[08:39:50] <_joe_> it's challenging being scattered across multiple tasks
[08:39:52] * jayme checking
[08:40:39] https://etherpad.wikimedia.org/p/F2pDHw25sm7eJS1Q3Tk2
[08:41:28] <_joe_> yeah, no, let's do things a bit better :P
[08:41:37] <_joe_> lemme organize there
[08:42:03] I can help if needed
[08:42:07] will read :)
[08:43:45] _joe_: yep looking
[08:44:06] I'm still a bit lost. I guess I need a little bit more context on what "look at racks done yesterday" means. Like powercycling them, doing basic health checks/logins and pooling them etc.
[08:44:32] <_joe_> powercycling them and verifying puppet runs should be enough
[08:44:40] <_joe_> I don't think anyone depooled anything
[08:44:50] ack
[08:45:49] why would we need to powercycle if they are powered on already?
[08:46:39] <_joe_> jayme: they're not AIUI
[08:46:47] <_joe_> we need to power them on
[08:46:56] <_joe_> so, look at the etherpad
[08:47:01] <_joe_> I'm starting to add information
[08:47:07] ok. that makes more sense to me then
[08:47:12] <_joe_> sorry, I don't have more context than you have
[08:47:34] <_joe_> I assumed someone was handling this
[08:49:01] <_joe_> so, let's start with the powerups
[08:49:08] <_joe_> I'll take A7
[08:50:23] B1 is only cloud*
[08:50:57] <_joe_> ok, write it in the etherpad :)
[08:51:04] <_joe_> I hope sre.hosts.reboot-single will work
[08:51:53] restbase shouldn't be a concern btw, three hosts all in the same rack shouldn't cause any degradation in performance and assuming we're not going to be out for days there's no risk of data loss
[08:52:34] <_joe_> hnowlan: we will need someone to power them back up after maintenance
[08:53:12] can do
[08:54:41] <_joe_> ok reboot-single works to bring back hosts
[08:54:57] nice
[08:55:06] I'll take B5?
[08:55:12] losing those thumbor hosts is almost guaranteed to cause issues
[08:55:49] <_joe_> hnowlan: yeah...
[08:56:05] not much that can be done, they're stretched thin as is and those are the newer servers
[08:56:14] <_joe_> hnowlan: can you discuss that with Emperor in #sre?
[08:56:23] <_joe_> he's managing the swift cluster
[08:56:51] <_joe_> hnowlan: do we need to reimage a couple servers to be thumbor servers?
[08:56:57] _joe_: should we power on role(insetup) nodes as well?
[08:57:06] <_joe_> jayme: yes
[08:57:48] _joe_: even outside of this, yes most likely. Mat was trying to scrounge some hardware for this a few weeks ago aiui
[08:58:09] <_joe_> ok, let's be practical here
[08:58:16] <_joe_> thumbor can't be "depooled" right?
[08:58:27] <_joe_> unless we completely depool swift in codfw
[08:58:38] <_joe_> and then re-sync it
[08:58:48] <_joe_> uhm on second thoughts
[08:59:13] <_joe_> we can live without some thumbnails I guess, but the jobqueue will still try to generate them in both DCs
[08:59:37] yeah, backlogs will build up etc but it won't be a disaster
[08:59:49] but there'll be noise
[09:02:44] I'm facing failing cookbooks with sre.hosts.reboot-single. It fails with "100.0% (1/1) of nodes failed to execute command 'reboot-host': mc2024.codfw.wmnet". Do you use some special settings?
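A sketch of the power-up path used here, assuming the standard cookbook invocation from a cluster-management host; the management-console step is only needed when the machine is actually powered off (as mc2024 turned out to be), and the exact DRAC access and commands depend on the hardware generation:

    # bring a host back and let the cookbook handle downtime/checks
    sudo cookbook sre.hosts.reboot-single mc2024.codfw.wmnet
    # if the host is powered off, the cookbook cannot reach it; power it on from the
    # management interface first (Dell iDRAC shown, commands assumed):
    ssh root@mc2024.mgmt.codfw.wmnet
    racadm serveraction powerup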
[09:04:14] <_joe_> uhm
[09:04:28] <_joe_> jelto: sorry, I just saw after writing
[09:04:37] <_joe_> all servers were already powered up
[09:04:39] <_joe_> minus one
[09:04:42] <_joe_> mc2024
[09:04:49] <_joe_> guess which one I checked this morning?
[09:05:37] <_joe_> so yeah, you'll need to get to console
[09:13:56] <_joe_> ok, I'm looking at the servers to power down
[09:14:25] <_joe_> you can start downtiming, then shutting them down, minus the ones I'll mark for each rack
[09:22:15] think I completed the list of "our" nodes in each rack in the etherpad
[09:22:54] do we downtime them like 24h?
[09:24:51] <_joe_> jayme: I would do that for all servers but the ones I marked
[09:24:59] An already maintained rack is "DONE" in the etherpad when all of our hosts have power and are online? Is an sre.hosts.reboot-single run still needed?
[09:25:19] <_joe_> jelto: I don't think so
[09:25:26] <_joe_> sorry, gotta go afk for 5 minutes
[09:25:46] <_joe_> jelto: did you log into mc2024 via console and reboot it?
[09:26:27] <_joe_> I see you did
[09:26:29] _joe_: yes I powered it on and currently running sre.hosts.reboot-single
[09:26:57] _joe_: re: downtime. All as in "all servers in the rack"?
[09:27:14] <_joe_> so for the mw servers
[09:27:21] <_joe_> we need to set them as pooled=inactive
[09:29:26] <_joe_> jelto, jayme I suggest you start powering down the mw* servers for now
[09:29:36] <_joe_> so the quickest way I can imagine is
[09:30:26] <_joe_> ssh $server sudo /bin/bash -c "decommission && shutdown -h now"
[09:30:33] <_joe_> (not tested)
[09:30:47] <_joe_> the "decommission" script just sets pooled=inactive for the server to all services
[09:30:57] <_joe_> ok, I need to go afk :)
[09:31:00] <_joe_> tty in 5 mins
[09:33:52] puppet is failing on the powered-on host mc2024 with "Error while evaluating a Function Call, No Redis instances found for 10.192.16.61 (file: /etc/puppet/modules/profile/manifests/redis/multidc.pp, line: 27, column: 13"
[09:34:42] <_joe_> jelto: it's ok, disregard, i'll fix it
[09:34:57] <_joe_> we've swapped in another server
[09:37:19] ack
[09:37:37] should I cordon and drain the kubernetes nodes?
[09:38:12] <_joe_> jelto: that would be ideal I think
[09:38:19] <_joe_> jayme: any opinions?
[09:38:36] yeah, go ahead. I've added that as an action item to the etherpad
[09:39:24] <_joe_> please mark the servers as done on the task too when you've done all the ones we can in a rack
[09:39:37] <_joe_> I'll work on the software patches
[09:40:19] I'm going to run the downtime cookbook (24h) for all of our nodes not marked by joe
[09:41:01] <_joe_> ack
[09:49:32] downtime is set
[09:50:06] need to get some breakfast/lunch now. Back in ~30min
[10:01:47] jayme, _joe_: I cordoned and drained the kubernetes nodes. However around 20 pods are pending due to lack of CPU resources. Is this fine during the maintenance in the codfw kubernetes cluster? see kubectl get pods --field-selector status.phase=Pending -A
[10:02:20] <_joe_> jelto: we can reduce resources once we've depooled services I guess
[10:08:42] ack
[10:11:00] <_joe_> jelto: you should also depool those k8s nodes before powering them down ofc
[10:18:14] _joe_: ok. So I'll do "confctl select "name=kubestage2002.codfw.wmnet" set/pooled=no" for all of the nodes first (with replacing the name ofc)?
[10:18:34] <_joe_> jelto: or you ssh to the node and run "sudo depool"
[10:18:38] <_joe_> but yes
[10:24:44] _joe_: you are using "shutdown -h now" on the hosts to power them down, correct? I'll do the same for the kubernetes nodes? They are depooled now
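A condensed sketch of the per-node sequence discussed above for the kubernetes workers: cordon and drain from wherever the admin kubeconfig lives, depool in conftool, then shut down. The node name, the node-name format and the drain flags are illustrative:

    NODE=kubernetes2009.codfw.wmnet
    kubectl cordon "$NODE"
    kubectl drain "$NODE" --ignore-daemonsets
    sudo confctl select "name=${NODE}" set/pooled=no    # or: ssh "$NODE" sudo depool
    ssh "$NODE" 'sudo shutdown -h now'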
[10:25:11] <_joe_> jelto: yes
[10:25:26] <_joe_> shutdown -h now && logout if you're in a shell
[10:25:32] <_joe_> else it will hang :)
[10:25:45] ack I'll do that now and mark hosts as DOWN in the pad when done
[10:26:10] <_joe_> thanks
[10:26:19] <_joe_> don't bring down all k8s servers though
[10:26:35] <_joe_> if some still have pods, we should first do something like e.g. depool mobileapps, reduce the pods
[10:26:41] <_joe_> and btw
[10:26:43] <_joe_> mark all this
[10:28:24] I just depooled and drained kubestage2002 kubernetes2020 kubernetes2009 kubernetes2010 kubernetes2011 kubernetes2012. All other kubernetes hosts in codfw are untouched and running normally
[10:49:54] _joe_: there are some PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect alerts after shutting down the first kubernetes nodes. Is this expected? Should this be silenced/fixed/acked?
[10:50:24] <_joe_> I guess acked, but talk with arzhel
[10:52:03] jelto: yeah, that's the bgp sessions of the k8s nodes going down
[10:52:34] there will also be calico pod alerts that need to be silenced in alertmanager
[10:54:23] ok I'll try to create a 24h silence for the 6 affected kubernetes nodes
[10:55:24] I've pinged Arzhel in -traffic
[11:01:08] I'll start shutting down mw servers now
[11:02:43] <_joe_> thanks
[11:06:38] all 6 kubernetes hosts are down now
[11:13:52] <_joe_> jelto: thanks
[11:14:21] <_joe_> jayme, jelto: I'd depool codfw now for all services
[11:14:49] <_joe_> excluding quite some stuff, you can check the list of excluded services in
[11:14:57] relevant restbase hosts are all down
[11:15:01] <_joe_> cumin1001:~oblivian/dc_maint.sh
[11:15:13] <_joe_> hnowlan: about to depool restbase in codfw from all requests
[11:15:22] <_joe_> that means all changeprop requests will go to eqiad
[11:15:29] ack
[11:15:37] <_joe_> can you check the load on eqiad in the next ~ 30 minutes?
[11:15:41] will do
[11:15:48] <_joe_> does anyone have objections?
[11:16:02] _joe_: have not checked the list as of now
[11:16:31] <_joe_> jayme: I see some alerts from services in codfw
[11:16:38] <_joe_> so I'd like to depool them quickly
[11:17:21] skimmed it - looks good
[11:17:36] <_joe_> ack, proceeding
[11:18:02] <_joe_> hnowlan: you should downtime those hosts on icinga
[11:18:16] <_joe_> also, someone needs to update the tasks :)
[11:18:25] those aren't the ones being shut down, I guess all restbase hosts will be spamming if there are mw hosts down
[11:19:57] going to downtime restbase2*.codfw.wmnet I guess for a few hours unless there's a more elegant way to do it
[11:20:41] maybe an ignorant question but why would they be spamming in case of depooled mw hosts being down?
[11:21:16] not certain, I wouldn't assume it would cause issues but I figured that was the relevant change recently?
[11:21:28] that said they seem to have stopped so maybe it was a temporary thing
[11:22:12] hmm...maybe they had active connections to mw hosts I just powered down
[11:22:26] <_joe_> hnowlan: oh sigh kartotherian needs to run from codfw?
[11:22:31] not really relevant to this situation but it looks like the pybal yaml files aren't updating https://config-master.wikimedia.org/pybal/codfw/restbase-https
[11:22:41] _joe_: argh
[11:22:52] I'm not 100% certain of current state, one sec
[11:23:04] <_joe_> I think so
[11:24:02] <_joe_> ah I excluded it already
[11:24:05] <_joe_> I'm a good boy
[11:24:25] heh. I've asked if eqiad is safe to go but I guess it's okay to leave up?
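The "depool codfw for all services" step above happens at the discovery (DNS) layer rather than per host. A minimal sketch with confctl: the `get` form is the one used later in this log, while the `set/pooled=false` value for discovery objects is an assumption:

    # inspect the current discovery state for a service
    sudo confctl --object-type discovery select 'dnsdisc=restbase.*' get
    # depool a single service from codfw at the discovery layer (value assumed)
    sudo confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false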
[11:26:46] <_joe_> yes
[11:26:53] <_joe_> we're not touching maps servers after all
[11:27:28] <_joe_> ok, time to write an email to the rest of the team
[11:27:43] _joe_: parse2008-2010 I would also set pooled=inactive, right?
[11:27:54] <_joe_> yes
[11:28:01] <_joe_> sudo decommission
[11:28:03] <_joe_> on the server
[11:28:05] yes, sure
[11:28:11] <_joe_> those are parsoid servers
[11:28:13] <_joe_> new name
[11:28:15] <_joe_> :P
[11:28:20] yes, I know
[11:28:26] anything special for mc and rdb?
[11:28:38] the ones *not* having a note of yours
[11:28:53] <_joe_> jayme: (cordoned nodes will not be checked by pybal)
[11:28:56] <_joe_> that is not true
[11:29:04] <_joe_> they will be checked and depooled
[11:29:14] <_joe_> but we need to 'decom' them too I think
[11:30:08] _joe_: but the pybal kubernetes plugin does filter unschedulable nodes
[11:30:21] <_joe_> jayme: oh we are using it now?
[11:30:25] <_joe_> I thought we were not
[11:31:04] https://github.com/wikimedia/PyBal/blob/master/pybal/kubernetes.py#L72
[11:31:22] well...possible that it is not rolled out ofc.
[11:31:29] I've not checked on the servers directly
[11:31:38] <_joe_> yeah "master"
[11:34:26] so we're not using the kubernetes specific code at all, right?
[11:35:21] <_joe_> go look at pybal logs in codfw
[11:35:26] <_joe_> that will clarify that
[11:35:44] phew...that breaks some of my assumptions for sure
[11:37:38] ok. I'm going to set the k8s nodes to pooled=inactive (cc jelto)
[11:37:48] ack
[11:38:24] <_joe_> gonna eat something, bb in 5 mins
[11:41:40] * jayme done
[11:42:57] hmm...we have a couple of k8s nodes in codfw with weight=0 ...that does not seem right
[11:43:21] the five new ones
[11:44:37] <_joe_> jayme: sigh
[11:44:42] <_joe_> yeah that needs to be fixed
[11:44:59] doing so
[11:45:18] we should probably also lower the weight on the ganeti VMs ...but that's something for later
[11:49:25] <_joe_> thanks for the help people
[11:49:36] <_joe_> you've been amazing :)
[11:49:41] _joe_: anything special for mc and rdb? the ones *not* having a note of yours
[11:49:56] <_joe_> jayme: see the bottom of the etherpad
[11:51:51] <_joe_> thanks for the live typocheck :D
[11:52:00] :)
[11:52:13] I meant anything special for shutting them down
[11:52:21] mc2037, rdb2008
[11:53:38] * jayme going to update the row B task
[11:55:27] <_joe_> nope
[11:55:35] <_joe_> 37 is not in rotation IIRC
[11:55:45] <_joe_> and rdb2008 is a passive replica
[11:57:20] how would I check the former?
[12:01:03] <_joe_> jayme: wait, that server is role::memcached in site.pp?
[12:01:09] <_joe_> if so, it's in rotation, apologies
[12:01:17] <_joe_> I thought only up to 36 were
[12:01:24] <_joe_> and I added 38 this morning
[12:01:47] <_joe_> ah right
[12:01:50] <_joe_> it's installed
[12:01:57] <_joe_> let me see if it's also in the shards
[12:01:57] yes, mc2037 is role(mediawiki::memcached)
[12:02:21] <_joe_> and yes, it's in prod already
[12:02:38] <_joe_> see hieradata/common/profile/mediawiki/mcrouter_wancache.yaml
[12:03:05] yes..that's what I was grepping and what confused me :)
[12:03:13] <_joe_> ok, cool :)
[12:03:24] <_joe_> sorry I forgot we had a hardware failure in codfw
[12:03:55] would you mark it in the etherpad with the "will need to be..." sentence ... ah:)
[12:05:27] <_joe_> {{done}} :P
[12:05:28] do we need to do something for the probably decommed mw nodes to not boot?
[12:05:33] <_joe_> I made it more readable
[12:05:44] <_joe_> just remove them from the list and the task
[12:05:51] they probably have MBR wiped, right?
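The "is mc2037 in rotation?" question above comes down to grepping the puppet repository; a small sketch against a checkout of operations/puppet, using the two files named in the conversation:

    grep -n 'mc2037' manifests/site.pp
    grep -n 'mc2037' hieradata/common/profile/mediawiki/mcrouter_wancache.yaml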
[12:05:58] <_joe_> hopefully
[12:07:18] <_joe_> jayme: do you think preparing a schedule in the etherpad would be helpful?
[12:07:42] <_joe_> if you don't have more things to do, I suggest you take a break and come back before the maintenance
[12:07:53] <_joe_> I won't be around by then
[12:08:14] <_joe_> ...and in theory we should prep for tomorrow!
[12:10:28] what do you mean by schedule? A "who will be around" kind of schedule?
[12:10:55] <_joe_> no, more like "when will a rack go down"
[12:11:08] <_joe_> anyways lucasz should be online soon
[12:11:11] ah, could make sense to add it there
[12:11:30] <_joe_> I'll take a look at tomorrow's round
[12:11:36] <_joe_> so at least we're prepared
[12:11:42] do we know how long the swap is supposed to take? Assuming the "Time" column contains the start time
[12:12:00] <_joe_> 30-60 minutes
[12:13:45] ok
[12:14:17] I'll quickly update the row C task and add the times to the etherpad then
[12:15:31] _joe_: you've not marked B4 poweron as DONE - I suppose it is?
[12:15:44] <_joe_> B4??
[12:15:56] the one from yesterday
[12:15:59] top of the etherpad
[12:16:16] power*on* :-)
[12:16:19] <_joe_> ah sorry, yeah I think it's done since this morning
[12:16:29] ack
[12:18:02] <_joe_> jayme: I'll reduce the replicas for mobileapps in codfw
[12:18:10] <_joe_> that should solve the k8s cpu starvation
[12:18:34] funny. In the row C task, the second column is called "do you need to depool?" instead of "Is server depool and power down?" - slightly confusing
[12:18:38] _joe_: ack
[12:18:51] <_joe_> jayme: yeah I'm not 100% sure
[12:19:17] about what exactly?
[12:19:54] <_joe_> if we needed to power down the servers, then
[12:20:01] <_joe_> that's what was asked yesterday
[12:20:21] yeah...well...we already did, so 🤷
[12:22:02] assuming B5 was done yesterday (which I didn't know at the time), the two restbase hosts weren't rebooted at all?
[12:22:52] <_joe_> hnowlan: no idea, check the uptime :P
[12:22:58] <_joe_> but I think they were rebooted
[12:23:23] yeah, that's what I mean - uptime says 97 days :)
[12:23:47] 10serviceops, 10Infrastructure-Foundations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment Autopilot 🛩ī¸): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10jnuche)
[12:23:52] 10serviceops, 10Infrastructure-Foundations, 10Scap, 10Patch-For-Review, 10Release-Engineering-Team (Deployment Autopilot 🛩ī¸): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (10jnuche) 05Open→03Resolved
[12:23:56] uptime on restbase hosts is ~96 days
[12:23:56] 10serviceops, 10SRE, 10Release-Engineering-Team (Doing): Reduce latency of new Scap releases - https://phabricator.wikimedia.org/T292646 (10jnuche)
[12:24:11] in B5
[12:25:22] maybe depooling/shutting down is a precaution?
[12:26:50] ah they have two PSUs which are redundant. At least racadm getsel has some entries from yesterday about power loss on one of the PSUs
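Checking whether the B5 hosts were actually rebooted is just an uptime check across the fleet; a sketch with cumin from a cluster-management host, where the host selector is illustrative and follows the restbase2* pattern used earlier in the log:

    sudo cumin 'restbase2*.codfw.wmnet' 'uptime'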
[12:28:02] <_joe_> I assumed we were asked to power them down because they'd lose power for sure
[12:28:05] <_joe_> anyways
[12:28:27] <_joe_> https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/820116 if any of you wants to review
[12:28:37] <_joe_> reduces the replicas of mobileapps to 50
[12:28:48] <_joe_> that should lower us below the allocation problem
[12:29:31] +1ed
[12:30:18] <_joe_> thanks
[12:33:50] 10serviceops, 10Patch-For-Review: Migrate node-based services in production to node16 - https://phabricator.wikimedia.org/T308371 (10Jdforrester-WMF)
[12:35:29] <_joe_> hnowlan: so that you know, I did depool thumbor in codfw
[12:35:36] <_joe_> so we shouldn't be at risk
[12:35:46] _joe_: ack
[12:36:10] <_joe_> how's restbase in eqiad doing btw?
[12:37:23] <_joe_> ok, I'll take a break before my afternoon of meetings
[12:37:36] <_joe_> jayme, jelto anything more you need from me?
[12:38:23] don't think so...have you already checked the racks for tomorrow? If not I can prepare yet another etherpad
[12:38:54] (or better: add to the existing one)
[12:39:53] _joe_: looking busier so far but nothing too bad
[12:40:07] I'm gonna step away for a bit before things start happening too
[12:42:00] _joe_: I don't need anything more, I guess. Besides turning off the specific hosts before maintenance, monitoring possible incidents and powering machines on again, there is nothing more to do today?
[12:42:37] <_joe_> jelto: well, we could repool codfw after maintenance is done
[12:42:52] <_joe_> but that's well after you'll be offline
[12:44:11] Do we repool codfw tonight again or keep it depooled until tomorrow after 17UTC, when most of the work is done?
[12:53:34] <_joe_> jelto: yeah I mean we can
[12:53:45] <_joe_> we're disrupting work of other people...
[12:54:09] <_joe_> I'll send one more email to ops@ saying codfw will stay depooled for good until friday at least
[12:54:22] <_joe_> embarrassing tbh.
[12:54:44] <_joe_> it makes no sense in fact to repool it for like 3 hours
[12:55:14] <_joe_> ok, going afk for reals, I'll be back in ~ 1 hour at least
[12:55:30] <_joe_> (and it will be meetings at that point :/)
[12:56:01] <_joe_> sobanski: I hope that between the email and the etherpad there's enough to coordinate; jayme can fill in the blanks in my absence
[13:17:46] todos for this week's swaps should be in etherpad now. thanks all
[13:20:38] _joe_: jayme: did you all do anything about the jobrunners? mw2259 is still pooled for jobrunners and was causing scap errors
[13:21:52] humm...checking
[13:21:58] 09:21:10 fwiw there is also an apiserver (mw2317) that's also pooled & unreachable. didn't check the rest of the hosts though.
[13:22:17] need pooled=inactive so scap sync-all ignores them
[13:22:50] cdanis: strange ... thanks. I'll double check all of them
[13:23:04] (and set them to inactive ofc)
[13:23:07] thanks!
[13:24:38] thanks for noting
[14:05:58] I filed https://github.com/kelseyhightower/confd/issues/859
[14:08:52] cool, thanks
[14:13:39] 10serviceops, 10conftool: confd does not re-resolve SRV records after startup - https://phabricator.wikimedia.org/T314489 (10CDanis)
[14:15:25] jayme: conf2004, thumbor2003 and thumbor2004 need to be powered off right before B3 pdu replacement. Any preferences? I can take care of depooling (set/pooled=inactive?) and shutting down thumbor2003 and 2004 if you like.
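For stragglers like the ones reported above, the fix is to look up their conftool state and mark them inactive so scap and pybal ignore them; the host names are the two reported, and the syntax follows the confctl calls used elsewhere in this log:

    for h in mw2259 mw2317; do
      sudo confctl select "name=${h}.codfw.wmnet" get
    done
    # anything still pooled but powered down gets set to inactive
    sudo confctl select 'name=mw2259.codfw.wmnet' set/pooled=inactive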
[14:15:52] <_joe_> jelto: given I depooled thumbor, I think we can just power them off
[14:16:10] <_joe_> and the only server that will need special care will be conf2004
[14:16:31] {"thumbor2004.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=thumbor,service=thumbor"}
[14:16:31] {"thumbor2003.codfw.wmnet": {"weight": 10, "pooled": "yes"}, "tags": "dc=codfw,cluster=thumbor,service=thumbor"}
[14:16:42] looks pooled to me, or am I missing something?
[14:18:22] the service itself (discovery) is depooled in codfw: confctl --object-type discovery select 'dnsdisc=thumbor.*' get
[14:19:06] ah okay thanks for the hint :)
[14:19:53] _joe_: is there anything else needed for conf2004 apart from shutdown right before, boot right after, check etcd cluster-health?
[14:20:19] ok so I'll just power down thumbor2003 and 2004 in ~5 minutes?
[14:20:25] something zookeeper probably
[14:20:46] <_joe_> jayme: nope
[14:20:50] 10serviceops, 10SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (10Michael) Glancing at the repository, I'm not sure if there is anything that you need from us to migrate `wikibase/termbox` on Wikidata? Though I'm not very familiar with this particular part...
[14:21:13] <_joe_> jayme: check zk logs yeah
[14:21:23] _joe_: ack
[14:21:49] 10serviceops, 10conftool, 10Patch-For-Review: confd does not re-resolve SRV records after startup - https://phabricator.wikimedia.org/T314489 (10CDanis)
[14:25:07] I've asked in -dcops if they announce the start of work before they actually go ahead...if they don't respond, I'd say we take the servers down 14:27 - jelto
[14:25:20] (i'll do conf2004)
[14:25:44] ok thanks! makes sense. I'll turn off thumbor2003 and 2004 14:27 or later, if we hear something in -dcops
[14:26:36] looks like they just stick to the schedule :)
[14:27:10] ok shutting down thumbor2003 and thumbor2004 now
[14:27:35] same for conf2004
[14:31:21] forgot to downtime those, done that now (30m)
[14:32:14] thanks!
[15:01:41] I'm going to turn off mc2023 if dcops announce they are done in B3 and move to B6. I assume there is no additional depooling needed for mc hosts? Correct me if I'm wrong :)
[15:02:41] There are some recovery alerts. I guess mc2023 should be turned off now? any comments?
[15:03:38] I'd say "go". Downtime is already set for 30m
[15:03:41] I'm going to power off mc2023
[15:08:47] conf2004 came back on its own
[15:09:37] thumbor2003/2004 also back online. racadm told me "Server is already powered ON."
[15:16:32] conf2004 looks fine, so do etcd and zookeeper. I'm going to revert the dns change
[15:20:26] I'm going to check mw2259-2270, do a scap pull, check icinga alerts and pool them if everything is green
[15:21:38] 👍
[15:29:05] dns change is reverted and confd restarted in codfw,eqsin,ulsfo
[15:39:54] jelto: you need help with the mw hosts?
[15:40:17] I'm almost done with mw2259-2270, feel free to start with mw2310-2324
[15:40:31] icinga checks take quite long and one machine was not powered on automatically
[15:44:37] jayme: mw2259-2270 are all green (except DSH group alert) and scap pulled. I'm going to pool these hosts again, ok?
[15:50:11] jayme: yeah, go ahead
[15:52:35] pooling mw2259-2270 again
[15:53:56] ack. I'll take care of mw2310-2324
[16:09:30] mw2259-2270 pooled and all green, I'll remove the downtime
[16:10:04] mw2310-2324 are done as well
[16:15:01] rzl | mutante are you around already?
[16:15:17] just ending an interview, with you in a sec
[16:16:38] ah, okay. Sorry for interrupting then
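A sketch of the per-host repool flow used for mw2259-2270 above — puppet run, scap pull, icinga green, then pool. run-puppet-agent and the pool/confctl step are the usual tooling, but treat the exact wrapper names as assumptions:

    ssh mw2259.codfw.wmnet 'sudo run-puppet-agent && scap pull'
    # once icinga is green for the host:
    sudo confctl select 'name=mw2259.codfw.wmnet' set/pooled=yes    # or: ssh mw2259.codfw.wmnet sudo pool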
[16:17:06] okay! I started reading back earlier but I haven't got full context yet
[16:18:24] no rush. We're working alongside pa.paul, who is replacing PDUs - the worklog is at https://etherpad.wikimedia.org/p/F2pDHw25sm7eJS1Q3Tk2
[16:20:22] Rack B6: rdb2008, kubernetes2009-2010, kubernetes2020 are already online again, mc2023 is not
[16:20:43] powering on mc2023
[16:21:00] jelto: maybe the rack is not ready yet
[16:21:38] pa.paul said ~25min ago it's "almost" done
[16:22:13] jayme: pa.paul posted "servers are coming back up in B6" at 15:56. My understanding is they moved to b7. But not 100% sure
[16:22:24] oh, that I missed
[16:22:25] ok
[16:22:42] try to power on then I'd say
[16:23:00] * jayme will uncordon k8s nodes
[16:23:42] mc2023 is online again, I'll wait for the puppet run and icinga and remove downtime then
[16:29:33] I guess we have to turn off mc2025-2026 soon, they are about to finish B7.
[16:30:02] yep
[16:30:06] can you do it?
[16:30:47] I'll do that, powering down mc2025-2026
[16:30:56] thanks!
[16:31:23] rdb2008 is reporting PS redundancy critical. I've pinged papaul but I'm removing the downtime anyways for visibility
[16:32:25] looks like that's expected
[16:33:41] I'm taking care of B6: mw2325-mw2334
[16:40:21] mc2023 done and downtime removed. I'll be afk for ~10m
[16:40:43] ack
[16:42:36] jayme: what can I pick up?
[16:43:43] o/ AIUI papaul is currently working on B8 now (see -dcops && -sre-private for updates from him)
[16:44:45] would be nice if you could pick up the work of shutting down mc2027,2037 when he gets to C1
[16:45:16] as well as ensuring the hosts from B8 and C1 come up again and remove their downtime
[16:45:44] there are "power on instructions" in the etherpad
[16:46:52] lmk if you see anything that's unclear to you - and sorry for the ambush
[16:47:09] jayme: can do -- silly question sorry, what commands are you using to actually power on and off? it's via ipmi, right?
[16:47:38] I just "shutdown -h now"
[16:48:21] sure, and papaul takes care of powering back on when he's done?
[16:48:24] the servers usually come up on their own. If not, you need to do so via mgmt / try reboot-single cookbook
[16:48:30] ahh okay cool
[16:48:40] thanks - I should be all set then
[16:48:56] appreciate everyone's work on this while I was asleep <3
[16:48:59] ❤ī¸
[16:49:55] I'll be around another 10-15min or so, but I really need to get some food then :)
[16:51:00] I'm eating currently and can assist with B8 and C1
[16:51:27] definitely both get out of here and take it easy please :)
[16:56:50] rzl: so you have enough context to do B8 and C1 on your own?
[16:57:40] yeah I think I should be set -- the instructions in the etherpad are complete, right?
[16:57:53] if anything unexpected happens I can ask around with folks in this TZ
[16:58:34] instructions are complete, yes
[17:00:17] 👍
[17:01:39] ok then I'll go offline for today :) thanks for taking over the remaining racks
[17:02:02] cool, thanks Reuven! If anything bad happens, feel free to send a Signal message. I won't be far from the laptop
[17:02:28] sounds good! hopefully won't be necessary :) thanks again both, have a good evening
[17:02:40] have a nice day!
[17:22:55] shutting down mc[2027,2037]
[18:37:53] 10serviceops, 10Content-Transform-Team, 10Maps: Re-import full planet data into eqiad - https://phabricator.wikimedia.org/T314472 (10Aklapper)
[21:41:35] anyone in service ops around to help us deploy changeprop? Re: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/819752/3/charts/changeprop/templates/_jobqueue.yaml
[21:42:03] apologies in advance for our n00bery, we don't do this one very often ;)
[21:43:08] ^ for some context I'd have expected `helmfile -e [eqiad,codfw] diff` to show a diff but it seems like it's not seeing a diff (and fwiw the git repo is in the state it should be)
[21:43:25] whereas with `helmfile -e staging diff` / `helmfile -e staging -i apply` it did see the change
[22:06:16] Seemed like it might have been a problem with the `selector` but that didn't seem to fix things:
[22:06:18] https://www.irccloud.com/pastebin/HN5dBTwX/
[22:14:50] 10serviceops, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Krinkle) >>! In T279664#8123041, @MatthewVernon wrote: > Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` ru...
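For context on the changeprop question above, the usual flow on the deployment host looks roughly like the sketch below; the chart directory name is an assumption, and one possible reason for a missing diff in codfw/eqiad may be that the local deployment-charts checkout (or the published chart version helmfile renders) does not yet include the merged patch:

    cd /srv/deployment-charts/helmfile.d/services/changeprop-jobqueue   # directory name assumed
    helmfile -e staging diff && helmfile -e staging -i apply
    helmfile -e codfw diff
    helmfile -e codfw -i apply
    helmfile -e eqiad diff
    helmfile -e eqiad -i apply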