[07:54:17] Reminder that there's more rack work going on starting 14:30 UTC - yesterday's got delayed quite a bit by not having everything in the affected racks powered down in time...
[07:55:14] codfw racks B3,6-8, C1-2
[08:20:44] <_joe_> Emperor: will these servers also need to be powered off until friday?
[08:32:13] Emperor: o/ in rack B3 there is conf2004 afaics, that runs etcd and zookeeper. Do we know how much time these nodes will be down? With one node down we are ok, but in case another one goes down we may have problems (Kafka should keep going without zookeeper in theory, but at the first need of config/metadata change in zk it will surely complain). etcd will also be problematic for pybal (I see
[08:32:19] hieradata/role/codfw/lvs/balancer.yaml:profile::pybal::config_host: conf2004.codfw.wmnet for example)
[08:33:41] <_joe_> same for etcd
[08:34:05] <_joe_> I think I'll move etcd SRV records to point to eqiad everywhere tbh
[08:34:15] <_joe_> and we need to fully depool codfw from all services I'd say
[08:34:17] _joe_: AIUI, each rack is down for about 30 minutes only
[08:34:42] if so, zookeeper will not be a problem, but we need to do some work for etcd to avoid issues
[08:34:48] elukey: if you need "don't power this rack down until node [x] is back in service", pa.paul can work with that, if you comment on the ticket
[08:36:07] Emperor: afaics only conf2004 is affected, so IIUC there is no risk of another conf2XXX node in the cluster being down at the same time. I was more worried about accidental issues while the maintenance was ongoing (just to know how long etc.)
[08:36:32] but yeah I think we should at least move the pybal config host to another conf2xxx node
[08:36:56] (there is always the risk of not having the node back online for some reason, so more than 30 mins etc.)
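The hieradata path quoted at [08:32:19] is the single point the discussion is about: pybal reads its etcd config from one named conf host. A minimal sketch of the change being proposed — note that conf2005 is a placeholder for "some other conf2xxx node", not a verified host choice:

```yaml
# hieradata/role/codfw/lvs/balancer.yaml
# Sketch only: move pybal's etcd config host off conf2004 (in affected rack B3)
# to another conf2xxx node. "conf2005" here is an assumed placeholder.
profile::pybal::config_host: conf2005.codfw.wmnet
```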
[08:37:38] <_joe_> elukey: that's being handled
[08:37:54] _joe_: okok
[08:37:59] <_joe_> but yeah for the time of the rack being down, we're basically at the mercy of any hw failure in etcd/zk
[08:38:00] elukey: AIUI, the aim is to do the swaps quickly (~30m) which is the point of powering them down
[08:38:48] Emperor: yes yes I completely trust Papaul and his work, I was just reasoning out loud with Joe and you about possible weird scenarios
[08:39:32] fair enough :)
[08:59:11] fwiw thumbor in codfw (and in eqiad for that matter) is running very hot, losing the two thumbor hosts will cause performance issues
[08:59:21] the two hosts are in the same rack unfortunately
[08:59:32] <_joe_> it's not even performance, it's functionality
[08:59:46] <_joe_> Emperor: this is swift related if you want
[09:00:08] <_joe_> one option is we disable thumbnail generation in codfw completely, and we depool codfw's swift from all traffic
[09:00:11] * Emperor really doesn't ;p
[09:00:33] AFAICT there's no obvious way to depool swift from write traffic
[09:00:45] (last time I looked it would involve patching the bowels of MW)
[09:04:01] <_joe_> not the bowels of MW, mediawiki-config
[09:04:06] <_joe_> but originals is ok
[09:04:31] I think if we depool codfw swift (for reads) that'll stop it trying to generate thumbs in codfw?
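The read-depool being discussed above is done per-service against the DNS discovery records. A dry-run sketch of what such a depool/repool script could look like — the service names and conftool selectors here are illustrative assumptions, not the actual production configuration, and the commands are echoed rather than executed:

```shell
# Sketch (assumptions, not the real production tooling or selectors): emit the
# confctl commands that would flip a datacenter's services in DNS discovery.
# state=false depools, state=true repools; echoing keeps this a dry run.
depool_cmds() {
    dc="$1"
    state="$2"
    for svc in swift thumbor; do
        echo "confctl --object-type discovery select 'dnsdisc=${svc},name=${dc}' set/pooled=${state}"
    done
}

# Print the depool commands for codfw without running anything.
depool_cmds codfw false
```

A paired invocation with `depool_cmds codfw true` would print the corresponding repool commands, which matches the "script to depool / corresponding one to repool" split proposed below.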
[09:05:11] IIRC our local rewrite middleware does the thumb generation if it tries to serve a thumb and gets a 404
[09:08:22] <_joe_> yes
[09:08:26] <_joe_> that was my proposal
[09:08:35] <_joe_> we'll still get thumbs from the jobqueue
[09:08:41] <_joe_> but we might manage
[09:09:48] <_joe_> ok so
[09:09:59] <_joe_> I'll prepare a script to depool all services from codfw
[09:10:07] <_joe_> and the corresponding one to repool everything
[09:10:31] <_joe_> frankly, maintenance this extended on a core datacenter needs to be planned a bit better in advance
[09:10:39] <_joe_> we can't consider codfw "standby"
[09:11:01] <_joe_> I am guilty as charged, I didn't realize how much stuff would go down at the same time
[10:01:16] I've updated T310145 and T310070 with the list of hosts affected by today's PDU work, please take a look!
[10:01:16] T310070: (Need By: TBD) rack/setup/install row B new PDUs - https://phabricator.wikimedia.org/T310070
[10:01:17] T310145: (Need By: TBD) rack/setup/install row C new PDUs - https://phabricator.wikimedia.org/T310145
[10:37:00] cumin2002 will need to be powered off later today for the PDU maintenance; if you have active sessions/tmuxes/scripts please move them to cumin1001
[10:37:27] ^ jbond, Emperor, jynus: you currently have active sessions there
[10:37:55] I am not a blocker
[10:38:24] me neither, will close my sessions
[10:38:35] done
[10:38:47] ack, thx
[10:50:16] done
[11:26:36] <_joe_> Emperor: you should see read traffic to swift in codfw drain down now
[11:32:27] grafana agrees :)
[11:55:14] <_joe_> it's magic!
[12:10:01] _joe_: I think that means I don't need to bother separately depooling frontends in affected racks - is that correct?
[12:10:27] <_joe_> Emperor: it will still receive the write traffic from mediawiki
[12:13:19] I think that's true even if depooled individually?
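The 404-driven flow described at [09:05:11] — and why depooling reads also stops thumbnail generation — can be sketched in a few lines of Python. All names here (`serve_thumb`, the dict-backed store, the `generate` callback) are illustrative assumptions, not the actual rewrite middleware:

```python
# Illustrative sketch (assumed names, not the real swift rewrite middleware):
# serve a thumbnail from the store, and on a miss (the 404 case) call out to
# the thumbnailing service, caching the result. No read requests arriving
# means this generation path is never triggered.
def serve_thumb(path, store, generate):
    """Return thumbnail bytes for `path`, generating them on a 404/miss."""
    thumb = store.get(path)
    if thumb is None:             # backend returned 404: thumb doesn't exist yet
        thumb = generate(path)    # hand off to the thumbnailing service
        store[path] = thumb       # write back so the next read is a plain hit
    return thumb
```

On a second request for the same path the stored copy is returned directly, so `generate` runs at most once per thumbnail under this sketch.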
[12:13:29] I might well be wrong though
[12:20:38] <_joe_> Emperor: btw, did I understand correctly yesterday that dcops expects the servers to be powered down by us?
[12:33:58] _joe_: indeed so
[12:34:18] <_joe_> yeah I was confused by the tasks :)
[13:13:44] hi all, I'm going to add two new puppetmasters [12]004 today. For now they will be in offline mode so they shouldn't serve any traffic, but you will see them in puppet-merge if nothing else. Any issues please let me know
[13:28:27] ack
[13:28:37] jbond: can we merge changes right now?
[13:29:26] vgutierrez: yes
[13:29:31] cheers
[13:33:08] jbond: hmm I don't know if it's expected but puppet-merge showed additional changes being applied to puppetmaster[12]004
[13:33:43] https://www.irccloud.com/pastebin/PgjK3Q4v/
[13:33:50] 2003 vs 2004 output for your reference
[13:34:02] * jbond looking
[13:34:46] vgutierrez: yes that is expected, sorry, you beat me. That happened because I hadn't re-synced [12]004 with the most recent sha1
[13:34:54] no problem :)
[13:35:11] from now on though it should be exactly the same :)
[15:27:33] elukey: hnowlan: guys I have 4 servers still on in B6: wcqs2001, ml-serve2006, restbase2024 and maps2009
[15:28:21] ah okok I can take care of ml-serve
[15:29:04] inflatador: for wcqs ^
[15:30:47] I see kafka-logging2003 not responsive, but it seems in rack C-something
[15:31:17] wait wut, I wasn't aware of restbase hosts in B6
[15:31:20] taking it down now
[15:33:06] jayme: i left a message in -search too about wcqs
[15:33:20] nice, thanks!
[15:33:58] jayme: i can ping the rest of the SRE in there if urgent
[15:34:15] herron, godog o/ - kafka-logging2003 seems not responsive on the serial console, ok if I powercycle it? (not part of maintenance but seems down)
[15:35:39] RhinosF1 jayme working on wcqs2001 now
[15:36:00] ty
[15:36:40] (powercycled kafka-logging2003)
[15:40:51] (host up)
[15:53:17] elukey: I thought it was part of the maintenance today. Thanks for taking care of it!
[16:54:48] my memory is failing... I keep trying to use noc.wikimedia.org to find out what the live deployed versions of mediawiki are currently. Where does that live now?
[16:55:08] I suppose I can parse it out of mediawiki-config, but I think there is some wmcloud.org site that does that
[16:55:44] https://noc.wikimedia.org/conf/
[16:55:51] https://versions.toolforge.org/
[16:55:54] ebernhardson: you meant https://versions.toolforge.org/ ?
[16:55:56] damn Reedy
[16:56:14] ahh, yeah I was thinking of versions.toolforge.org, but I hadn't noticed it was also on the conf subpage, that makes sense too. thanks!
[16:56:45] Icinga question that mutante or someone similar might be able to answer: I'm manually cleaning up after T306469 and need to remove the /etc/nagios/nrpe.d/check_uwsgi-striker.cfg check. Is there anything special to do other than rm'ing the config file to make icinga forget about that test?
[16:56:45] T306469: Convert Striker to a container-based deployment - https://phabricator.wikimedia.org/T306469
[17:21:54] bd808: puppet should automatically get rid of it when the definition is removed from the puppet manifests
[17:24:00] taavi: *nod* I do see that ultimately it was an `@@nagios_service()` exported resource, so it makes sense that the icinga nodes themselves would have already lost that when the resource was removed from the server manifests.
[20:23:20] Hello team, out of sheer curiosity, where do the names eqiad, codfw, etc. come from? I was unable to find in the wiki if they have a meaning.
[20:23:45] denisse|m: https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions#Data_centers
[20:24:35] cdanis: Thanks a lot! :D
[20:25:00] unfortunately the wiki will not help you at all with the much more contentious questions of how to pronounce them
[20:25:10] although 'dreamers' seems to have become canonical for 'drmrs'
[20:25:38] Yeah, and 'dreamers' sounds beautiful!
<3
[20:26:08] it sounds much more beautiful than 'code eff dub' or 'ekkiad' for sure
[20:26:28] I think the only big remaining question is "eck-ee-ad or eck-wee-ad"
[20:27:19] I also would have argued for codal over codfw just for pronounceability, but it was 5 years before I joined so nobody asked me
[20:27:33] that would also help eck-dal
[20:27:45] I refuse to even talk about eqdfw
[20:27:55] what about ul-san?
[20:28:04] that's the sort of noise I would make with my keyboard when I'm very excited about something, it's not a real place
[20:28:20] (it is a real place and my colleagues have worked very hard to make it so and I appreciate them very much <3)
[20:28:54] SAN is San Diego, and I *know* you are just messing with me but I have to say it anyway
[20:29:23] DAL meanwhile is a perfectly good airport *in Dallas*
[20:29:26] you're right
[20:29:35] let's switch to the nearest pronounceable airport code
[20:29:49] hey does anyone mind if I depool ul-oak real quick?
[20:30:17] this is a false equivalence because DAL also CLEARLY MEANS DALLAS whereas argh argh argh
[20:30:24] does anyone want a friend? I'm done with this one
[21:23:36] us-east-1, us-central-2, eu-north-3, us-west-4, asia-southeast-5, and eu-south-6
[21:23:48] it's perfect!
[21:24:58] amsterdam = eu-north ???
[21:26:05] I'm *pretty sure* I said it was perfect 😠
[21:29:16] us west isn't in Alaska either ;)
[21:31:49] eu-west is Ireland, I know that
[21:36:24] well I'd say 'west europe' is a much better way to describe the Netherlands than 'north europe' would be
[21:42:30] https://en.wikipedia.org/wiki/Northern_Europe#/media/File:Europe_subregion_map_UN_geoscheme.svg