[10:49:59] T300000
[10:49:59] T300000: Use capitallinkoverride for gadget namespace - https://phabricator.wikimedia.org/T300000
[10:50:48] look at the bug number... we definitely created tasks faster than we can solve them
[12:20:25] I want to use some bashisms in modules/install_server/files/autoinstall/scripts/late_command.sh - is there any reason I shouldn't adjust the #! line to /bin/bash?
[12:21:06] (bashisms - set -E, trap ERR, [[ ]], ;& in case )
[12:21:15] Emperor: this runs in the d-i environment, and I don't think bash is available there
[12:22:11] it's all busybox IIRC
[12:22:22] Bah.
[12:22:57] yep what para.void said AFAIK
[12:23:05] I have been spoiled by FAI, clearly :-/
[12:23:23] what do you need to do? most things are done by puppet later on
[12:25:29] on the ms-be nodes, if [ -b /dev/sde1 ] && [[ $(lsblk /dev/sde1 -no LABEL) =~ ^swift-sd ]]; then mount -o ro /dev/sde1 (to a tempdir), call stat -c %u on mountpoint/objects, ditto -c %g, unmount it again.
[12:25:52] to parameterise the existing call to in-target {group,user}add swift
[12:26:11] is that workaround still needed?
[12:26:15] Yes
[12:26:36] wow, it's been many years I think
[12:26:42] the swift fleet is split between 130 and 902
[12:27:11] new hosts get 902 which is where we want to eventually end up, but in the mean time when we reimage a swift backend, we need to make sure the swift uid/gid match the filesystems
[12:28:08] busybox seems to lack both blkid and lsblk which may make this more tricky.
[12:28:13] yeah I remember that whole saga, I'm just surprised the old uid hasn't been phased out (by our 5-year lifecycle) yet :)
[12:28:30] We have 18 backends with 902, 59 with 130
[12:28:35] this may be gross, but you can write stuff to somewhere in /target/ and then call in-target bash -c
[12:30:37] 18 backends with 902 and 59 with 130 doesn't track with our refresh cycle... I wonder if that means that this d-i snippet hasn't been working
[12:31:12] I mean I guess I could just mount -L swift-sde1 "$tmpdir" and handle that failing, but that's kind of gross; and the lack of an ERR trap means I'll have to set +e and manually check each stat call to avoid the risk of the script exiting with the FS still mounted
[12:31:32] oh https://gerrit.wikimedia.org/r/c/operations/puppet/+/575217/3/modules/install_server/files/autoinstall/scripts/late_command.sh
[12:31:33] paravoid: all the newest nodes have 902
[12:31:35] I see now
[12:32:52] original change was in 2016, but with that change in 2020 we rebooted it, so this will stay around until 2025, got it :)
[12:33:18] (I think that's probably still better than writing bash fragments into /target)
[12:33:40] how long does it take to mass-chown a node? :P
[12:34:12] take it offline, find -uid -exec chown, change /etc/passwd, repool
[12:34:30] god.og estimated a couple of days which is why we haven't done so
[12:34:34] (per host)
[12:34:45] ok :)
[13:22:59] yeah we were so close in 2020 :)
[14:11:13] jbond: I think your last puppet change broke ferm?
[14:11:44] Ah, I just saw that you just reverted it
[14:11:48] Puppet runs fine now
[14:11:52] \o/
[14:14:39] yes just running on the failed nodes now
[15:06:09] lots of mechanical spam in -ops, so echoing here:
[15:06:19] there were two pages for ncredir in drmrs - please ignore them, sorry!
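A rough sketch of what a busybox-friendly version of that late_command.sh fragment might look like, written for the d-i environment described above (busybox ash, no bash, no lsblk/blkid). The swift-sde1 label, the objects/ path, and the 902 fallback come from the conversation; the variable names, the reliance on the installer's mount supporting -L, and the overall structure are assumptions rather than the actual committed change:

    #!/bin/sh
    # Sketch only: reuse the swift uid/gid already present on an existing
    # data filesystem so the in-target useradd/groupadd calls match it.
    set -e
    swift_uid=902
    swift_gid=902
    if [ -b /dev/sde1 ]; then
        tmpdir=$(mktemp -d)
        # No lsblk/blkid in busybox; assume the installer's mount was built with
        # label support (the "mount -L swift-sde1" idea above). The "|| true"
        # keeps set -e from aborting when the label isn't there.
        mount -o ro -L swift-sde1 "$tmpdir" 2>/dev/null || true
        if [ -d "$tmpdir/objects" ]; then
            # guard each stat so a failure can't exit with the FS still mounted
            swift_uid=$(stat -c %u "$tmpdir/objects") || swift_uid=902
            swift_gid=$(stat -c %g "$tmpdir/objects") || swift_gid=902
        fi
        umount "$tmpdir" 2>/dev/null || true
        rmdir "$tmpdir"
    fi
    in-target groupadd -g "$swift_gid" swift
    in-target useradd -u "$swift_uid" -g "$swift_gid" swift

The other route floated above, staging a script under /target and running it with in-target bash -c, would sidestep the busybox limitations entirely, at the cost of splitting the logic across the d-i and target environments.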
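For the mass-chown route (the one sized at roughly a couple of days per host), the per-host shape would be something like the sketch below. The depool/repool steps, the /srv/swift-storage path, and the unit glob are illustrative assumptions, not a tested runbook; only the old and new ids (130 and 902) come from the conversation:

    # Assumes the host is already depooled and taken out of service.
    old_id=130
    new_id=902
    systemctl stop 'swift*'        # stop whatever swift units run on the host
    # Fix ownership on the data filesystems first; this is the slow part that
    # makes the whole exercise take days per host.
    find /srv/swift-storage -uid "$old_id" -exec chown -h "$new_id" {} +
    find /srv/swift-storage -gid "$old_id" -exec chgrp -h "$new_id" {} +
    # Then "change /etc/passwd": move the swift user and group to the new ids
    # (any other swift-owned paths, e.g. logs and caches, need the same pass).
    usermod -u "$new_id" swift
    groupmod -g "$new_id" swift
    systemctl start 'swift*'
    # ...then repool the host.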
[15:11:11] You are ruining Manuel and I's plan to ruin -ops with depool/repool logs 😠
[15:39:37] thanks for the heads up, I was away and just saw those
[16:05:09] jayme: Just a heads up, the eBGP config for the session to kubestagemaster1001 didn't complete properly on cr2-eqiad earlier.
[16:05:23] I had to force disconnect the previous config session and push it through again.
[16:05:35] No issues, config is on the router now but it's only gone in in the past few minutes.
[16:05:46] cr1-eqiad appears to have had the config applied correctly earlier
[17:32:31] Are there documents/manifestos/analysis/etc anywhere answering the question "why does WMF host its own hardware"? 10 years in and I'm still not sure if it's a pragmatic choice or an ideological one.
[17:34:15] it is both
[17:34:47] the budget for our CDN egress alone would be comical at any cloud provider
[17:35:05] pa.ravoid can speak to all of this better than I can though
[17:35:33] the place that _should_ be is https://en.wikipedia.org/wiki/Wikimedia_Foundation#Technology I think
[17:35:43] it talks about how we moved from 1 server to multi-dc
[17:36:27] the original reason to move out of Florida: "citing reasons of "more reliable connectivity" and "fewer hurricanes".[62][63]" :p
[17:37:07] oh same
[17:37:17] lol
[17:42:45] hurricanes sucked
[17:42:50] cuz we went to genny power during some
[17:43:00] and that building couldnt run servers and the hvac properly so hvac suffered
[17:43:02] so servers overheated
[17:43:25] https://web.archive.org/web/20110806105849/http://leuksman.com/log/2009/02/12/your-donations-at-work-new-servers-for-wikipedia/
[17:43:25] many a hurricane i was onsite with the door to the hall propped open and a box fan pointed at our racks
[17:43:39] at least 3 times.
[17:43:54] then i just got to sit there cuz the door was open and no real security otherwise.
[17:43:55] "the" mysql database
[17:43:59] that sounds like a shitty action movie to be featured in robh
[17:44:15] cdanis: to be clear, I'm not soliciting opinions here (as I have some as well) but am wondering if those opinions are actually written down anywhere
[17:44:21] like die hard in that its a balding white dude in an office building on holidays he doesnt wanna be, yep
[17:44:28] but no machine guns and i kept my shoes on
[17:44:46] I would guess you could find some of that somewhere in the archives of wikimedia-l
[17:44:50] and the only german involved was Jens and he was on our side.
[17:45:01] no hans gruber
[17:45:18] similar to https://craphound.com/overclocked/Cory_Doctorow_-_Overclocked_-_When_Sysadmins_Ruled_the_Earth.html though :)
[17:45:20] andrewbogott: yeah, and I know that pa.ravoid has done a bunch of actual analysis on this at a few different points :)
[17:47:19] one time the parking garage of 200 franklin clogged its drainage and i couldnt pull in to park
[17:47:48] had to park on street and run inside, luckily the parking garage gen sets were on level 2.5 so no danger to them being flooded
[17:47:56] that building sucked.
[17:49:50] * bd808 is reminded of generators in the floodplain at $DAYJOB-1
[17:50:38] I would say search wikimedia-l archives (search works since mailman3) for words like "cloud" to find some old thread about it.
[17:50:45] by the way this made me find "
[17:50:47] [Foundation-l] (press release) Wikimedia Foundation selects Watchmouse monitoring service "
[17:50:58] 10 years, 11 months ago. for the people shutting it down these days :)
[18:00:50] Whenever I see https://xkcd.com/705/ - I tend to think of robh.
[18:01:50] ;) nice
[18:02:13] lol
[18:04:57] mutante: I think that if the only record of our priorities is in a mailing list archive that's a pretty solid 'no' to my original question :)
[18:06:34] well, at least that was public and searchable as opposed to a google doc
[18:07:13] true!
[18:09:09] to really answer your question I think the truth is it's spread out across wiki talk pages, wikimedia-l, the old "staff" mailing list we lost before a certain date, IRC chat logs that may exist on the wmfbot, old office watercooler chats, and there is no obvious single place to find it. and it's a good point that there should be one
[18:11:19] not sure if a bunch of tech writers we hire could do that, using all those sources
[18:22:50] <_joe_> andrewbogott: if your question is "is there a policy against using a cloud", I don't think there is one. The strongest argument against using a cloud *properly*, which means using all of their actual services, is the WMF's guiding principles. So the choice to not use the cloud has always been guided by considerations of cost, freedom/openness, avoiding vendor lock-ins, privacy protection, security (not
[18:22:52] <_joe_> necessarily in that order)
[18:23:53] <_joe_> as cdanis said, others have better evaluations, but I think I went on record several times in public speeches on why we aren't using cloud infra right now
[18:24:55] (FWIW I also think there are some parts of our infra where it would make a lot of sense to use a public cloud)
[18:25:16] (although I'll stop giving my opinion there, for a few reasons ;)
[18:27:16] <_joe_> oh, yes, of course all of the above applies to "hosting our whole infra for wikipedia on the cloud"
[18:28:28] Krinkle: this tracks.
[18:28:37] particularly the muttering to oneself
[18:28:44] walking through datahall
[18:56:47] yeah there's a lot of legal / censorship implications to running various parts of our stack in public clouds, too
[18:57:05] things would definitely work out to our and our communities' detriment in some edge cases
[18:57:26] you can pretty safely carve out some narrow exceptions to our public-cloud-aversion, though
[18:59:07] if the code/project/service is (a) not handling PII of real human users' live queries + (b) not critically important to the actual operation of the sites, at least
[18:59:29] we can imagine monitoring services, or anything that just crunches public data of ours to generate other public data
[18:59:46] or simulations, test environments that use fake user data, etc
[19:01:53] but I think there are a layered number of legitimate arguments against having anything that's in the live flow of site service and/or handling PII from being in public clouds.
[19:03:01] [and we have had to defend that formally in the past. A succession of new-at-the-time C-levels have asked about it, we've written a document to explain it to them and then re-used it for the next one, etc. I don't think any of that was ever really "public" though]
[19:03:31] <_joe_> that's kind of a pity
[19:04:39] the traffic edge sites have been one of the hottest topics for this. They're relatively stateless, and would greatly benefit from being able to operate from public cloud edges cheaply in lots of places quickly.
[19:05:23] of the key arguments against it in that particular case, the ones that stand out (all are mentioned to some degree above):
[19:06:28] 1) Bandwidth cost - we pay a lot less for our transit links than the cloud providers would charge for the same. This is a weak argument though, as likely some of the providers would either (a) be willing to donate the bandwidth (or all) costs or (b) offer a bring-your-own-transit-contracts/links option.
[19:07:49] 2) The PII-exposure risk - no solution to this is perfect (including if you ran your own hermetically-sealed and faraday-caged buildings), but there's certainly a higher level of risk in running on a public cloud versus deploying our own hardware in shared datacenters.
[19:08:28] the risk we face the way we operate today is more on the level of someone with a lot of resources pulling off a stealthy physical attack on servers in a public space.
[19:08:31] what a backscroll to read
[19:09:16] but in a public cloud, there's a lot of questions both about employees of the cloud service accessing from underneath, and of course all the modern side-channel attacks that might work from other customers' instances.
[19:10:12] There's also the question of a .gov agency anywhere in the world legally compelling the cloud provider to provide access to our PII from beneath the VM layer, which they're probably more willing to comply with than our org would otherwise be on receiving a similar letter, because they want to preserve business access to everywhere, etc.
[19:11:26] 3) The censorship angle. Basically like the $random_government compulsion for PII, they could also compel one or more cloud providers to block access from various countries, specific networks, or specific IPs, to our projects, perhaps without even letting us know or involving us in the decision. All the same interests and compulsions apply.
[19:12:26] it's not that we completely lack related risks the way we operate today, it's just that operating from public clouds greatly increases the attack surface of it all, and reduces our ability as an organization to even have a grip on whether and how it might happen.
[19:14:22] (sorry for the rampant typos, it's what I get for trying to type fast from an odd keyboard angle while eating!)
[19:16:11] and all of that's from the angle of running our own CDN stack on public cloud VMs. I think the arguments get stronger if we were actually to use SaaS stuff like Cloudflare's CDN, instead of just hosting VMs.
[19:24:08] there are some options in between colocation and public cloud virtual machines that might help mitigate some risks too, e.g. dedicated bare metal "cloud" hosts, which are just bare metal servers provisioned via a provider's API and billed hourly/monthly/whatever
[19:24:39] yeah but we'd still need our own address space and transit, I think
[19:25:12] it would just be a swap from "negotiate a dc contract and ship all the purchased hardware" to "negotiate a service contract and spin up hardware that's already there and leased to us" or something
[19:25:35] might buy us some speed, would greatly depend on the details of how it worked out
[19:26:18] yeah totally depends, but if the terms were workable, in theory it could buy speed in deployment/scaling into many different locations
[19:28:13] there's probably still some risk tradeoff sliders too, it highly depends :)
[19:29:10] in any case, when it comes to the edge pops, our rough strategy has been for a long time to only build out a limited number of them like we have today (but we'd still like a few more than we have now)
[19:29:47] and then branch out from there with a "mini-pop" concept that would be cheaper and smaller. 1-2 hardware nodes that can be placed in simpler hosting in further-flung places quickly, and be less resilient.
[19:30:22] and a "metal cloud" sort of offering might work for those, too [19:30:45] but in either case, one of the hurdles is going into a less-secure hosting environment and not putting our TLS private keys at greater risk as well. [19:31:10] cloudflare has some blog posts on "Keyless SSL" that talks about how to do that (how to keep the private keys off the edge servers). [19:31:45] but not all parts of the tech needed to make that work are open sourced. it's been on our back-burner list for some time to see if we could ever get something like that up and running for us feasibly, to enable taking more risks with deployments. [19:33:03] the idea with the real-vs-mini pops is that a reasonable network of real pops can provide a baseline in terms of traffic handling and latency and serve as our ultimate fallback/default. [19:33:55] the minipops would be smaller in scale/scope, serving one large customer network or a few small nearby countries. they'd backhaul to the nearest few real pops, and they'd always be optional (in the sense they can fail and we can disable them with no more than just an accept user-perf impact) [19:34:12] *acceptable [19:34:38] but even in a world where we have lots of these cheap mini-pops, I think we're still missing ~3-ish real pops from our ideal state today. [19:34:59] (a couple in south america, and one more in asia, preferably more west asia to balance SG) [19:35:44] "no more than just an acceptable user-perf impact" -> adding to my list of use cases for client-perspective SLOs, thanks :D [19:36:08] :) [19:36:35] also, 3 more pops neatly lines up with the end of our short-sighted 1-digit datacenter numbering scheme, so it's perfect [19:36:51] drmrs took 6, so 2x south america + 1 more in asia are 7, 8, and 9 [19:38:08] okay, but how about this, as a shim out of our short-sighted 1-digit numbering scheme [19:38:08] just tell everyone that you planned to use hexadecimal in the datacenter number from the beginning [19:38:33] the problem is that we can't use e.g. "foo10000" because it would match an ill-advised "foo1*" someplace, right? [19:38:48] so why don't we go 5, 6, 7, 8, 90001, 90002, 90003, etc [19:39:06] plenty more digits to work with, and nobody has hardcoded "9*" anywhere because there's never been a 9 [19:39:13] it's foolproof [19:39:36] we could start counting backwards too foo0001 foo-1001 foo-2001 [19:40:00] or we just retcon that we meant to use hex in the first digit [19:40:15] yeah taavi's plan is good too [19:40:35] oh yeah, sorry I was typing as I was reading the first new line :) [19:41:56] so what I'll do is, I'll introduce a daemon called puppetd [19:42:01] and after you've used all hex characters, you get base 64 and 'cp/0001.foo.wmnet' [19:42:06] and I'll run it in datacenter #11 [19:42:19] that way I'll have puppetdb001, and it's totally different from puppetdb1001 [19:42:29] tbh after 10 it's probably too many to keep track of, the fqdn has it for wmnet hosts and maybe site specific subdomains for wikimedia.org could become a thing too [19:42:48] puppetdb👁001 [19:43:12] puppetdb⚠️001 [19:44:26] hmmm emoji flags... [19:45:43] that seems like the perfect solution to remembering which DC is where :D `cp🇫🇮001.🇫🇮.wmnet` is the first cache proxy in Finland and so on [19:46:49] relatedly, while it's not actually "in production" yet and no real users should be using it, FWIW if you hack your DNS or your curl commandline, drmrs does have live wiki service now. 
[19:46:56] bblack@haliax:~/repos/puppet$ curl -s --connect-to ::text-lb.drmrs.wikimedia.org -I https://en.wikipedia.org/wiki/Foo |grep x-cache
[19:46:59] x-cache: cp6016 miss, cp6011 hit/7
[19:47:01] x-cache-status: hit-front
[19:47:14] cool!
[19:47:54] (not the least of the reasons it's not live yet, is that we don't have "real" routing yet or even real routers, just a default route out one working transit)
[19:51:00] nice!
[20:05:06] bblack: WMF teams frequently decide to use public clouds for various projects. It sure would be nice to have something public we could refer them to about when it is/isn't OK to do that.
[20:21:54] yeah I know
[20:22:56] beyond that, I think it would be helpful if we acknowledged that public cloud use is a reality and had some team (preferably in SRE) managing our use of them centrally, so it's not ad-hoc. org-level accounts that are shared, policies and data transfers that make sense, shared IAM, etc...
[20:23:10] but getting all that running is also a heavy lift
[20:23:39] we don't really have the resources sitting around idly to take that on
[20:40:01] FYI - I'm merging a puppet change that will add some new icinga checks for drmrs, which I'm then going to downtime
[20:40:19] the plan is to control the puppetization, catch them early, and downtime before they alert
[20:40:28] but just fair warning, and sorry ahead of time for any false alarms!
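On the curl trick shown earlier in the log: --connect-to takes a HOST1:PORT1:HOST2:PORT2 tuple where empty fields act as wildcards, so the request keeps its normal Host header, SNI, and certificate validation for en.wikipedia.org while the TCP connection is steered to the edge site you name; unlike --resolve it accepts a hostname rather than an IP. A generic form of the same check (the text-lb.drmrs.wikimedia.org endpoint is the one from the session above; the same pattern with another site's text-lb name is an assumption, not something tested here):

    # HEAD request for a wiki page, forced to connect to a specific edge site,
    # then show which caches answered and whether it was a frontend hit.
    curl -sI --connect-to ::text-lb.drmrs.wikimedia.org: \
        https://en.wikipedia.org/wiki/Foo | grep -i '^x-cache'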