[01:42:23] * bd808 off
[08:43:50] morning
[09:06:27] morning
[10:12:12] morning, still not 100% here, but more or less around (sorting out paperwork)
[10:35:53] folks we have our regularly scheduled meeting later today, but we also have a meeting tomorrow on design stuff / longer-term planning
[10:36:19] do you think we need both? I've not much day-to-day stuff for today's one so I was thinking we might be able to skip it
[10:36:29] but happy to do it if we think it's needed
[10:46:52] hey, on my side I have nothing either, +1 from me to skip (non-binding xd)
[10:52:54] topranks: the only day-to-day thing that comes to my mind is how to proceed with the cloudsw updates, so up to you and dcaro if you want to talk about it today or not
[10:53:02] a.rturo is also out today
[10:57:09] ok guys let's skip it then I think
[10:58:27] I roughly know where we are with the switch upgrades and the issues with ceph, let's double back on that in 2 weeks' time and decide how to proceed
[13:35:15] taavi: I've merged the patch to allow disabling cron, you can deploy the Hiera change for cloud vps
[13:46:30] topranks: just to double-check, you'd like me to reschedule today's network meeting for two weeks from now?
[13:47:36] andrewbogott: if you can yes, but we can have it if you think we should
[13:47:49] general consensus was to move it out and discuss the design stuff tomorrow
[13:47:57] rescheduling sounds good to me
[13:49:29] Done, google permitting (now it's in the midst of daylight confusion time)
[13:49:54] huh, I hadn't even realised, thanks for the heads up :)
[13:52:00] Also I got a lot of auto-declines for the new time but I'm not sure why. That's not in the middle of the sre summit, is it?
[14:13:32] I'm on holiday that week but that's just me
[14:22:47] 'k
[14:30:18] taavi: are you interested in a case where the cloud-vps proxy api returned a 500? Or can you point me in the right direction of the logs for that?
[14:31:43] proxy api? either way is fine with me
[14:32:10] the logs would be on proxy-03.project-proxy, journal for uwsgi-invisible-unicorn.service iirc
[14:32:50] I'll look there for starters
[14:42:17] hm, 'MySQL server has gone away'
[14:42:45] hmm, so some timeout probably needs updating then
[14:43:57] hopefully
[14:44:21] Full stack trace is in T358672 but I don't think there's anything more to learn than that
[14:44:22] T358672: cloud-vps dynamic proxy returns 500 - https://phabricator.wikimedia.org/T358672
[14:53:33] we're thinking that flask_sqlalchemy just opens one db connection and holds it forever?
[15:17:30] andrewbogott: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007354
[16:24:12] while lying in bed I have been pondering the question: why not run toolforge on hardware servers? I can't follow up now, just wanted to share the question that is otherwise entertaining me and helping me get through the fever
[16:26:00] in the checkin yesterday we briefly discussed a similar idea :)
[16:29:26] are you thinking of a k8s cluster running on metal servers instead of VMs?
[16:30:29] to me the question is more 'why would we?' as we have an awesome compute-as-a-service platform which seems much more flexible
[16:35:34] andrewbogott: puppetserver-deploy-code
[16:38:18] taavi: yes I'm also not sure about the pros/cons. the only compelling scenario I can think of is the one where a "k8s team" in WMF creates a k8s setup we can easily reuse, and which is still flexible enough.
[16:52:50] Do we have usage data on our hypervisors?
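
On the 'MySQL server has gone away' discussion above: the usual fix for that class of error in a Flask-SQLAlchemy service is to recycle pooled connections before MySQL's wait_timeout closes them, and/or to ping connections before handing them out. A minimal sketch, assuming the proxy API is an ordinary Flask-SQLAlchemy app (the app structure and database URI here are hypothetical, and the actual fix in the gerrit change linked above may differ; pool_recycle and pool_pre_ping are standard SQLAlchemy engine options):

    from flask import Flask
    from flask_sqlalchemy import SQLAlchemy

    app = Flask(__name__)
    # Hypothetical connection URI; the real one lives in the service's config.
    app.config["SQLALCHEMY_DATABASE_URI"] = "mysql+pymysql://user:pass@db-host/proxydb"
    app.config["SQLALCHEMY_ENGINE_OPTIONS"] = {
        # Recycle pooled connections after an hour, well under MySQL's default
        # 8-hour wait_timeout, so the pool never hands out a long-dead connection.
        "pool_recycle": 3600,
        # Probe each connection on checkout and transparently reconnect if the
        # server has dropped it in the meantime.
        "pool_pre_ping": True,
    }
    db = SQLAlchemy(app)

With pool_pre_ping enabled, SQLAlchemy issues a cheap SELECT 1-style probe before reusing a pooled connection and replaces stale ones transparently, which covers the "one connection held forever" failure mode described at 14:53.
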
[17:05:56] We don't have usage-over-time data, because collecting it turned out to put a huge load on rabbitmq. We do have a 'right now' usage dashboard someplace
[17:06:43] And I agree with your comments that the numbers in that doc are very hand-wavy, I'll see if I can refine them and/or add some more qualifiers.
[17:12:45] My concern with the document is that it appears to be written with a bias, rather than in a disinterested voice. If that is the intent it is probably alright, though it didn't seem to be
[17:16:40] Yeah, I meant it to be more of a manifesto than a balanced comparison of options :) I would still like the numbers in it to not be completely false though!
[17:17:31] Well, to be more specific: IMO decisions like this should be primarily based on ideology like privacy/security/foss/etc rather than pragmatics like money and ease of use.
[17:17:56] So the 'Value' section is meant to be more of a "and anyway it doesn't cost us more" backstop rather than an actual motivation
[17:18:00] does that make sense?
[17:18:44] That's the issue, I think it does cost us more
[17:19:14] ok! I could definitely believe that -- when I wrote all that I was taking Nicholas's assertions as gold
[17:19:40] But you may be the only one here who actually understands aws pricing
[17:20:29] I'm also leaning heavily (mentally) on https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0 which asserts that for non-bursty loads the cloud is Too Expensive.
[17:21:45] Oh I've argued about those with him before. The first problem is mentioned in the comment: those are "spot instance" prices, i.e. the ones you rent by the minute, rather than the equivalent of owning them "reserved", which cuts 1/3 off the price right away. The other, larger, issue is we run our systems dead cold. We have maybe 10% utilization. If we don't have to worry so much about sudden need (because we could just add it) we
[17:21:46] could cut our footprint down to probably 1/5 of what we're running. So that's 0.66 * 0.2 * 900000 = $118,800/year in AWS
[17:22:42] And our 200k/year is also not accurate, as it is just the amount we pay for hardware, not the effort, energy and space that goes into that hardware. With AWS (or whatever cloud provider) that is all included
[17:23:51] when you talk about 10% utilization, you mean VMs that have reserved cpu and ram but aren't currently consuming them, right?
[17:24:27] So in the scenario where we 'reserve' hosts, are we reserving hypervisors which /we/ carve into VMs, or renting the vms themselves?
[17:25:08] now I'm just getting curious about how aws pricing works :)
[17:25:37] That's a different question, in my view we don't need nearly as much hardware if more is easily and rapidly available (as it would be in a hosted cloud provider)
[17:26:07] Basically they charge you by the minute, or by the year; if you rent it for the whole year it is cheaper per minute
[17:26:59] While I think we would do well to get rid of openstack, I haven't factored that into anything that I'm saying so far. My comments assume we would still manage openstack, just on hardware that we don't deal with
[17:28:05] Got it, so you're talking about more of a 'metal as a service' scenario. And that we wouldn't have to pay to keep slack in the inventory since we can expand our footprint on demand.
[17:29:46] Which sounds pretty good to me, in many ways :)
[17:30:10] Yes, I'm discussing it as though we lifted our whole infra and put it in a cloud. So a cloud within the cloud at that point
[17:30:54] got it. That makes this an easier apples:apples comparison.
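
Spelling out the back-of-the-envelope arithmetic from 17:21, since those figures anchor the rest of the discussion (the $900,000 base comes from the conversation above; the 1/3 reserved discount and 1/5 footprint reduction are the assumptions stated there, not measured numbers):

    # Rough AWS compute estimate as described above, written out for checkability.
    on_demand_annual_cost = 900_000  # $/year at the list prices being disputed
    reserved_factor = 0.66           # reserving for a full year cuts ~1/3 off
    footprint_factor = 0.2           # ~10% utilization -> run ~1/5 of the fleet

    estimate = on_demand_annual_cost * reserved_factor * footprint_factor
    print(f"${estimate:,.0f}/year")  # -> $118,800/year
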
[17:31:14] I'm going to rewrite that section and then I'll nudge you for a reread if you don't mind.
[17:34:39] Sounds good. From what I'm seeing of current usage, we're using about 10% cpu and about 34% memory. So the suggestion was that we could easily cut the cpu purchase in half and provide the same service. And in the case of AWS, if you get larger systems they eventually give you "free" extra memory (1 cpu : 4G); see https://aws.amazon.com/ec2/instance-types/t3/ for an example
[17:35:12] Which would give us more like a 20% cpu : 34% memory balance, which is much better than what we have currently
[17:40:12] Rook, do you have any thoughts about transit costs? bd808 suspects that transit costs would gobble up all our money in a public cloud, since right now we rely heavily on free peering (and, of course, we're co-located with the primary endpoint for most wmcs transit, which would no longer be true in a cloud)
[17:40:53] How much data do we move?
[17:44:09] I fear that any answer I give you for that would be so hand-wavy as to be useless :( And also transit costs in a public cloud would depend on a ton of decisions, e.g. whether or not the replicas stay in eqiad.
[17:44:23] * andrewbogott hoping bd808 will pop in with more input
[17:44:49] btw rook, do you have similar magic to reduce Nicholas's storage numbers, or are they closer to accurate?
[17:44:50] Is it in the TB/month?
[17:45:57] https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer would be the chart. Basically if we're under 10TB/month it would be less than $900/month; it gets a little cheaper per GB after that. Oh, and apparently that is only for data out, not in; inbound is apparently free
[17:46:18] actually, storage basically isn't broken down in that chart, it assumes only local hard drives on each VM
[17:46:28] there's nothing in there at all about cinder or db storage as far as I can see.
[17:46:29] I did look at the storage at some point...what were the numbers...
[17:47:18] How many TB of storage do we have?
[17:47:42] According to the chart it is 46T, though it sounds like we have more?
[17:48:23] https://aws.amazon.com/ebs/pricing/ at any rate has some numbers
[17:48:51] hm, this dashboard is baffling, but around 90TB
[17:49:16] Replicated, but I assume if we purchase storage from a cloud provider they worry about keeping it replicated?
[17:49:49] Probably 90k/year for General Purpose SSD
[17:50:28] I don't know what guarantees aws has. Would of course vary by the provider
[17:51:25] I got 170k but it's the same order of magnitude at least...
[17:51:38] (that was for 90 1TB volumes, which is maybe an expensive case)
[17:52:03] Oh I just multiplied up the per-GB cost
[18:24:06] Rook: ok, I rewrote the section, it's still extremely vague but hopefully will read as less obviously wrong to you
[18:24:33] Getting fully reliable numbers would be a pretty huge effort
[18:26:44] I agree accurate numbers are going to be hard, especially if we're thinking of using some other cloud's services rather than building our own
[18:26:54] Reads honest to me. Thanks for updating it
[18:40:07] thanks for reading!
[18:49:00] * bd808 lunch
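
A sketch of the storage and transfer arithmetic above, showing where the competing estimates plausibly come from (the per-GB figures are assumed us-east-1 list prices and may be stale; the 90TB and 10TB/month numbers come from the discussion; the $90k vs $170k gap most likely reflects different assumed volume types, but that is a guess):

    TB = 1_000  # GB; decimal units are fine at this level of precision

    # EBS General Purpose SSD for ~90 TB
    # (assumed prices: gp3 ~$0.08/GB-month, gp2 ~$0.10/GB-month)
    storage_gb = 90 * TB
    print(storage_gb * 0.08 * 12)  # ~$86,400/year, close to the ~$90k estimate
    print(storage_gb * 0.10 * 12)  # ~$108,000/year on the older gp2 tier

    # Data transfer out to the internet (ingress is free): first tier ~$0.09/GB
    egress_gb_per_month = 10 * TB
    print(egress_gb_per_month * 0.09)  # ~$900/month, matching the figure quoted above
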