[01:53:17] ryankemper: is it being decommed or something?
[02:40:07] bblack: no not permanently but it might take a few days to resolve the underlying problem
[07:09:04] Traffic, SRE, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Legoktm) >>! In T294800#7473846, @Joe wrote: > If anything, I think we should go in the other direction, and progressively and drastically red...
[07:32:28] Traffic, SRE, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Joe) >>! In T294800#7480542, @Legoktm wrote: > Sidenote, I wonder if we can get some basic stats from the envoy metrics about how many POST r...
[07:34:51] Traffic, SRE, serviceops, Performance-Team (Radar): Reconcile MediaWiki POST timeout and Varnish/ATS timeouts - https://phabricator.wikimedia.org/T294800 (Joe) Let me add another data point: Of those 8 requests over 175 seconds, only 2 were to POSTs to Special:Upload.
[08:11:39] Traffic, Commons, MediaWiki-Uploading, SRE, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (Legoktm) @AlexisJazz is it OK...
[09:12:58] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster
[09:17:12] ^^ completing reimaging the 2 ulsfo nodes cp403[46]. Expected Varnish reachability alerts
[09:22:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp4034:9331 is unreachable - https://alerts.wikimedia.org
[09:47:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4034:9331 is unreachable - https://alerts.wikimedia.org
[10:01:26] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4034.ulsfo.wmnet with OS buster completed: - cp4034 (**WARN**...
[10:28:36] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster
[10:38:57] (VarnishPrometheusExporterDown) firing: Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org
[10:48:57] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4036:9331 is unreachable - https://alerts.wikimedia.org
[11:26:00] Traffic, SRE, Patch-For-Review: Deploy Wikidough: Experimental DNS-over-HTTPS (DoH) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh)
[11:28:36] Traffic, DC-Ops, SRE, ops-ulsfo: Q1:(Need By: TBD) rack/setup/install cp403[3-6].ulsfo.wmnet - https://phabricator.wikimedia.org/T290694 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp4036.ulsfo.wmnet with OS buster completed: - cp4036 (**WARN**...
[12:20:58] Traffic, Commons, MediaWiki-Uploading, SRE, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (AlexisJazz) >>! In T280926#74...
[12:27:17] vgutierrez: I don't know if it's on your radar but I got "Let's Encrypt certificate expiration notice for domain "mx1001.wikimedia.org""
[12:30:49] Amir1: usually this happens when we add/remove a domain from the certificate's SAN
[12:31:25] okay, if it's ignorable, then my bad. Sorry
[12:31:36] the way LE works is that if the list of domains changes, the new cert is considered different from the previous one, so we're leaving the old one to "expire" but it has been substituted by the new one
[12:31:47] what we usually do is to check/know if that has happened recently
[12:32:26] or check the cert
[12:33:59] Amir1: in this case the cert expiring has mx2002 that was removed AFAICT
[12:34:06] see puppet's 3e2a9f26cb
[12:34:39] (SNI list)
[12:35:12] yeah, mx2002 was just a temp install
[12:48:24] Indeed
[13:47:55] XioNoX: we now have our dns instances for drmrs included in our puppet, ready to image them; not sure if we already have drmrs' basic public routing
[13:48:39] mmandere: we don't have the routers yet, so no public routing for now
[13:49:24] mmandere: is it a blocker?
[13:49:47] XioNoX: can we do some kind of temporary routing for now? All we need is the space advertised out of one random working transit or whatever, and a manual static route out of some interface (maybe the same for simplicity)
[13:50:00] yeah, it blocks DNS/NTP/etc infra from coming up normally and working
[13:50:54] bblack: setting up routing is not the problem, but setting up all the firewalling around it is
[13:51:15] bblack: the switches don't have the same ACL capacity as the MXs for example
[13:51:29] ok
[13:51:49] well, we're mostly talking about the public-side ACLs I guess
[13:52:08] yeah
[13:52:34] plus the complexity of setting up that one-off routing exception
[13:52:35] the dns6 nodes (recdns+authdns+ntp) are the ones that need the access, and they do have ferm rules and such.
[13:52:40] bblack: note that 10.3.0.1 in eqiad should be reachable from drmrs
[13:52:54] yes, it is
[13:53:45] we're just trying to get the infra up in-order and using its own local parts as designed
[13:54:46] there's not much point bringing up lvs/cp right now without routers, either. bast is also useless without routing, and wikidough, and I guess we could bring up ganeti clusters and put prometheus on them. The dns6 are useless without it as well.
[13:55:47] so yeah, there's not a lot of pragmatic progress we can make without it, aside from doing some temporary "use eqiad for infra services" patch and bringing up the ganeti clusters and running some meta-instances there like prometheus.
[13:55:49] getting them up to test the hardware and have everything ready to roll when we have external connectivity
[13:56:53] right, but we could do that just placing all hosts into role(insetup_noferm) (and that's probably simpler really, since so many things in their normal roles would be broken anyways).
[13:57:09] could we have all traffic from drmrs somehow routed through eqiad for example???
[13:57:29] just for testing ofc
[13:57:31] I guess we have some mismatched expectations here: I wasn't expecting full-on BGP and ability to optimize, etc, but I was expecting to be able to do basic public-subnet routing so we can move forward with all the real puppetization+testing
[13:59:38] bblack: we were supposed to receive the routers around this time, but they got postponed to either late December or March
[13:59:55] yes, I'm aware of the lack of routers.
[14:00:10] so the expectations have to change
[14:00:30] but perhaps volans' idea might work? the drmrs public nets could be added to the general ACL stuff in router cfg mgmt, and then have drmrs do its public routing back through eqiad?
[14:00:34] I donno
[14:00:55] yes, that's one of the options
[14:01:00] XioNoX: yes, we already absorbed that change, I thought, but I was still expecting we'd be able to do a basic default route out a working transit.
[14:02:39] we're looking at the alternatives we have, for temporary testing as well as production
[14:05:32] (also, dns600[12] have ferm for everything that's not public-access anyways. Do we need ACLs as a blocker? if there was some genuine attack traffic causing issues we could always just drop routing since the site isn't live).
[14:08:30] bblack: even just to protect the routers themselves, yes. And it's not just attack as DDoS, but the routers could get compromised, etc...
[14:09:49] bblack: would it be a temporary test or something that needs to be on all the time?
[14:10:36] we weren't intending to be temporary, no. If it were temporary, we'd probably break some things when it's turned back off and have to silence alerts / reconfigure things / etc just to avoid notifications or log spam or whatever.
[14:11:26] for the routers/switches, could just drop a simple ACL there to not allow any traffic to them on the public side?
[14:11:41] I donno
[16:06:57] bblack: re-reading the conversation, I want to clear up some miscommunication :) First it's not "just" or "simple", whatever option will need to be thought out properly. Rushing it could bring security risks, or instability in our infra. We're looking at the option but it won't be done overnight (probably within next week, or at least we will have better clarity then). Especially since the preferred options (and the time we invest in a temporary solution) change depending on how far away the routers are. Furthermore I was under the impression that this was not a hard blocker to set up most of the site. If that's true it might be a good thing to do while we figure out the public routing side. If not we can only be more patient
[16:08:55] For example if the routers won't be there before March, it might be better to focus on a solution that can serve prod traffic before then, etc.
[16:08:57] XioNoX: either way, we have time to figure out what the best path is all around. For now, we'll proceed with ganeti6xxxx as our first installs to sort out basic dhcp/tftp/installer issues, since they're private-network anyways. We can circle back to this e.g. Monday/Tues.
[16:09:24] see how it goes and what changes next week
[16:10:50] cool, yeah, thanks!
[16:16:43] Traffic, Commons, MediaWiki-Uploading, SRE, and 3 others: Various errors when trying to upload large files (Could not acquire lock, Service Temporarily Unavailable, 503 Backend fetch failed, 502 Next Hop Connection Failed) - https://phabricator.wikimedia.org/T280926 (Legoktm) Open→Resolved
[16:16:46] FYI I'm out tomorrow as usual and will be out on Monday