[10:33:31] 10Wikimedia-Apache-configuration, 06serviceops: 2030.wikimedia.org is a double redirect - https://phabricator.wikimedia.org/T367013#9878895 (10akosiaris) Thanks for the historical aspect @Dzahn. Given that perspective and the fact the double redirect isn't apparently considered a problem, my opinion is that it... [11:17:42] hello traffic. I'm about to remove a service from LVS (https://phabricator.wikimedia.org/T345274) according to https://wikitech.wikimedia.org/wiki/LVS#Remove_a_load_balanced_service - any objections? [11:18:32] jayme: no objections. if you need someone from Traffic to be around, we can do so in an hour [11:19:17] sukhe: I think I'm good (having done it a couple of times), thanks. ofc. now I jinxed it and it will go up in flames - sorry ;) [11:19:40] jayme: :) feel free to go ahead then [11:20:34] ack [12:46:21] jayme: is that service running on eqiad? [12:46:46] please don't restart pybal on lvs1020 without pinging me first [12:47:19] oh... that was ~1h20m ago [12:47:24] I'm assuming it's done :) [12:47:36] yeah per SAL too [12:47:43] https://sal.toolforge.org/log/gxQjB5ABhuQtenzv0sVH [12:48:04] was pybal restarted? [12:48:18] cookbook wasn't used [12:48:22] and it hasn't been logged [12:48:28] Active: active (running) since Tue 2024-06-11 11:38:02 UTC; 1h 10min ago [12:48:31] if a pybal was restarted in the woods and no-one gets p.aged...? :) [12:48:45] sukhe: sigh [12:49:50] jayme: next time please be verbose about restarting pybal :) [12:51:40] uhm...I logged everything [12:52:05] where? [12:52:07] and the cookbook is not in the docs... [12:52:16] -operations [12:52:27] failing to see any mention of pybal/lvs instances [12:52:46] ah, ofc stashbot was failing all the time [12:55:05] sorry, I did not bother checking if stashbot is functioning properly. If a cookbook should be used, that info is at least missing from https://wikitech.wikimedia.org/wiki/LVS#Remove_the_service_from_the_load-balancers_and_the_backend_servers [12:55:24] that's a recent addition, only mentioned it cause it !logs on its own [12:55:45] also there should probably be a note mentioning to ping you explicitly about lvs1020 [12:58:12] the lvs1020 thing was because of another unrelated issue [12:58:29] but in general I am updating the docs to suggest what to do in case the cookbook is not used [12:59:55] cool, thanks [13:00:27] godog: something is off with prometheus on drmrs [13:00:28] https://grafana.wikimedia.org/goto/bH8v2V8Sg?orgId=1 [13:00:50] vgutierrez: checking [13:00:58] lvs_realserver_clamper_drmrs.yaml has the expected content on prometheus6002 and curl shows that the instance is able to reach port :2200/metrics [13:01:10] and metrics are there [13:01:13] https://www.irccloud.com/pastebin/NOAmgSv9/ [13:03:01] (docs updated to log if cookbook is not used) [13:04:31] vgutierrez: doesn't look like prometheus has been told to scrape from cache_text but only cache_upload and ncredir [13:04:35] https://prometheus-drmrs.wikimedia.org/ops/targets?search=&scrapePool=lvs_realserver [13:04:42] in fact for those two, data is indeed there [13:04:53] yaml file says otherwise [13:04:56] 06Traffic, 13Patch-For-Review: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466#9879617 (10Vgutierrez) [13:05:06] puppet failed to reload prometheus service?
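A rough sketch of the debugging steps implied above, for anyone following along: check the file_sd target list Puppet wrote out, check that the exporter answers on :2200, and reload the Prometheus instance if its targets page still disagrees with the YAML. Only `systemctl reload prometheus@ops` and port 2200 come from the conversation itself; the targets directory path and the cp6009 hostname are illustrative guesses.

```sh
# On the drmrs Prometheus host (prometheus6002):

# 1. Does the file_sd target list Puppet generated include the cache_text hosts?
#    (directory path is an assumption)
grep -B2 -A2 cache_text /srv/prometheus/ops/targets/lvs_realserver_clamper_drmrs.yaml

# 2. Does the exporter on a text cache node answer? (cp6009 is a placeholder name)
curl -s http://cp6009.drmrs.wmnet:2200/metrics | head

# 3. Prometheus normally picks up file_sd changes by itself; when it does not,
#    reloading the instance forces it (this is what fixed it here).
sudo systemctl reload prometheus@ops
```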
[13:05:07] ok checking [13:06:25] yeah looks like it, "fixed" with systemctl reload prometheus@ops, I'll check why puppet didn't do it [13:10:14] can't find anything obvious right off the bat, though I'm tempted to stick a systemctl reload in puppet [13:10:49] our fault, we didn't check that dashboard after running puppet on prometheus6002 [13:10:57] thx :) [13:11:43] heh it is supposed to pick up changes by itself, and that usually happens, not checking is fair [13:28:17] yikes.. mss-clamping with iptables doesn't work [13:28:25] https://www.irccloud.com/pastebin/zZiK1Dh9/ [13:28:43] given the number of packets that go through the FORWARD chain, those rules should sit on OUTPUT instead [13:28:53] topranks, moritzm ^^ what do you think? [13:30:38] vgutierrez: it all depends, packets being forwarded from one interface to another will go through the forward chain, and not output [13:30:55] topranks: but that's a realserver.. no forwarding happening at all [13:30:59] packets generated by the system itself go through output [13:31:27] yeah exactly I was going to say I don't think forward probably matches this use-case (but not overly familiar with the ncredir boxes) [13:31:48] hacked that ferm rule manually: https://grafana.wikimedia.org/goto/-oP4048Ig?orgId=1 [13:31:53] MSS went back to normal [13:33:51] perhaps mangle table / postrouting is where they should go? [13:35:49] the counters show hits on traffic the host is sending to itself, which I guess is why it was so high to begin with in your graph (localhost traffic will use a really high MSS as the mtu is 65k) [13:35:53] https://www.irccloud.com/pastebin/bEgMs3Lc/ [13:36:19] yeah.. that check runs on the same host [13:37:31] topranks: I just repooled the host, so now the counters also show hits on ens13 [13:38:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041645 [13:38:13] yep [13:39:43] that looks good to me. OUTPUT chain is fine for realservers I think [13:42:13] topranks, moritzm do we export iptables counters to prometheus? [13:44:01] no, we don't. there are some exporters, but nothing currently in use [13:51:32] ack [14:28:24] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880091 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=adbdaf29-9da2-42ea-b64e-fc6d141eaf9e) se... [14:43:40] FIRING: [18x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [14:43:56] noo [14:47:24] 06Traffic, 06SRE, 13Patch-For-Review: Add unique error IDs to 4xx responses - https://phabricator.wikimedia.org/T330973#9880141 (10TheDJ) I randomly found this. It seems this was forgotten about, even though most agreed it was a good idea ? A quick revisit might help bring a result to this or a decision to... [14:48:40] FIRING: [24x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [14:52:21] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880192 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=22e81c7a-3dde-4cd2-9376-bd003c744dc6) se...
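For context on why moving the clamp helped: the FORWARD chain only sees packets being routed through the host, while a realserver originates and terminates the TCP sessions itself, so those packets traverse OUTPUT. A minimal raw-iptables sketch of the idea, not the actual ferm change from the Gerrit patch above; the interface name and the use of --clamp-mss-to-pmtu are illustrative, and TCPMSS is assumed to sit in the mangle table as usual.

```sh
# Clamp MSS on SYNs the host itself sends out; FORWARD would never match
# these because nothing is being routed through the box.
iptables -t mangle -A OUTPUT -o ens13 -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu

# Check the packet counters to confirm the rule is matching real traffic;
# without the -o restriction, loopback self-checks (65k MTU) also hit it,
# as noted above.
iptables -t mangle -L OUTPUT -v -n | grep TCPMSS
```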
[14:53:40] FIRING: [34x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [14:56:45] 10netops, 06Data-Persistence, 06DBA, 06Infrastructure-Foundations, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982#9880204 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=d67744a2-77a0-40dc-aff6-4af804b0b5ce) se... [14:59:06] FIRING: [45x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:04:48] FIRING: [46x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:19:05] FIRING: [46x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:19:05] FIRING: [29x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:19:05] FIRING: [25x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [15:24:17] RESOLVED: [23x] VarnishHighThreadCount: Varnish's thread count on cp1100:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount [16:03:00] wikibugs is toasted? [16:05:43] inflatador: I just sent https://phabricator.wikimedia.org/T365616#9880639 your way [16:06:00] inflatador: let me know if that's clear enough [16:22:38] 10netops, 06Infrastructure-Foundations, 06SRE: Sub-optimal cloud routing for WMCS in eqiad when link fails - https://phabricator.wikimedia.org/T367203 (10cmooney) 03NEW p:05Triage→03Low [16:22:50] 06Traffic: LVSRealserverMSS alert is broken for ferm based hosts - https://phabricator.wikimedia.org/T367204 (10Vgutierrez) 03NEW [16:23:01] can we really have a foo.discovery.wmnet name that points to bar.wikimedia.org, bblack? [16:27:51] mutante: I don't know, sounds odd, but what's the context? [16:31:45] bblack: the context is your comment here https://phabricator.wikimedia.org/T365259#9842712 [16:31:57] a service behind ATS that still has public IP [16:32:10] or would that be a new zone discovery.wikimedia.org [16:32:34] assuming I always use discovery records in ATS config [16:33:10] we are thinking to move gitlab behind cache but without reimaging at first [16:33:14] basically following your advice there [16:33:47] public only in the sense that it's in that VLAN.. we would still drop packets not coming from CACHES [16:35:23] or we could forget about the entire discovery record and just put the map in ATS config. 
only difference would be ATS config edit instead of DNS edit to change the backend [16:38:55] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880841 (10cmooney) [16:39:52] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880864 (10cmooney) [16:40:22] mutante: yeah, this is an odd/special case... [16:40:36] my thinking on this is basically something like: [16:41:52] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9880893 (10cmooney) [16:41:56] 1) yes, define an ATS backend that points at gitlab.wikimedia.org (discovery doesn't serve much purpose here, esp given it's likely temporary) [16:43:54] 2) Define a new public IP in whichever side makes sense (I guess high-traffic1/text-cluster?) and set that up via LVS to point into the text cluster, kind of like "text-next" is in geo-resources, etc... [16:44:25] 3) for the LVS part, have it have two services: git port that runs directly from lvs->gitlab, and https port that goes into the text cluster for cache defenses. [16:44:40] 4) Then after doing/testing all this, try to move the public hostname to this new IP [16:45:08] 1) ACK, I also don't see much of a difference whether a failover would be in DNS or in backend.yaml, for this case. [16:45:34] sometime later if this all works out, we could look at moving the gitlab realserver to a private vlan/hostname [16:45:38] 2-3) We are thinking now that loadbalancing doesn't add anything as long as there is only one backend per service [16:45:57] so we were about to copy the existing phabricator setup [16:46:05] which is behind ATS but no geoip/LVS [16:46:26] the thinking was "what's the point if it's active/passive and only 1 machine" [16:46:38] because there's not just HTTPS traffic, was what I thought [16:46:42] at least for the gerrit case that's true [16:46:46] 4) yea, we would reimage and do bookworm at the same time [16:47:15] yea, but both https and ssh service would always just have same single backend per DC [16:47:26] yeah but our caches can't route ssh [16:47:35] lvs can, at layer 4 [16:47:45] ah, important detail, ack [16:48:24] 1) already helped me for today.. so that I won't bother with making discovery records right now [16:48:29] and I can start with something [16:48:30] so, to split off ssh by a different pathway (so that https goes via-caches, but ssh direct), the lvs service that manages the public IP has to have an ssh service [16:49:06] (which backends to just the gitlab machine directly. it's not doing any real loadbalancing, its purpose is just to split the ssh port from the https port) [16:49:31] but if we just do it like phab....
phab is a CNAME to our normal text IP that's shared by enwiki, etc [16:49:43] I figured we probably don't want to make the ssh port available on all those other hostnames [16:49:53] it just seems weird and ripe for excess abuse or something [16:50:20] so this is what leads me around to saying "maybe just give it its own separate public IP in high-traffic1, which maps into the text-cluster anyways, kinda like text-next" [16:51:14] the other other alternative is we can split hostnames instead of splitting ports [16:51:18] hmm, taking notes. *nod* so to complicate it some more. we'd need like 6 services. gitlab, gitlab-replica-a, gitlab-replica-b and each for https and ssh [16:51:36] really? [16:51:44] (for the public?) [16:52:16] the reason given so far is that the replicas are used for testing and when you do security upgrades you need to access their web UI to create access tokens [16:52:42] tbd I guess [16:52:45] do they have their own IPs/machines now, or is it all derived from host-header/SNI on gitlab.wm.o? [16:53:16] they are like aliases for machines [16:53:23] yea, their own IP/machine [16:53:29] all in public VLAN [16:53:43] host gitlab-replica-a.wikimedia.org [16:53:43] Host gitlab-replica-a.wikimedia.org not found: 3(NXDOMAIN) [16:53:49] ? [16:53:58] gitlab-replica.wikimedia.org [16:54:02] gitlab-replica-old.wikimedia.org [16:54:04] ah ok [16:54:12] I skipped the part that we want to rename that [16:54:17] because "old" is a bad name [16:54:17] sure [16:54:30] it's not really old, it's just "the other one" [16:54:38] well for that matter, this is all going to get naming-complicated, since currently the machine itself uses the main public hostname [16:55:01] it would've been better, in hindsight, to name the machine something else and just dns-map the public name into it or something [16:55:13] gitlab1001.wikimedia.org with gitlabd.wikimedia.org pointing into it or whatever [16:55:17] I want to start with just the simple renaming now. [16:55:35] well the way it is now, we eventually run into conflicts [16:55:56] splitting the ssh hostname would help as well, then we don't have to add a new text-cluster IP either [16:56:41] the machines have 2 IPs, they have a hostname and a service name. [16:56:55] on the same interface [16:58:30] "service" name meaning the ssh part? [16:58:41] so gitlab.wikimedia.org is also gitlab2002.wikimedia.org and so on [16:59:09] oh, ok [16:59:15] inet 208.80.153.7/27 and inet 208.80.153.8/32 [16:59:20] I assumed from earlier that the main machine was actually named gitlab.wikimedia.org [16:59:20] both on interface eno1 [16:59:38] no, it's both [16:59:51] we treat the IPs differently in firewall rules [16:59:58] and where things are listening [17:00:05] ok [17:00:10] but https+ssh share an IP, right? [17:00:13] and a hostname? [17:00:14] yes [17:00:19] yes [17:00:20] (well, public hostname) [17:00:21] ok [17:00:52] the cleanest thing would be to make a new hostname+IP for the ssh part, really [17:01:09] gotcha [17:01:15] then we don't have to do the extra high-traffic1 IP, and we don't have to lvs-split ports, etc [17:01:47] so if we were starting fresh but going for the "behind the caches for https" design, the layout would've looked something like this: [17:02:31] 1) There's the actual machine: say gitlab1001.wikimedia.org. it's on a public vlan with a public primary IP, but it ferm-denies outside traffic to that IP. [17:03:47] 2) There's the https hostname like "gitlab.wikimedia.org".
We map that to the text cache (in dns, using dyna, like phab.wm.o), and have an ATS definition that maps that traffic back to gitlab1001.wikimedia.org as the final destination. [17:04:36] 3) There's a git-ssh hostname separately, like "gitlab-ssh.wikimedia.org". We allocate this a public service IP and put it directly on eno1 like you're doing now, so that it still goes direct (and allow it in ferm rules) [17:06:17] makes sense, that's a very helpful summary [17:06:26] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9881020 (10VRiley-WMF) Swapped 40Base-LR4 in port et-0/0/53. [17:06:37] to get there from here in stages, one approach would be: [17:06:50] phabricator just uses phabricator.discovery.wmnet and CNAMEs it to one of the backends, not even dyna [17:07:17] yeah that's on the backend side [17:07:18] in that "services with multiple backends but without geoip" section in wmnet [17:07:21] and mostly pointless, I think [17:07:37] phab happens to live in a private subnet [17:07:51] it was only "yea, then we only have to make DNS changes and not touch ATS to change it" but that was it [17:08:02] right [17:08:15] anyways [17:08:20] to get there from here in stages, one approach would be: [17:08:44] also we used to think we always want to failover together with the "main / active DC" for appservers, but now we don't care about that anymore [17:08:52] listening [17:09:01] 1) Make a new hostname "gitlab-ssh.wikimedia.org", point it at the same place as gitlab.wikimedia.org for now (CNAME or just same-IP). [17:09:24] 2) Get people to switch their git configs to the new name and set a cutoff date for transitioning. [17:10:28] 3) Then define the ATS -> gitlab1001 or whatever backending part on the text ATS, and after testing, change "gitlab.wikimedia.org" to point at dyna (text cluster) instead of directly at the gitlab machine (which will fully break ssh to that hostname, and put stuff behind the caches) [17:11:17] bblack: ofc we aren't there now, but in the future we could have haproxy terminate TCP for SSH, and I think that would be useful [17:11:36] yeah, maybe [17:11:56] I worry about adding new attack surfaces to the caches in general though. 
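To make the ATS side of step 2 above concrete, the end state is just a remap from the public hostname to the real backend host. Sketched below in raw remap.config terms; in practice this would be expressed through the Puppet/hieradata that generates the text-cluster ATS configuration rather than edited by hand, and gitlab1001.wikimedia.org is the hypothetical machine name from the discussion, not an existing host.

```
# remap.config-style sketch (illustrative only):
# requests arriving at the text caches for gitlab.wikimedia.org get
# routed to the single backend machine.
map https://gitlab.wikimedia.org/ https://gitlab1001.wikimedia.org/
```

On the DNS side, gitlab.wikimedia.org would then resolve to the text cluster the same way phab.wikimedia.org does, while the separate git-ssh hostname keeps its own directly-routed service IP.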
[17:12:08] and stats/traffic confusion, etc [17:12:14] it is totally possible though [17:12:20] could be another haproxy instance on the nodes, even [17:12:36] but at the end of the day, there's only one final backend [17:12:41] but this way we could still re-use a bunch of things like requestctl ipblocks [17:12:54] you could put the haproxy revproxy on the gitlab box itself and go direct and get the same effects [17:13:00] true [17:13:20] well, most of the same effects, there's a much smaller peak pps you can handle that way ;) but yeah [17:13:42] eh [17:14:11] if it's going to overwhelm one machine on raw pps, I'd rather not have this L4 ssh traffic inflicting pain on our wikis' CDN [17:15:02] there's a lot of different ways we could do this class of things [17:15:28] but arguably our CDN should really be focused on the CDN use-case, not being a general firewall-defense-point for ancillary core-dc-only services [17:15:53] it just happens to be convenient to try to get that utility for the https part, and then annoying that ssh happens to ride along with it [17:16:10] arguably true, but that also means the few services that do need to accept TCP and not HTTPS are re-inventing several anti-abuse wheels [17:16:28] yeah but we could share the anti-abuse via config, rather than by sharing machines/resources [17:16:45] or! [17:16:58] [and I hate to say this out loud, because it will make some people cringe] [17:17:25] we could re-split the old "misc" varnish cluster back off of its current merger with 'text' (which would greatly simplify VCL woes) [17:17:34] and just make it a smaller cluster that only exists in the core DCs [17:18:11] and maybe even simplify it a bit (it doesn't really need ATS, or very much complex VCL stuff either, or very much TTL max... most "misc" backends are uncacheable anyways) [17:18:39] it doesn't even have to be varnish. it could maybe be just haproxy with its minimal caching+routing capabilities [17:19:05] (but for now varnish would be nice because requestctl) [17:19:54] we folded up misc into text a long long time ago, when we went through a giant consolidation of like 6-7 different clusters down to just text+upload [17:20:16] we thought it would be simpler eventually, but that never panned out [17:20:39] hence the VCL-switching nightmare that still exists [17:21:20] anyways, it's a direction to ponder [17:22:04] could start simple: just buy a few redundant cache_misc boxes for each core site, at much lower specs than our current cp nodes (no nvme storage, more-reasonable cpu/ram, etc) [17:22:26] move the cache_misc off of cache_text and into there, and maybe kill the ATS part to boot. [17:27:36] (and then, sure, if we want cache_misc's haproxy to handle ssh traffic, fine) [17:27:57] eventually it could just diverge more and more to be a specialized defensive frontend for core-dc ancillary services [17:28:28] it doesn't have to stay lockstep on architecture with the text/upload CDN. only to the degree that it helps share defensive metadata/config. [17:32:28] that's an interesting idea [17:33:10] thanks for this chat, Brandon. that was very helpful. Processing it. I took notes for our next meeting about this and went from "we need the whole LVS setup times 6" to "nah, just copy the phab setup and forget about all that, it's so much simpler".. to now arrive at your summary. and you know why? because it's the OLD Phab setup when Phab still had ssh, and then we stopped having that. so.. [17:33:16] of course.. 
ack [17:43:31] yeah the other stuff about cache_misc, I wouldn't block any of this on that. [17:43:38] just food for future thought [17:53:08] I kept that part of the chat but under a separate heading "future thoughts" [18:24:45] mutante: I suggest also thinking about experimenting with haproxy for proxying tcp to ssh [18:25:13] it's one of the few things that i will refer to as 'good software' [18:27:48] +1 [18:28:05] noted! [18:30:13] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9881669 (10cmooney) p:05Medium→03Low Thanks for the help with this @VRiley-WMF. The link has now b... [21:05:59] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9882267 (10VRiley-WMF) You're welcome @cmooney We do have spares if they are needed in the future. Clos... [21:06:10] 10netops, 06DC-Ops, 06Infrastructure-Foundations, 10ops-eqiad, 06SRE: hw troubleshooting: Faulty 40GBase-LR4 link from cloudsw1-d5-eqiad to cloudsw1-f4-eqiad - https://phabricator.wikimedia.org/T367199#9882268 (10VRiley-WMF) 05Open→03Resolved [23:16:25] FIRING: SystemdUnitFailed: haproxy_stek_job.service on cp2039:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
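On the haproxy-for-ssh suggestion above, the core of it is just a TCP-mode frontend/backend pair. A minimal sketch under assumptions: the section names, the bind address, and the single-backend layout are illustrative, and the global/defaults sections with timeouts are omitted.

```
# haproxy.cfg fragment: pass git-over-SSH through at layer 4 to one backend.
frontend git_ssh
    mode tcp
    # in practice this would bind only the dedicated service IP,
    # so the host's own sshd on the primary IP is unaffected
    bind :22
    default_backend gitlab_ssh

backend gitlab_ssh
    mode tcp
    # gitlab1001 is the hypothetical realserver name from the discussion
    server gitlab1001 gitlab1001.wikimedia.org:22 check
```

Per the discussion above, the same anti-abuse configuration (e.g. IP blocks) could then be shared via config whether this runs on the gitlab host itself or on a separate frontend tier.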