[06:41:18] <_joe_> FWIW sre.loadbalancer.restart-pybal should DTRT and not restart two pybals at the same time in the same datacenter
[06:42:16] <_joe_> but yes, since john's changes had been reverted without notice or discussion, that cookbook is half-broken
[08:10:49] Any problem with us merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/900704/ today ? Anything I should know before merging as far as putting it into production ?
[08:58:34] Traffic, Infrastructure-Foundations, SRE, Patch-For-Review: Serve an HTTP response for measurement domains directly from Varnish - https://phabricator.wikimedia.org/T332028 (JameelKaisar) a:JameelKaisar
[08:59:43] claime: no problem AFAIK
[09:00:13] Traffic, Performance-Team (Radar): unwind the Puppetized /etc/hosts override of statsd.eqiad.wmnet - https://phabricator.wikimedia.org/T239862 (fgiunchedi) Open→Resolved a:fgiunchedi The hardcoded `statsd.eqiad.wmnet` entry is gone and I can confirm we're receiving statsd traffic on v6 too, t...
[09:00:45] vgutierrez: No special instructions as far as restarts go or whatever ? Just let puppet do its job over 30 minutes?
[09:01:05] no restart needed
[09:05:44] a'ight
[09:05:50] * claime cracks knuckles
[09:05:57] _joe_: ^^heads up
[09:08:55] <_joe_> claime: I'd mostly just disable puppet everywhere, run puppet in a single server, then check we didn't break rounting
[09:09:00] <_joe_> *routing :)
[09:09:19] yeah.. not melting the text cluster is nice :)
[09:09:23] ;P
[09:09:24] <_joe_> basically test a request for www.wikidata.org gets a response from the appservers
[09:09:33] <_joe_> and test.wikidata.org from mw on k8s
[09:09:42] <_joe_> then we can test multidc still works for mw on k8s
[09:15:15] claime: backend.yaml is shared across clusters, so ideally you wanna disable it on upload too, ensure that's a NOOP there and reenable puppet in A:cp-upload
[09:15:26] vgutierrez: ack
[09:21:51] Looks ok on cp2028
[09:22:08] (looking at sudo atslog-backend)
[09:23:35] I'm gonna re-enable puppet on cp-upload
[09:23:57] claime: looking good
[09:26:28] I'm hitting cp6009, I'll re-enable puppet on it to test the routing
[09:26:44] claime: I'd recommend hitting ulsfo :)
[09:26:57] instead of one of the busiest DCs at this time of the day
[09:27:49] vgutierrez: ssh tunnel ?
[09:27:58] or is there an easier way?
[09:28:01] curl?
[09:28:26] Ah, just hit the cp node with -H host:test.wikidata.org ?
[09:29:10] or --connect-to
[09:29:56] claime: curl --connect-to en.wikipedia.org:443:text-lb.ulsfo.wikimedia.org https://en.wikipedia.org -v -o /dev/null
[09:30:13] Can I hit a specific cp-node that way ?
[09:30:23] nope, you just a specific DC
[09:30:25] *just
[09:30:40] Or do I run it once, and I should stick to the same one on subsequent reqs?
[09:30:46] yes
[09:30:53] as long as you don't change the client IP
[09:30:56] A'ight
[09:31:03] Let's hope I don't get cgnat dropped :P
[09:31:28] arg, ISPs with CGNAT here let you get a public IP for free if you ask for it
[09:31:41] LTE ISPs don't here
[09:31:48] oh, you're on LTE, nevermind then
[09:41:49] Looks good on cp4037
[09:43:38] I'll re-enable puppet on ulsfo and watch traffic graphs for a bit to make sure we're ok
[09:47:26] (PurgedHighEventLag) firing: (2) High event process lag with purged on cp3054:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:50:47] hmmm looking
[09:52:26] (PurgedHighEventLag) firing: (2) High event process lag with purged on cp3054:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[09:52:28] It doesn't have my commit fwiw
[09:52:50] nah nah, totally unrelated
[09:56:36] XioNoX: ^^ that seems like a network link fart?
[09:57:05] not sure what this error means
[09:57:09] s/error/alert
[09:57:48] so purged is complaining about not being able to reach kafka-main1002 and 1003 on port 9093
[09:59:22] <_joe_> vgutierrez: check with elukey too, there's work ongoing on the kafka-main cluster
[09:59:28] there was a small one at 09:29 https://grafana.wikimedia.org/d/m1LYjVjnz/network-icmp-probes?orgId=1&var-site=esams&var-target_site=eqiad&var-role=cr&var-family=All
[09:59:33] but otherwise looks fine
[09:59:55] yup.. it should be back soon
[10:02:18] vgutierrez: I am doing a roll restart, but purged should be ok now
[10:02:25] do you still see errors?
[10:02:25] elukey: ack
[10:02:31] nope
[10:02:52] super
[10:07:26] (PurgedHighEventLag) resolved: (4) High event process lag with purged on cp3054:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[10:08:16] All cp-text_ulsfo now have the mw-on-k8s lua patch
[10:08:20] looks ok afaict
[10:15:56] RW/RO keeps working as usual
[10:16:20] quick check on https://grafana.wikimedia.org/goto/eQ21Uuf4k?orgId=1
[10:25:15] <_joe_> vgutierrez: we're verifying the ro/rw split right now
[10:25:23] <_joe_> ulsfo goes RO to codfw anyways
[10:25:44] yep
[11:03:11] Ok we've tried it every which way, the new patch works
[11:03:16] Re-enabling puppet for cp-text
[11:04:11] <_joe_> the patch actually fixes something we didn't realize lol
[11:04:37] Well we kinda did realize
[11:04:53] We saw we were getting less traffic to mw-api-ext after adding test.wikidata
[11:04:57] We just didn't know why
[12:51:09] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (ayounsi) How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link goes down?
[14:11:37] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye
[14:46:52] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs4010.ulsfo.wmnet with OS bullseye completed: - lvs4010 (**PASS**) - Downtimed on Icinga/Alertmanager - Disabled...
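For reference, the per-DC routing check discussed between 09:26 and 09:30 can be scripted. This is only a minimal sketch: it assumes the `text-lb.<site>.wikimedia.org` naming from the 09:29:56 curl command also applies to other sites, and `connect_to_args` is a hypothetical helper name, not an existing tool.

```python
import shlex


def connect_to_args(host: str, site: str) -> list[str]:
    """Build a curl argv that sends an HTTPS request for `host` to one site.

    Assumption: the load balancer is named text-lb.<site>.wikimedia.org,
    as in the command quoted in the log.
    """
    lb = f"text-lb.{site}.wikimedia.org"
    return [
        "curl",
        "--connect-to", f"{host}:443:{lb}",  # same form as the logged command
        f"https://{host}",
        "-v",                # verbose, to inspect response headers
        "-o", "/dev/null",   # discard the response body
    ]


if __name__ == "__main__":
    # The two hostnames _joe_ suggested testing.
    for host in ("www.wikidata.org", "test.wikidata.org"):
        print(shlex.join(connect_to_args(host, "ulsfo")))
```

As noted in the log, this pins a DC rather than a specific cp node; repeated requests should reach the same node only as long as the client IP does not change.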
[14:57:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Avoid sub-optimal routing from CR routers to EVPN destinations - https://phabricator.wikimedia.org/T332781 (cmooney) >>! In T332781#8741660, @ayounsi wrote: > How does this compare to taking iBGP down between LEAF1 to SPINE2 if the link g...
[15:19:04] I was trying to figure out how to verify on a kafka node if the new TLS certificate (PKI) is leading to handshake failures, and I found https://www.golinuxcloud.com/troubleshooting-tls-failures-wireshark/
[15:19:17] `tshark -f "port 9093" -Y "ssl.record.content_type == 21"` seems to work nicely
[15:20:35] (lemme know if you have better / more precise way of doing it)
[15:21:25] openssl s_client?
[15:22:24] vgutierrez: yeah sorry, I didn't add that I'd need to figure out if any client of kafka-jumbo1001 fails the handshake due to the new tls cert
[15:22:36] the hostname is the first one in the connection string, if it fails it moves to another one
[15:22:51] but so far I have only one broker with the new TLS cert
[15:23:00] and I wanted some confirmation before proceeding with the rest
[15:23:15] this is why I am using tshark
[15:23:37] yeah.. capturing traffic is the way to go then
[15:23:46] openssl works fine, but if the client has only the puppet ca set it is a problem
[15:24:41] -no-CAfile + -CAfile
[15:25:05] and -no-CApath
[15:25:11] sorry I didn't get the suggestion
[15:25:37] just to stop openssl from loading system wide root CAs
[15:25:50] ah okok
[15:30:24] thanks :)
[15:35:20] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) Sounds good to me. This is what we need to do with cloudcontrol2004-dev: * figure out how to...
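For reference, the `openssl s_client` suggestion from 15:24-15:25 can be written out as a full command. This is a minimal sketch only: the broker hostname and CA bundle path are placeholders, and `s_client_args` is a hypothetical helper name. It checks the broker certificate against a single CA bundle, as a client that trusts only that CA would; it does not replace the tshark capture for spotting which real clients fail.

```python
import shlex


def s_client_args(host: str, port: int, ca_file: str) -> list[str]:
    """Build an `openssl s_client` argv that trusts only `ca_file`.

    -no-CAfile and -no-CApath stop openssl from loading the system-wide
    root CAs, so verification relies on the given bundle alone.
    """
    return [
        "openssl", "s_client",
        "-connect", f"{host}:{port}",
        "-CAfile", ca_file,
        "-no-CAfile",
        "-no-CApath",
    ]


if __name__ == "__main__":
    # Placeholder values, not taken from the log.
    print(shlex.join(s_client_args("broker.example.org", 9093, "/path/to/ca.pem")))
```

The verification result is reported in the `Verify return code` line of the s_client output.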
[15:35:38] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (Papaul) Third batch |Host|U space|Existing port|New port| |cloudcephosd2001-dev|3|asw-b1-codfw ge-1/0/...
[16:43:25] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) All remaining (non public-vlan) hosts have been moved and look good to me (reachable, MAC addr...
[18:05:43] HTTPS, Traffic, SRE, Tracking-Neverending: HTTPS Plans (tracking / high-level info) - https://phabricator.wikimedia.org/T104681 (BCornwall)
[18:05:55] HTTPS, Traffic, SRE: Enable HSTS on store.wikimedia.org for HTTPS - https://phabricator.wikimedia.org/T128559 (BCornwall) Stalled→Declined I'm going to decline this as it's not possible. I will follow it up with T333591 which tracks moving the domain.
[18:37:16] Traffic, SRE: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (ssingh)
[20:27:39] Traffic, SRE: Performance implications of buffer sizes in Apache Traffic Server intercept plugins - https://phabricator.wikimedia.org/T287847 (BCornwall)
[20:39:48] Traffic, SRE, Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (BCornwall)
[20:40:03] Traffic, SRE, Patch-For-Review: increase of network errors on alert1001 after certspotter has been enabled - https://phabricator.wikimedia.org/T303593 (BCornwall) p:Medium→Triage
[20:42:40] Traffic, SRE, observability, Upstream: flapping icinga Letsencrypt TLS cert alerts around renewal time - https://phabricator.wikimedia.org/T293826 (BCornwall)
[20:48:20] Traffic, DNS, Infrastructure-Foundations, SRE-tools, netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (BCornwall)
[20:52:53] Traffic, WMF-Legal, Performance-Team (Radar), Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) p:Medium→Triage
[20:52:56] Traffic, WMF-Legal, Performance-Team (Radar), Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall)
[20:53:48] Traffic, WMF-Legal, Performance-Team (Radar), Privacy: Add no-transform to Cache-Control header - https://phabricator.wikimedia.org/T218618 (BCornwall) I've adjusted the title/body to reflect this change in ticket scope. Traffic still needs to determine whether or not to pursue this.
[20:58:41] Traffic, SRE, Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (BCornwall) p:Medium→Triage
[20:58:51] Traffic, Patch-For-Review: Update certspotter - https://phabricator.wikimedia.org/T204993 (BCornwall)
[21:04:13] Traffic, Patch-For-Review: Test ESI feasibility with current Varnish installation - https://phabricator.wikimedia.org/T308799 (BCornwall) Open→Stalled p:High→Triage
[21:05:13] Traffic, RESTBase-API, Documentation: I am hitting a rate limit on REST API endpoint - https://phabricator.wikimedia.org/T307610 (BCornwall) Open→Stalled p:Medium→Triage
[21:14:43] Traffic, SRE, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BCornwall) Open→In progress p:Medium→High a:BBlack→BCornwall Since there wasn't any feedback on this, I guess I'll claim this ticket since I'm actively trying to fix this. I'll ask me...
[21:14:54] Traffic, PM: Clean up Traffic tag/workboard - https://phabricator.wikimedia.org/T289787 (BCornwall)
[21:31:34] Traffic: Improve runbooks for OCSP-related alerts - https://phabricator.wikimedia.org/T292397 (BCornwall)
[21:33:52] Traffic, DNS, Infrastructure-Foundations, SRE-tools, netbox: sre.dns.netbox cookbook dosn't support period terminated domains - https://phabricator.wikimedia.org/T306809 (Volans) Correct, and we've already the first validators in netbox-next that will be released to prod shortly so this can b...