[06:41:03] Traffic, collaboration-services, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10824976 (Jelto)
[06:50:42] Traffic, collaboration-services, Release-Engineering-Team (Radar): Separate Gerrit https and ssh/git hostnames - https://phabricator.wikimedia.org/T394271#10824988 (Jelto) >>! In T394271#10822942, @bd808 wrote: >> a new hostname like gerrit-git.wikimedia.org (tbd) for SSH/Git. > > Naming is always t...
[07:13:55] Traffic, Data-Engineering-Radar, Observability-Logging, Patch-For-Review: Shutdown varnishkafka instances - https://phabricator.wikimedia.org/T393772#10825044 (Fabfur)
[10:19:47] hi! nginx.service fails to reload on ncredir hosts, is that a known issue? https://phabricator.wikimedia.org/P76219
[10:20:16] I noticed it via the failed puppet runs: https://puppetboard.wikimedia.org/nodes?status=failed
[10:34:16] I don't think so
[10:34:19] thx for the ping moritzm
[10:36:36] I vaguely remember we ran into this before
[10:37:38] to be honest I don't know why nginx is refusing to use that ocsp response data
[10:37:50] or maybe the map_hash_max_size note is a red herring and I've only seen that one before and it didn't really have any impact
[10:38:04] no impact
[10:38:10] the issue with the reload is the OCSP response file
[10:43:05] so acme-chief is failing somehow to get the OCSP data for non-canonical-redirect-3
[10:43:08] that's new
[10:49:10] uh oh...
[10:49:22] Let's Encrypt is getting rid of OCSP already?
[10:50:24] I thought that was in August or so
[10:51:09] it looks like E6 is not using OCSP
[10:51:30] root@acmechief2002:~# openssl x509 -issuer -ocsp_uri -noout -in /var/lib/acme-chief/certs/non-canonical-redirect-3/live/ec-prime256v1.crt
[10:51:30] issuer=C = US, O = Let's Encrypt, CN = E6
[10:51:30] root@acmechief2002:~# openssl x509 -ocsp_uri -issuer -noout -in /var/lib/acme-chief/certs/non-canonical-redirect-5/live/ec-prime256v1.crt
[10:51:30] http://e5.o.lencr.org
[10:51:30] issuer=C = US, O = Let's Encrypt, CN = E5
[10:51:54] sigh.. this is bad
[10:52:46] May 7, 2025
[10:52:46] Prior to this date we will have added CRL URLs to certificates
[10:52:46] On this date we will drop OCSP URLs from certificates
[10:53:06] ok...
[10:53:16] we totally missed that
[10:56:32] Traffic, SRE: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10825755 (Vgutierrez) Open→In progress p:Triage→Unbreak! Let's encrypt already stopped including OCSP urls in new certificates and it's already caus...
[10:58:23] luckily for haproxy is a NOOP
[10:58:40] the .ocsp won't be there anymore and that's it
[10:59:26] moritzm: modules/install_server/templates/apt.wikimedia.org.conf.erb: ssl_stapling_file /etc/acmecerts/apt/live/ec-prime256v1.ocsp;
[10:59:34] moritzm: can you take care of removing that one?
[11:02:36] ack, I'll test it on apt2002 and will prep a patch
[11:07:02] fabfur: please review https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146552 ASAP
[11:07:28] checking
[11:08:07] LGTM
[11:14:41] could I please also get a pair of eyes on https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146555 ?
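Editor's note on the 10:51 openssl output: the E6-issued certificate prints no OCSP URI line, while the E5-issued one prints http://e5.o.lencr.org. A minimal Python sketch of that check follows, assuming the two-line output format pasted in the log. The function names parse_x509_fields and has_ocsp_uri are hypothetical and are not part of acme-chief or Puppet; the sample strings are copied from the log.

```python
from typing import Dict, Optional


def parse_x509_fields(output: str) -> Dict[str, Optional[str]]:
    """Pull issuer and OCSP URI out of `openssl x509 -issuer -ocsp_uri -noout` text.

    The OCSP URI is simply absent from the output when the certificate
    carries none, so a missing URL line maps to None.
    """
    issuer: Optional[str] = None
    ocsp_uri: Optional[str] = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("issuer="):
            issuer = line[len("issuer="):]
        elif line.startswith(("http://", "https://")):
            ocsp_uri = line
    return {"issuer": issuer, "ocsp_uri": ocsp_uri}


def has_ocsp_uri(output: str) -> bool:
    """True when the certificate advertises an OCSP responder."""
    return parse_x509_fields(output)["ocsp_uri"] is not None


# Sample outputs copied from the log above.
E6_OUTPUT = "issuer=C = US, O = Let's Encrypt, CN = E6\n"
E5_OUTPUT = "http://e5.o.lencr.org\nissuer=C = US, O = Let's Encrypt, CN = E5\n"

if __name__ == "__main__":
    for label, text in (("E6", E6_OUTPUT), ("E5", E5_OUTPUT)):
        print(label, "has OCSP URI:", has_ocsp_uri(text))
```

A check along these lines could gate whether a stapling file is expected for a given certificate, though the fix discussed in the channel was to remove the ssl_stapling_file configuration.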
[11:16:00] I think it's ok
[11:16:14] ah, vg already +1ed
[11:16:30] cheers
[11:17:41] moritzm: merge mine if showed up on your puppet-merge session :)
[11:18:02] there was only mine, I'll drop a note when my merge is completed
[11:18:12] vgutierrez: you can merge now
[11:18:25] merging, thx <3
[11:28:29] moritzm: you need to adjust apt.wm.o monitoring as well
[11:29:03] ah right
[11:31:25] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1146562
[11:31:28] +1
[11:31:37] cheers
[11:32:31] want us to roll it out?
[11:32:37] can do while we wait on other stuff
[11:33:08] Traffic, SRE, Patch-For-Review: Research and respond to Let's Encrypt's intent to deprecate OCSP in favour of CRLs - https://phabricator.wikimedia.org/T370821#10825867 (Vgutierrez) p:Unbreak!→High
[11:33:26] I'm just waiting for CI to complete, will merge when that is in
[11:33:54] ok. thanks for the ping today!
[11:35:58] which totally affirms your proposal to alert harder on failing Puppet runs!
[11:36:09] I didn't say that :D
[11:36:13] you did!
[11:36:33] because it's so true :-)
[11:36:45] I think that is affirmed every week except I don't know what to do about it so I am silent :P
[11:37:06] some check where all members of a given role are failing, but not the rest would be very useful as a starter
[11:37:42] given that all ncredir hosts were failing, but nothing else
[11:38:34] yeah even that would help. anything that helps us break away from discovering (or polling) this stuff vs being alerted. most of the times though, site-wide failing Puppet runs have a common cause, but not always.
[11:42:03] I forced a puppet run on alert1002, it should recover soonish
[11:42:11] thanks!
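Editor's note on the 11:37 suggestion ("all members of a given role are failing, but not the rest"): a rough Python sketch of that rule is below. It is only an illustration under stated assumptions. The input is assumed to be a list of (hostname, role, failed) tuples gathered from some Puppet reporting source; the function name roles_fully_failing, the min_hosts threshold, the role labels, and the ncredir hostnames in the sample are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (hostname, role, last_puppet_run_failed)
Report = Tuple[str, str, bool]


def roles_fully_failing(reports: Iterable[Report], min_hosts: int = 2) -> List[str]:
    """Return roles where every host is failing while the fleet as a whole is not.

    A site-wide failure (every host failing) returns an empty list, because
    that case points at a common cause rather than a role-specific one.
    Roles with fewer than `min_hosts` members are skipped to reduce noise.
    """
    by_role: Dict[str, List[bool]] = defaultdict(list)
    for _host, role, failed in reports:
        by_role[role].append(failed)

    all_flags = [flag for flags in by_role.values() for flag in flags]
    if all_flags and all(all_flags):
        return []  # everything is failing: not a role-specific problem

    return sorted(
        role
        for role, flags in by_role.items()
        if len(flags) >= min_hosts and all(flags)
    )


# Hypothetical sample resembling the incident: every ncredir host fails,
# apt2002 is healthy, and a single alerting host fails on its own.
SAMPLE: List[Report] = [
    ("ncredir1001", "ncredir", True),
    ("ncredir2001", "ncredir", True),
    ("apt2002", "apt", False),
    ("alert1002", "alerting", True),
]

if __name__ == "__main__":
    print(roles_fully_failing(SAMPLE))
```

The min_hosts threshold is a design choice to keep single-host roles from alerting on every one-off failure; whether that trade-off suits a given fleet is an open question.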
[13:16:12] Traffic, SRE, Patch-For-Review: Lower geodns TTLs for dyna.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826148 (Vgutierrez) p:Triage→Medium
[13:18:33] Traffic, SRE, Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826152 (ssingh)
[14:24:32] Traffic, SRE, Patch-For-Review: Lower geodns TTLs for dyna.wm.org and upload.wm.org from 300s (5 min) to 180s (3 min) - https://phabricator.wikimedia.org/T394312#10826667 (ssingh)
[16:53:45] Traffic, Experimentation Lab: Include all CDN SANs on eventgate-analytics-external.discovery.wmnet:4692 TLS certificate - https://phabricator.wikimedia.org/T394437 (Vgutierrez) NEW
[16:56:55] Traffic, Data-Platform, Data-Platform-SRE, Experimentation Lab: Include all CDN SANs on eventgate-analytics-external.discovery.wmnet:4692 TLS certificate - https://phabricator.wikimedia.org/T394437#10827481 (dr0ptp4kt)
[17:10:35] Traffic, Data-Platform, Data-Platform-SRE, Experimentation Lab: Include all CDN SANs on eventgate-analytics-external.discovery.wmnet:4692 TLS certificate - https://phabricator.wikimedia.org/T394437#10827517 (dr0ptp4kt)
[17:25:06] Traffic, Data-Platform, Data-Platform-SRE, Experimentation Lab: Include all CDN SANs on eventgate-analytics-external.discovery.wmnet:4692 TLS certificate - https://phabricator.wikimedia.org/T394437#10827569 (Vgutierrez)
[17:34:26] Traffic, Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: Clean-up varnishkafka webrequest leftovers in Hadoop-world - https://phabricator.wikimedia.org/T394011#10827630 (JAllemandou)
[17:35:22] Traffic, Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: Clean-up varnishkafka webrequest leftovers in Hadoop-world - https://phabricator.wikimedia.org/T394011#10827632 (JAllemandou)
[17:48:39] Traffic, Data-Engineering (Q4 2025 April 1st - June 30th), Patch-For-Review: Clean-up varnishkafka webrequest leftovers in Hadoop-world - https://phabricator.wikimedia.org/T394011#10827714 (JAllemandou)
[19:13:27] Traffic, DC-Ops, ops-esams, SRE: lvs3009 NIC HW issue (Broadcom, eno8303) - https://phabricator.wikimedia.org/T393616#10827984 (BCornwall) Open→Resolved I'm not seeing any errors in the kernel log, anomalies in the graphs, or outputs in `getsel`. I'll go ahead and resolve this. Thanks...
[19:18:07] Traffic, DC-Ops, ops-eqiad, SRE: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10827991 (BCornwall)
[19:54:16] Traffic, [Archived]Wikidata Dev Team, Prod-Kubernetes, SRE, and 3 others: Frequent 500 Errors and Timeouts When Adding Statements to New Item or Lexeme-typed Properties - https://phabricator.wikimedia.org/T374230#10828104 (ArthurPSmith) Confirming this works for me now - https://www.wikidata....