[07:58:07] vgutierrez: o/ is it ok if I rollout https://gerrit.wikimedia.org/r/c/operations/puppet/+/901118 ? The idea that I have is to disable puppet on all cp nodes, try on one, check for errors in logs + grafana dashboard and re-enable
[08:01:00] Sounds good
[08:03:16] super, I'll do it in a bit
[08:14:32] elukey: what I don't know if that's gonna require a purged restart though
[08:15:17] hmmm purged is one of those systemd units that we automatically restart on config changes
[08:18:29] Traffic, SRE-Sprint-Week-Sustainability-March2023, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) I see...
[08:19:59] vgutierrez: yeah I think it requires a purged restart to allow the kafka client to reconfigure
[08:20:30] config file notifies purged, and the systemd service has restart => true so puppet should do it
[08:22:11] ah you meant if a manual action was needed
[08:22:12] okok
[08:22:38] indeed
[08:28:01] ack, disabled puppet on c:profile::cache::purge
[08:28:12] going to merge and test on cp1075
[08:28:20] eqiad? :)
[08:28:40] maybe a lDC with less traffic like ulsfo? ;P
[08:28:42] *DC
[08:29:33] sure sure, the worst that can happen is that the kafka client doesn't connect, we can easily revert, this is why I wasn't too worried. Anyway, will start with ulsfo :)
[08:29:55] cp4037?
[08:31:03] running puppet there
[08:31:29] nice
[08:32:20] watching https://grafana.wikimedia.org/d/RvscY1CZk/purged?orgId=1&var-site=ulsfo&var-cluster=cache_text&var-instance=cp4037&var-datasource=thanos&from=now-3h&to=now
[08:32:43] from the logs all good
[08:34:06] * vgutierrez double checking
[08:34:55] yup, all good
[08:35:00] we got an unrelated warning to fix though
[08:35:16] I'll fill a task
[08:35:22] I saw that yeah, the compression bit, probably snappy
[08:35:36] or maybe the warning is new after updating to bullseye
[08:37:56] should be "compression.codec": "snappy", in theory the client should know how to decompress etc..
[08:39:04] vgutierrez: I can re-enable in ulsfo and do a batch:1 series of restarts, so we can see how it goes on a broader range of nodes
[08:39:19] sure
[08:40:59] Traffic: purged issues a config warning on service start - https://phabricator.wikimedia.org/T332669 (Vgutierrez)
[08:41:27] Traffic: purged issues a config warning on service start - https://phabricator.wikimedia.org/T332669 (Vgutierrez) p:Triage→Medium
[08:41:36] ack doing it
[08:41:40] Traffic, Platform Engineering Roadmap Decision Making, SRE-Sprint-Week-Sustainability-March2023, serviceops, and 3 others: Reduce rate of purges emitted by MediaWiki - https://phabricator.wikimedia.org/T250205 (Joe) Open→Declined The task was more or less refused by the owners of the subs...
[09:03:15] vgutierrez: ulsfo done! afaics all good, shall I simply re-enable puppet and let it do its work on the rest of the nodes?
[09:06:08] yep
[09:07:17] done thanks for the support :)
[09:22:15] Traffic, Phabricator, SRE: Phabricator should cache tasks for a few minutes for logged-out users - https://phabricator.wikimedia.org/T274228 (Aklapper) a:mmodell→None @mmodell: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assi...
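Editor's sketch of the staged rollout discussed above (change applied to one host first, then a batch:1 series across the rest, stopping if a check fails). This is a minimal illustration, not the tooling used in the channel: `staged_rollout`, `RolloutHalted`, `apply_change` and `is_healthy` are hypothetical names, and the health check stands in for the manual log and dashboard review.

```python
from typing import Callable, Iterable, List


class RolloutHalted(Exception):
    """Raised when a batch fails its health check (hypothetical helper)."""

    def __init__(self, completed: List[str], failed: List[str]) -> None:
        super().__init__(f"rollout halted; unhealthy hosts: {failed}")
        self.completed = completed  # hosts that were changed and passed the check
        self.failed = failed        # hosts in the batch that failed the check


def staged_rollout(
    hosts: Iterable[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Apply a change batch by batch, checking health after each batch.

    With batch_size=1 the first host acts as the canary: if it fails the
    check, no further host is touched.
    """
    pending = list(hosts)
    completed: List[str] = []
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for host in batch:
            apply_change(host)
        failed = [host for host in batch if not is_healthy(host)]
        if failed:
            # Stop here so the remaining hosts keep the old configuration.
            raise RolloutHalted(completed, failed)
        completed.extend(batch)
    return completed
```

In this sketch a failed check leaves every later host untouched, which mirrors the reasoning in the log that a bad change can be reverted cheaply while only one low-traffic host is affected.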
[09:31:38] Traffic, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (Volans)
[09:35:00] Traffic, PyBal, SRE-Sprint-Week-Sustainability-March2023, Traffic-Icebox, Sustainability (Incident Followup): Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (Vgutierrez)
[09:35:10] Traffic, PyBal, SRE-Sprint-Week-Sustainability-March2023, Traffic-Icebox, Sustainability (Incident Followup): Pybal should reject a confctl configuration that indicates only one cp-text is pooled - https://phabricator.wikimedia.org/T245060 (Vgutierrez)
[09:37:07] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (aborrero) >>! In T327919#8711755, @Papaul wrote: > @cmooney Please see first batch proposal. We can mov...
[09:38:48] Traffic, SRE: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (Aklapper) a:ssingh→None @ssingh: Removing task assignee as this open task has been assigned for more than two years - See the email sent to task assignee on Feb...
[09:44:17] Traffic, SRE, Wikimedia-Incident: Power incident in eqsin - https://phabricator.wikimedia.org/T206861 (Vgutierrez)
[09:44:33] Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): Puppet doesn't restart ferm on failure - https://phabricator.wikimedia.org/T206951 (Vgutierrez) Open→Resolved a:jbond This is actually already fixed by https://gerrit.wikimedia.org/r/c/operations/puppet...
[09:47:21] Traffic, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (cmooney)
[09:51:26] netops, Infrastructure-Foundations, SRE, cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (cmooney) >>! In T327919#8713630, @aborrero wrote: >>>! In T327919#8711755, @Papaul wrote: >> @cmooney P...
[09:56:48] Traffic, SRE-Sprint-Week-Sustainability-March2023, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Vgutierrez)...
[10:13:32] Traffic, SRE-OnFire, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): (Re) evaluate effectiveness / usefulness of varnish/haproxy traffic drop alerts - https://phabricator.wikimedia.org/T310608 (fgiunchedi) Open→Resolved a:fgiunchedi I'm boldly resolving t...
[10:29:33] vgutierrez: hello sir, when you have a moment: https://gerrit.wikimedia.org/r/c/operations/alerts/+/900626
[10:36:17] Traffic, SRE-Sprint-Week-Sustainability-March2023, Sustainability (Incident Followup): cp3050 seemd more affected then otheres in recent incident - https://phabricator.wikimedia.org/T330682 (Vgutierrez) a:Vgutierrez
[10:38:25] thank you
[14:40:36] netops, DC-Ops, Infrastructure-Foundations, SRE, ops-eqiad: Q2:(Need By: TBD) Rows E/F network racking task - https://phabricator.wikimedia.org/T292095 (Jclark-ctr) @cmooney Racks e5-7 f5-7 have been cabled and racked do you want to use same ticket for those Switches?
[15:57:26] Traffic, SRE-Sprint-Week-Sustainability-March2023, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) >>! In...
[16:04:50] Traffic, SRE-Sprint-Week-Sustainability-March2023, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Vgutierrez)...
[16:43:56] Traffic, SRE: Deploy Wikidough: DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) public resolver - https://phabricator.wikimedia.org/T252132 (ssingh) a:ssingh
[16:44:49] Traffic, SRE: Wikidough: Support EDNS(0) Padding: RFC 7830 and RFC 8467 - https://phabricator.wikimedia.org/T274431 (ssingh) a:ssingh