[07:54:06] 10netops, 10Infrastructure-Foundations: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (10ayounsi) p:05Triage→03High
[08:30:29] 10netops, 10Infrastructure-Foundations, 10SRE: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (10ayounsi) > Kindly be informed that we have logged your issue under ref 01420952, we will investigate and get back to you with our findings.
[09:44:16] 10Traffic, 10Data-Engineering, 10SRE, 10Patch-For-Review, 10User-zeljkofilipin: intake-analytics is responsible for up to a 85% of varnish backend fetch errors - https://phabricator.wikimedia.org/T306181 (10phuedx) >>! In T306181#8013301, @Ottomata wrote: > Thanks ben! Seconded. Thanks for all of your w...
[12:29:58] 10Traffic, 10DC-Ops, 10SRE, 10decommission-hardware, 10ops-ulsfo: decommission bast4002.wikimedia.org - https://phabricator.wikimedia.org/T288579 (10MoritzMuehlenhoff)
[12:30:25] 10Traffic, 10DC-Ops, 10SRE, 10ops-ulsfo: (Need By: TBD) rack/setup/install ganeti4004 - https://phabricator.wikimedia.org/T289715 (10MoritzMuehlenhoff) 05Open→03Resolved ganeti4004 has been added to the ganeti/ulsfo cluster now. Cluster is currently rebalancing.
[14:52:22] I'd love some opinions on https://phabricator.wikimedia.org/T310303 :)
[15:00:42] vgutierrez: \o https://gerrit.wikimedia.org/r/c/operations/puppet/+/807133 I have this CR and would like to do the pybal dance :)
[15:00:55] Well "like" is a strong word...
[15:01:04] :(
[15:01:24] Not about you, all about the scariness :)
[15:01:43] hmm I'm sad cause your previous CR broke conftool
[15:02:08] I guess that you need to add the backend servers to the ml_staging cluster
[15:02:19] as
[15:02:22] vgutierrez@puppetmaster1001:~$ sudo -i confctl --quiet select 'cluster=ml_staging,service=kubesvc' get
[15:02:22] vgutierrez@puppetmaster1001:~$
[15:02:25] that's still empty
[15:02:32] Hmm, I see
[15:02:53] I think the doc linked to the icinga alert could be improved
[15:03:34] There is a doc in the alert?
[15:03:41] if you're talking about https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=puppetmaster2001&service=Confd+template+for+%2Fsrv%2Fconfig-master%2Fpybal%2Fcodfw%2Finference-staging and https://wikitech.wikimedia.org/wiki/Confd#Monitoring
[15:03:54] klausman: see the small "extra notes" at the top right
[15:04:01] so let's add the real servers before moving on with https://gerrit.wikimedia.org/r/c/operations/puppet/+/807133
[15:04:04] subtle
[15:04:10] or the folder-like icon in https://icinga.wikimedia.org/alerts
[15:04:19] cause pybal won't be happy either with a service without backend servers available
[15:04:21] at the end of the "service" column
[15:05:44] I've updated the change to include that (I think)
[15:12:32] klausman: looks good, please set the servers as pooled before running puppet on lvs2010
[15:12:46] roger
[15:13:00] 10Traffic, 10SRE, 10Patch-For-Review, 10SRE Observability (FY2021/2022-Q4), 10User-fgiunchedi: Migrate Traffic Prometheus alerts from Icinga to Alertmanager - https://phabricator.wikimedia.org/T300723 (10BCornwall) the varnish-mmap-count situation could be resolved with https://github.com/prometheus/proc...
[15:15:25] How do you pool a server if it has neither the helper scripts nor confctl installed?
[15:16:36] you can run "sudo -i confctl pool --hostname foo.eqiad.wmnet" on puppetmaster1001
[15:16:45] thx
[15:18:23] pooled both workers, now running puppet-merge...
[15:18:35] moritzm, ok to merge "Remove old buster IDPs from Puppet"?
[15:19:05] moritzm: ^^^
[15:19:46] yes, please
[15:21:17] about to do run-puppet-merge on O::lvs::balancer
[15:22:55] vgutierrez: I presume 2009 and 2010 are the pybals that will need a restart (no, not doing it right now)
[15:23:08] yes, lvs2010 and then lvs2009
[15:23:22] ok, waiting on cumin atm
[15:24:02] BTW, servers are still set as inactive and with weight 0
[15:24:30] I ran the command moritzm supplied and it completed without error?
[15:24:47] vgutierrez@puppetmaster1001:~$ sudo -i confctl --quiet select 'cluster=ml_staging,service=kubesvc' get
[15:24:47] {"ml-staging2002.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_staging,service=kubesvc"}
[15:24:47] {"ml-staging2001.codfw.wmnet": {"weight": 0, "pooled": "inactive"}, "tags": "dc=codfw,cluster=ml_staging,service=kubesvc"}
[15:25:55] I don't know what's missing
[15:26:10] maybe I need to set the weight? But to what value?
[15:26:41] non-zero, basically; "1" seems the chosen one for ml_serve
[15:27:05] Ok, set weight to 1 for the two workers
[15:27:15] and the servers aren't pooled yet, they are marked as inactive; set/pooled=yes will pool them
[15:28:41] fixed that as well and ack'd the icinga alerts
[15:29:25] nice, confctl looks good now :)
[15:29:32] Yaaaay :)
[15:29:50] Ok to restart pybal on lvs2010?
[15:30:01] go ahead please
[15:30:31] restarted
[15:31:27] vgutierrez@lvs2010:~$ sudo -i ipvsadm -Ln |grep 30443
[15:31:27] TCP 10.2.1.58:30443 wrr
[15:31:27] -> 10.192.0.201:30443 Route 1 0 0
[15:31:27] -> 10.192.48.174:30443 Route 1 0 0
[15:31:35] looking good :)
[15:31:44] Phew
[15:33:29] Hmm, https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=lvs2010&service=PyBal+IPVS+diff+check
[15:33:36] `CRITICAL: Hosts in IPVS but unknown to PyBal: set(['mw2320.codfw.wmnet'])`
[15:33:40] Is that spurious?
[15:34:17] 10netops, 10Infrastructure-Foundations, 10SRE: Telia ulsfo transit v4 BGP down - https://phabricator.wikimedia.org/T311038 (10ayounsi) 05Open→03Resolved a:03ayounsi > This should be fixed. Looks like it was a configuration failure during the planned migration PWIC218882.3. Confirmed resolved.
[15:35:25] Ok, that alert cleared, it seems
[15:35:38] yep
[15:36:06] about to restart pybal on lvs2009
[15:37:51] go ahead
[15:38:21] and done.
[15:38:33] TCP 10.2.1.58:30443 wrr
[15:38:35] -> 10.192.0.201:30443 Route 1 0 0
[15:38:37] -> 10.192.48.174:30443 Route 1 0 0
[15:38:42] looks good.
[15:40:14] thanks Valentin. these are always harrowing, but also rewarding :)
[15:40:24] no problem
[15:42:36] I am now getting a warning that there are stale template error files, will that go away by itself?
[15:45:23] I've cleaned those manually
[15:47:12] thank you!
[15:50:56] (HAProxyEdgeTrafficDrop) firing: 35% request drop in text@ulsfo during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=12&orgId=1&from=now-24h&to=now&var-site=ulsfo&var-cache_type=text - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[15:55:56] (HAProxyEdgeTrafficDrop) resolved: (4) 60% request drop in text@drmrs during the past 30 minutes - https://wikitech.wikimedia.org/wiki/Monitoring/EdgeTrafficDrop - https://alerts.wikimedia.org/?q=alertname%3DHAProxyEdgeTrafficDrop
[20:28:33] 10Traffic, 10DNS, 10SRE, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) >>! In T310738#8007136, @Dzahn wrote: > There are incoming redirects into policy.wikimedia.org: > > https://wikimedia....
[21:19:02] 10Traffic, 10DNS, 10SRE, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Dzahn) >>! In T310738#8017973, @Varnent wrote: > @Dzahn - is that doable? I am not sure if we have redirected to web.archive.org...
[22:06:36] 10Traffic, 10SRE: Set CORS headers on error pages? - https://phabricator.wikimedia.org/T270526 (10BCornwall)
[22:08:00] 10Traffic, 10Infrastructure-Foundations, 10SRE, 10SRE-tools, 10IPv6: Some Traffic clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271144 (10BCornwall)
[22:50:54] 10Traffic, 10DNS, 10SRE, 10WMF-Legal, and 2 others: Setup redirect of policy.wikimedia.org to Advocacy portal on Foundation website - https://phabricator.wikimedia.org/T310738 (10Varnent) >>! In T310738#8018203, @Dzahn wrote: >>>! In T310738#8017973, @Varnent wrote: >> @Dzahn - is that doable? I am not sur...
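
Editor's note: for reference, the conftool pooling sequence worked out between 15:02 and 15:28 boils down to the commands below, run from puppetmaster1001. This is a sketch reconstructed from the session: the combined set/weight=1:pooled=yes form is an assumption extrapolated from the "set/pooled=yes" hint at 15:27:15, so verify it against the confctl documentation before relying on it.

    # Inspect the backends conftool knows about for the new LVS service;
    # an empty result means pybal has nothing to balance to:
    sudo -i confctl --quiet select 'cluster=ml_staging,service=kubesvc' get

    # Give each realserver a non-zero weight and pool it (assumed combined
    # set/... syntax; the log only confirms "set/pooled=yes" on its own):
    sudo -i confctl select 'name=ml-staging2001.codfw.wmnet' set/weight=1:pooled=yes
    sudo -i confctl select 'name=ml-staging2002.codfw.wmnet' set/weight=1:pooled=yes

    # Re-run the select and confirm both hosts show weight 1 and
    # pooled=yes before touching pybal on the LVS hosts:
    sudo -i confctl --quiet select 'cluster=ml_staging,service=kubesvc' get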
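
Editor's note: the restart-and-verify dance performed above on lvs2010 and then lvs2009 follows. A minimal sketch, assuming pybal runs as the systemd unit `pybal` on the LVS hosts; the ipvsadm check mirrors the one pasted at 15:31:27.

    # On the secondary balancer first (lvs2010 here), then the primary
    # (lvs2009), restart pybal so it picks up the new service:
    sudo systemctl restart pybal

    # Confirm the new service (NodePort 30443 in this case) appears in the
    # kernel's IPVS table with both realservers at the expected weights:
    sudo -i ipvsadm -Ln | grep -A2 30443

    # Then watch Icinga's "PyBal IPVS diff check"; a transient diff (as
    # with mw2320.codfw.wmnet above) may clear on its own after a restart.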