[07:50:15] <wikibugs>	 netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) a:ayounsi Emailed the 2 networks again. I'll delete the sessions if they don't reply or fix them.
[10:05:01] <wikibugs>	 netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (ayounsi) With the addition of `L3` switches it makes sense to not only take into consideration OSPF or `L2` vlans.  For unicast "regular" external...
[10:48:24] <vgutierrez>	 _joe_: morning sir, let me know what you think about https://gerrit.wikimedia.org/r/c/operations/puppet/+/974623
[10:53:36] <_joe_>	 vgutierrez: sure, I'm a tad busy this morning, and I have tons of meetings in the afternoon; I'll do what I can
[11:03:53] <vgutierrez>	 _joe_: ack
[11:09:17] <vgutierrez>	 _joe_: any else that could review service catalog related changes?
[11:09:22] <vgutierrez>	 *anybody
[11:09:45] <vgutierrez>	 no problem in waiting though, but I don't wanna burden you unnecessarily 
[11:28:20] <_joe_>	 vgutierrez: you already have me and volans on the patch
[11:28:34] <vgutierrez>	 oh cool
[11:28:46] * volans hides from the puppet one
[11:29:19] <volans>	 I've already reviewed the spicerack one, and btw thanks vgutierrez for sending that spontaneously, it's probably the first time we get the patch before realizing it's missing :D
[11:29:34] <vgutierrez>	 thank _joe_, he told me that it was needed :)
[11:29:52] <volans>	 lol
[11:31:37] <_joe_>	 vgutierrez: I tell everyone
[11:32:33] <vgutierrez>	 volans: let's go with spicerack first then :)
[11:33:19] <volans>	 sure, feel free to +2 anytime (I've already +1ed and john too), I'll make a release later
[11:41:54] <vgutierrez>	 done :)
[11:42:19] <duesen>	 Hi, CI for RESTbase is broken because of a TLS issue with parsoid-external-ci.access.beta.wmflabs.org. Any ideas for fixing that? It currently prevents any changes to restbase from being merged. The ticket is https://phabricator.wikimedia.org/T350353
[11:43:40] <duesen>	 This problem has been around for quite a while for local development, but only recently started to also affect github. I'm surprised it hasn't been broken there for longer.
[11:47:43] <vgutierrez>	 duesen: already commented there, if that worked in the past some change moved that endpoint from using Let's Encrypt certs to WMF PKI ones
[12:42:29] <sukhe>	 _joe_: when you sit down to review patches could you also look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/975009?
[12:42:43] <sukhe>	 not urgent <3
[13:03:42] <jinxer-wm>	 (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:03:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:35] <wikibugs>	 Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) As you might have noticed by the patches here, we've pivoted as traffic splitting to the canaries via kube-proxy converges over hours, not seconds which is what we'll nee...
[15:13:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) prometheus_puppet_agent_stats.timer Failed on lvs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus_puppet_agent_stats.timer Failed on cp1104:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: (13) prometheus_puppet_agent_stats.timer Failed on cp1104:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:25:59] <vgutierrez>	 volans: let me know when it's safe to merge the puppet side of the spicerack CR
[15:26:18] <volans>	 vgutierrez: ack, in a meeting right now, I can make a release after that
[15:26:36] <vgutierrez>	 thx!
[15:26:39] <volans>	 I was waiting a sec to get a patch from jo.hn in too
[15:26:45] <vgutierrez>	 sure
[15:39:12] <jinxer-wm>	 (SystemdUnitFailed) firing: (10) prometheus_puppet_agent_stats.timer Failed on cp1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:13] <jinxer-wm>	 (SystemdUnitFailed) resolved: (11) prometheus_puppet_agent_stats.timer Failed on cp1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:15] <wikibugs>	 Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[16:27:56] <duesen>	 vgutierrez: do you have an idea for fixing this? Right now it's a blocker for all work on restbase. 
[16:28:28] <vgutierrez>	 duesen: that endpoint needs to use a LE certificate rather than an internal one
[16:28:42] <vgutierrez>	 deployment-prep environment has an acme-chief instance so it shouldn't be a big deal
[16:28:48] <vgutierrez>	 but I'm not familiar with parsoid puppetization 
[16:29:49] <wikibugs>	 Acme-chief: acme-chief service started on a passive node after reimage - https://phabricator.wikimedia.org/T351655 (Vgutierrez)
[16:30:32] <wikibugs>	 Acme-chief, Traffic: acme-chief service started on a passive node after reimage - https://phabricator.wikimedia.org/T351655 (Vgutierrez) p:Triage→High
[16:48:48] <duesen>	 vgutierrez: neither am I...
[16:50:44] <vgutierrez>	 I can take a look but I suspect that's more for hnowlan's scope than mine
[16:53:14] <duesen>	 ok, thanks
[17:01:55] <vgutierrez>	 it looks like profile::tlsproxy::envoy::ssl_provider should be set to acme for deployment-parsoid12 (after configuring acme-chief there to issue the expected certificate)
[17:32:23] <taavi>	 it's been this way since https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/d0378e96aa0e94cf44b3ad07e4a658990f2a613d%5E%21/#F0 (early 2021), why is it causing problems only now?
[18:19:02] <volans>	 vgutierrez: all done, spicerack has ipip_encapsulation support
[18:19:16] <volans>	 >>> a.lvs.ipip_encapsulation
[18:19:16] <volans>	 False
[18:21:08] <vgutierrez>	 volans: thx
[18:27:36] <wikibugs>	 Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host acmechief2001.codfw.wmnet with OS bookworm
[19:03:42] <wikibugs>	 Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host acmechief2001.codfw.wmnet with OS bookworm completed: - acmechief2001 (**WARN**)   - Downtimed on Icinga/A...
[19:17:44] <jinxer-wm>	 (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4045:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:17:58] <sukhe>	 ^ host was depooled and rebooting, expected