[07:50:15] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: BGP status (instance cr2-eqdfw) - https://phabricator.wikimedia.org/T351083 (ayounsi) a:ayounsi Emailed the 2 networks again. I'll delete the sessions if they don't reply or fix them.
[10:05:01] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (ayounsi) With the addition of `L3` switches it makes sense to not only take into consideration OSPF or `L2` vlans. For unicast "regular" external...
[10:48:24] _joe_: morning sir, let me know what you think about https://gerrit.wikimedia.org/r/c/operations/puppet/+/974623
[10:53:36] <_joe_> vgutierrez: sure, I'm a tad busy this morning, and I have tons of meetings in the afternoon; I'll do what I can
[11:03:53] _joe_: ack
[11:09:17] _joe_: any else that could review service catalog related changes?
[11:09:22] *anybody
[11:09:45] no problem in waiting though, but I don't wanna burden you unnecessarily
[11:28:20] <_joe_> vgutierrez: you already have me and volans on the patch
[11:28:34] oh cool
[11:28:46] * volans hides from the puppet one
[11:29:19] I've already reviewed the spicerack one, and btw thanks vgutierrez for sending that spontaneously, it's probably the first time we get the patch before realizing it's missing :D
[11:29:34] thank _joe_, he told me that it was needed :)
[11:29:52] lol
[11:31:37] <_joe_> vgutierrez: I tell everyone
[11:32:33] volans: let's go with spicerack first then :)
[11:33:19] sure, feel free to +2 anytime (I've already +1ed and john too), I'll make a release later
[11:41:54] done :)
[11:42:19] Hi, CI for RESTbase is broken because of a TLS issue with parsoid-external-ci.access.beta.wmflabs.org. Any ideas for fixing that? It currently prevents any changes to restbase from being merged.
The ticket is https://phabricator.wikimedia.org/T350353
[11:43:40] This problem has been around for quite a while for local development, but only recently started to also affect github. I'm surprised it hasn't been broken there for longer.
[11:47:43] duesen: already commented there, if that worked in the past some change moved that endpoint from using Let's Encrypt certs to WMF PKI ones
[12:42:29] _joe_: when you sit down to review patches could you also look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/975009?
[12:42:43] not urgent <3
[13:03:42] (SystemdUnitFailed) firing: export_smart_data_dump.service Failed on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:03:42] (SystemdUnitFailed) resolved: export_smart_data_dump.service Failed on lvs1013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:09:35] Traffic, MW-on-K8s, SRE, serviceops, and 2 others: Migrate mobileapps to k8s - https://phabricator.wikimedia.org/T350846 (Joe) As you might have noticed by the patches here, we've pivoted as traffic splitting to the canaries via kube-proxy converges over hours, not seconds which is what we'll nee...
[15:13:42] (SystemdUnitFailed) firing: (2) prometheus_puppet_agent_stats.timer Failed on lvs2013:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:42] (SystemdUnitFailed) firing: (10) prometheus_puppet_agent_stats.timer Failed on cp1104:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:23:42] (SystemdUnitFailed) resolved: (13) prometheus_puppet_agent_stats.timer Failed on cp1104:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:25:59] volans: let me know when it's safe to merge the puppet side of the spicerack CR
[15:26:18] vgutierrez: ack, in a meeting right now, I can make a release after that
[15:26:36] thx!
[15:26:39] I was waiting a sec to get a patch from jo.hn in too
[15:26:45] sure
[15:39:12] (SystemdUnitFailed) firing: (10) prometheus_puppet_agent_stats.timer Failed on cp1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:44:13] (SystemdUnitFailed) resolved: (11) prometheus_puppet_agent_stats.timer Failed on cp1101:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:15] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[16:27:56] vgutierrez: do you have an idea for fixing this?
Right now it's a blocker for all work on restbase.
[16:28:28] duesen: that endpoint needs to use a LE certificate rather than an internal one
[16:28:42] deployment-prep environment has an acme-chief instance so it shouldn't be a big deal
[16:28:48] but I'm not familiar with parsoid puppetization
[16:29:49] Acme-chief: acme-chief service started on a passive node after reimage - https://phabricator.wikimedia.org/T351655 (Vgutierrez)
[16:30:32] Acme-chief, Traffic: acme-chief service started on a passive node after reimage - https://phabricator.wikimedia.org/T351655 (Vgutierrez) p:Triage→High
[16:48:48] vgutierrez: neither am I...
[16:50:44] I can take a look but I suspect that's more for hnowlan's scope than mine
[16:53:14] ok, thanks
[17:01:55] it looks like profile::tlsproxy::envoy::ssl_provider should be set to acme for deployment-parsoid12 (after configuring acme-chief there to issue the expected certificate)
[17:32:23] it's been this way since https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/d0378e96aa0e94cf44b3ad07e4a658990f2a613d%5E%21/#F0 (early 2021), why is it causing problems only now?
[18:19:02] vgutierrez: all done, spicerack has ipip_encapsulation support
[18:19:16] >>> a.lvs.ipip_encapsulation
[18:19:16] False
[18:21:08] volans: thx
[18:27:36] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host acmechief2001.codfw.wmnet with OS bookworm
[19:03:42] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host acmechief2001.codfw.wmnet with OS bookworm completed: - acmechief2001 (**WARN**) - Downtimed on Icinga/A...
[19:17:44] (VarnishPrometheusExporterDown) resolved: Varnish Exporter on instance cp4045:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[19:17:58] ^ host was depooled and rebooting, expected