[06:07:26] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[06:30:21] (SLOMetricAbsent) firing: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[06:35:09] (SLOMetricAbsent) resolved: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:01:15] (SLOMetricAbsent) firing: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:06:15] (SLOMetricAbsent) resolved: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:14:31] (SLOMetricAbsent) firing: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[09:19:31] (SLOMetricAbsent) resolved: - https://alerts.wikimedia.org/?q=alertname%3DSLOMetricAbsent
[10:31:39] :?
[10:34:32] I've also seen other metrics alerts (for NEL) on -operations but not sure if it's related
[10:40:05] there were issues with thanos earlier, should be fixed now
[10:40:49] tnx
[11:45:01] netops, Infrastructure-Foundations, SRE: Use default BGP multihop TTL between devices - https://phabricator.wikimedia.org/T350488 (cmooney) Open→Resolved Patches merged, all looking ok. For example on dns5004 this was situation before, server using TTL 2, CR using 193: ` 19:27:22.338917 IP (...
[12:37:38] hello - I'd like to do some lvs restarts for low-traffic to merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/973827 if that would be okay?
[12:46:23] netops, Infrastructure-Foundations, SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (cmooney) I'm gonna close this one for now, if we see an issue again we should get a better error message which should point us to what PuppetDB data triggered i...
[12:46:31] netops, Infrastructure-Foundations, SRE: Netbox PuppetDB Import Script Failing for cloudnet1006 - https://phabricator.wikimedia.org/T350479 (cmooney) Open→Resolved
[12:47:06] netops, Infrastructure-Foundations, SRE: Use default BGP multihop TTL between CRs and servers - https://phabricator.wikimedia.org/T350488 (cmooney)
[13:57:36] hnowlan: go for it :)
[14:01:05] thanks! doing now
[14:01:06] <_joe_> who is maintaining varnishkafka?
[14:01:52] <_joe_> a volunteer created a merge request for it
[14:01:57] <_joe_> https://gitlab.wikimedia.org/repos/sre/varnishkafka/-/merge_requests/1
[14:02:19] <_joe_> not something /we/ need, but apparently they use VK so it would be great to be good FLOSS maintainers and take a look
[14:03:49] _joe_: we (Traffic) maintain it as in the debian packaging for it
[14:06:32] <_joe_> sukhe: ok, then this has to do with packaging as well :)
[14:06:48] <_joe_> it would be nice if someone took the time to take a look
[14:07:52] I'll try to have a look
[14:08:05] as I did the bookworm package (IIRC)
[14:09:06] yeah!
[14:10:16] lvs restarts all done, nothing on fire afaict
[14:11:16] hnowlan: thanks for keeping the fires away
[14:26:42] AFAIK elukey was on top of varnishkafka in the past
[14:28:12] code wise yes I can surely check in case it is needed, in this case I think that the change is very simple
[14:28:49] (only rpm packaging, I'd say that we can trust the contribution and merge)
[14:29:40] I'll try to hit it against rpmlint
[14:29:47] and see if something obvious shows up
[14:31:05] do we keep packaging stuff in main or a dedicated branch?
[14:41:35] volans: at least prior to the gitlab migration (where this repo is now), we were all over the place
[14:41:51] in some dedicated, in some main, in some the branches names were like debian or debian-wmf :P
[14:42:03] brett is standardizing it for the gitlab migration I think
[14:42:03] yeah I meant in this one
[14:42:09] each project has its own
[14:52:52] 0 packages and 1 specfiles checked; 0 errors, 0 warnings.
[14:53:00] rpmlint seems to be happy :)
[15:48:11] holy shit that was quick
[15:50:30] volans: We're going to be following Emperor's work on packaging in Gitlab (https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI)
[15:50:54] tl;dr bullseye-wikimedia, bookworm-wikimedia etc branches
[15:51:20] and e.g. dgit/bookworm branches for source tracking
[15:52:20] then I guess you want a separate branch for the rpm stuff too
[15:53:00] Are we interested in including other distro build recipes?
[15:53:38] Also, would we be interested in adding a 'contrib/' dir with the systemd unit file that's part of the spec
[16:17:22] doesn't need to be in contrib, we can just include the one we use in puppet to the main source tree, systemd units are meant to be compatible across distros after all
[16:17:45] (the systemd unit, not the spec file)
[16:17:46] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin1001 for host acmechief1001.eqiad.wmnet with OS bookworm
[16:26:57] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[16:30:30] Traffic, DC-Ops, SRE, ops-eqiad: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=154babc2-d86e-4f5b-baf5-fb36e9d129e4) set by fabfur@cumin1001 for 14 days, 0:00:00 on 6 host(s) and thei...
[16:31:42] (SystemdUnitFailed) firing: acme-chief-certs-sync.service Failed on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:32:05] that's brett reimaging acmechief1001
[16:32:09] brett: ^ expected?
[16:32:09] yeah
[16:32:37] thanks
[16:32:43] host hasn't been removed from the list of passive hosts so it's still attempting to sync certs
[16:32:53] * brett whistles
[16:36:07] volans: Sorry for assigning you as reviewer to the varnishkafka MR, I was trying to do multiple reviewers (which we apparently cannot on community edition)
[16:37:28] I was a bit puzzled indeed, also I have zero context about varnishkafka and never built an rpm in my life, so clearly you don't want me to review this one ;)
[16:38:11] but no prob, TIL we can't add more reviewers
[16:38:34] how do we subscribe to MRs? you know, to get updates on the progress and so on
[16:38:48] I can't find the right way, or maybe I overlooked it
[16:49:29] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin1001 for host acmechief1001.eqiad.wmnet with OS bookworm completed: - acmechief1001 (**WARN**) - Downtimed on Icinga/A...
[16:56:42] (SystemdUnitFailed) resolved: acme-chief-certs-sync.service Failed on acmechief2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:01:15] fabfur: I think you can three-dots at the top-right and then turn on notifications?
[18:33:24] Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (BCornwall)
[18:33:49] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall) p:Triage→Medium
[18:57:19] netops, Infrastructure-Foundations, SRE: Add BGP to protocols contributing to aggregates - https://phabricator.wikimedia.org/T351456 (cmooney) p:Triage→Low
[19:23:38] Traffic, GitLab (Project Migration), Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (BCornwall)
[23:10:33] netops, Infrastructure-Foundations, SRE, ops-codfw: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c937612c-c0eb-4c9e-a245-9810a56c0a33) set by cmooney@cu...