[01:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[02:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:42:56] FIRING: ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:47:56] FIRING: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:52:56] RESOLVED: [2x] ProbeDown: Service mirror1001:443 has failed probes (http_mirrors_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mirror1001:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:46:04] netmon1003 changes at every puppet run is for this:
[08:46:41] /Stage[main]/Librenms/File[/srv/deployment/librenms/librenms/storage/framework/cache/data/27/45/2745da5ffa1c30968efa55bffb4f58b2d7a690a8]/owner owner changed 'librenms' to 'www-data' (corrective) (and permissions too)
[08:46:48] always the same cache item
[08:46:56] see https://puppetboard.wikimedia.org/report/netmon1003.wikimedia.org/b71a126c2b1d14b07fcff6e2fdc55ebc8411bca5 for example
[08:53:29] might be related to the recent librenms update/deploy? it was updated on Friday for https://phabricator.wikimedia.org/T384036
[09:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:05:18] hmm yeah I assume that is since the upgrade too
[10:05:58] not sure what the fix is, librenms runs as user librenms so makes sense it creates the files that way. But most of the contents of those directories is owned by www-data and group librenms
[10:16:16] I opened https://phabricator.wikimedia.org/T384440
[10:38:31] netops, Infrastructure-Foundations, SRE: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10483284 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe2806ef-4f5c-4485-981c-52b89f9e3154) set by cmooney@cumin1002 for 2:00:00 on 1 host(s) and th...
[10:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:14:01] SRE-tools, Infrastructure-Foundations: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462 (jcrespo) NEW
[12:41:46] SRE-tools, Infrastructure-Foundations: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#10483791 (Volans) p:Triage→Low If we want to catch the specific error of unauthorized that should be done in Spicera...
[12:42:01] SRE-tools, Infrastructure-Foundations, Spicerack: Reimaging with a cookbook this kind of server with a wrong management password leads to exception - https://phabricator.wikimedia.org/T384462#10483794 (Volans)
[12:44:35] SRE-tools, Infrastructure-Foundations, SRE, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10483800 (MoritzMuehlenhoff)
[13:06:06] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10483855 (cmooney) The above patch adds BGP stats collection to our current setup. Tested in Magru and working well, albeit with a few quirks disc...
[13:16:09] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10483868 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ba072b6c-6957-428b-a932-dfcf0b3f8103) set by cmooney@cumin1002 for 2:00:...
[13:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:52:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[13:57:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:02:49] FIRING: [3x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:17:11] netops, Infrastructure-Foundations, SRE: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473 (cmooney) NEW p:Triage→Low
[14:17:15] netops, Infrastructure-Foundations, Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#10484100 (cmooney)
[14:17:17] netops, Infrastructure-Foundations, SRE: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484099 (cmooney)
[14:17:35] netops, Infrastructure-Foundations, SRE: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484103 (cmooney)
[14:17:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:18:45] netops, Infrastructure-Foundations, SRE: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484104 (cmooney)
[14:19:16] netops, Infrastructure-Foundations, SRE: Enable BGP multipath at internet edge - https://phabricator.wikimedia.org/T384473#10484107 (cmooney)
[14:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[14:43:26] SRE-tools, Infrastructure-Foundations, SRE, IPv6: Enable ipv6 on ganeti2019-ganeti2024 - https://phabricator.wikimedia.org/T379890#10484178 (MoritzMuehlenhoff)
[14:48:15] I'm temporarily switching aux-k8s-etcd2004 to DRBD to move it off a ganeti node which will be reimaged, latencies will briefly go up a bit
[14:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:30:30] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10485502 (andrea.denisse) Looking at the changelog I wonder if this issue could be related to this [[ https://github...
[17:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:04:12] netops, Infrastructure-Foundations, observability, SRE: LibreNMS reporting no routes learnt from doh/durum Anycast peers at various POPs - https://phabricator.wikimedia.org/T384258#10485755 (cmooney) >>! In T384258#10485502, @andrea.denisse wrote: > Looking at the changelog I wonder if this issue...
[18:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[18:48:13] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10485972 (CDanis) > All of this does suggest we should probably look at running distributed collectors as we move to productionize this, potentiall...
[18:51:25] FIRING: SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:41:56] FIRING: MaxConntrack: Max conntrack at 81.21% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:46:55] RESOLVED: MaxConntrack: Max conntrack at 81.21% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[20:16:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:21:26] FIRING: [2x] SystemdUnitFailed: debian-weekly-rebuild.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:34:02] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10486590 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe40d399-fce9-41c4-b12a-4bcb36770f4b) set by cmooney@cumin1002 for 1:00:...
[21:34:49] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netmon1003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[21:47:18] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10486643 (cmooney) >>! In T369384#10485972, @CDanis wrote: > The aux clusters are waiting for us :D and we do have one in codfw as well now. Yep i...
[22:07:49] FIRING: [4x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-worker1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[22:22:49] FIRING: [2x] PuppetConstantChange: Puppet performing a change on every puppet run on aux-k8s-ctrl1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[23:46:53] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Productionize gnmic network telemetry pipeline - https://phabricator.wikimedia.org/T369384#10487032 (cmooney)