[00:04:13] (DiskSpace) resolved: Disk space puppetmaster1001:9100:/ 4.878% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace
[01:53:42] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:45:17] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:53:43] (SystemdUnitFailed) firing: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:12:58] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2003.codfw.wmnet with OS bullseye...
[06:39:31] (SystemdUnitFailed) firing: (4) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:39:31] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[06:45:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:43:42] (SystemdUnitFailed) firing: (4) docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:43:43] (PuppetConstantChange) firing: Puppet performing a change on every puppet run on debmonitor2003:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[10:45:18] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:19:00] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (cmooney) As discussed with @papaul we may try to connect this to lsw1-a2-codfw instead, so that we can remove the requirement for a leaf switch in...
[11:50:04] (NodeTextfileStale) resolved: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:58:08] topranks: I noticed that puppet was stopped on sretest2003 as I moved all sretest to puppet7. I have fixed things, however it is still in a broken state as it can't reach the pki service
[11:58:23] btw we don't have any sretest200[12]
[12:01:58] netops, Infrastructure-Foundations, SRE, ops-codfw: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (cmooney) Discussed with @papaul and we will do this work on Thursday at 11.30am CDT / 16:30 UTC. Shouldn't be any inter...
[12:07:18] jbond: hey thanks, yeah sretest2003 I brought online yesterday - it's connected to the new switches in codfw (as is sretest2004, not yet connected), I'm using it just to verify everything is good on those before moving any real servers.
[12:08:05] pa.paul connected two decom'ed servers for me to do that, which I named sretest200[3-4]. We do have other sretest servers but I didn't want to bring them offline for a few weeks
[12:08:37] I did notice puppet failed after re-image, not a huge factor for the kind of tests I need to run but better it is properly set up
[12:09:26] in terms of it not being able to reach the pki service is that a network issue?
[12:15:37] topranks: ack not causing any issues for me, just noticed it in alertmanager. As to pki, yes it looks like some firewall issue. Connections are timing out. I didn't check if it was the pki firewall or some network ACL though
[12:16:15] but as you say it's not coming into service so no need to fix it necessarily
[12:18:30] jbond: sounds like the kind of network niggle the testing is designed to catch though - I'll see if I can work out what's happening
[12:18:34] thanks for the heads up!
[12:19:10] :) great and no probs
[12:24:47] The issue is the server connected to new switches can't reach pki.discovery.wmnet as the IP is in private1-b-codfw, which is a vlan they're preconfigured for (and thus think they should be able to reach it locally), but physical connection to existing asws hasn't been done.
[12:33:43] (SystemdUnitFailed) resolved: docker-reporter-k8s-images.service Failed on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:36:56] with any luck the alerts for docker-reporter-k8s-images.service and httpbb should now be routed to serviceops
[12:37:48] yay
[12:38:05] thanks for fixing this!
[12:38:23] np
[12:44:56] jbond: I'm about to try and reimage / bring sretest2004 online, is there anything specific I should do in terms of puppet7?
[12:45:22] depends which puppet you want on the new host :D
[12:45:25] I've a few simple tests to do then I can fix the reachability issue to pki btw
[12:45:26] by default it will get 5
[12:45:30] no matter the OS
[12:45:35] I don't really need puppet, just iproute2 and tcpdump :P
[12:45:41] rotfl
[12:45:55] I think the "sretest" name is why it's using 7 though?
[12:46:18] ah maybe john converted the role to it, it's possible
[12:46:46] yeah - he did something with sretest2003 manually to fix things, I guess I should repeat for sretest2004
[12:48:01] topranks: so the sretest role has been migrated to puppet7
[12:48:34] so you should reimage with the -p/--puppet-version 7 flag
[12:48:55] *I think*
[12:49:26] ah gotcha, cool yep will do :)
[12:51:26] the "I think" part is that I'm not sure if the reimage cookbook can automatically detect whether the role has been migrated to puppet7 and then set the puppet7 option by itself
[12:51:38] or not yet
[12:56:12] ok - well it's no problem for me to add the flag when running it for sretest2004
[12:59:22] that causes the cookbook to tell me this actually:
[12:59:27] https://www.irccloud.com/pastebin/6YZy2bS9/
[13:00:25] but it's already set for the role so good to proceed I think
[13:00:52] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[13:03:54] topranks: yep
[13:04:01] profile::puppet::agent::force_puppet7: true
[13:04:05] is set on hieradata/role/common/sretest.yaml
[13:36:47] SRE-tools, Infrastructure-Foundations, Puppet-Infrastructure, SRE, and 2 others: Update reimage cookbooks to work with puppet7 - https://phabricator.wikimedia.org/T348319 (jbond)
[13:37:12] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
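Note on the puppet7 exchange above: the open question was whether the reimage cookbook could work out by itself that a role has been migrated to puppet7. The sketch below is only an illustration of that idea, not the cookbook's actual logic. It assumes the role's hiera file is available as plain text, and the helper names `role_forces_puppet7` and `puppet_version_args` are hypothetical. Only the hiera key and the `--puppet-version 7` flag come from the conversation.

```python
# Hypothetical helpers, not part of the real reimage cookbook.
# Assumption: the role hiera file (e.g. hieradata/role/common/sretest.yaml)
# has already been read into a string.

FORCE_KEY = "profile::puppet::agent::force_puppet7"


def role_forces_puppet7(hiera_text: str) -> bool:
    """Return True if the hiera text sets the force_puppet7 key to true."""
    for raw in hiera_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore trailing comments
        if not line.startswith(FORCE_KEY + ":"):
            continue
        value = line[len(FORCE_KEY) + 1:].strip().lower()
        return value == "true"
    return False


def puppet_version_args(hiera_text: str) -> list[str]:
    """Return the extra reimage arguments implied by the role's hiera data.

    An empty list means "leave the default", which per the discussion
    above is puppet 5 regardless of the OS.
    """
    if role_forces_puppet7(hiera_text):
        return ["--puppet-version", "7"]
    return []


if __name__ == "__main__":
    sample = "profile::puppet::agent::force_puppet7: true\n"
    print(puppet_version_args(sample))
```

The line-based match keeps the sketch free of third-party YAML dependencies; a real implementation would more likely load the file with a proper YAML parser.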
[13:38:20] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye
[15:13:46] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (cmooney) Above patch reflects my thinking on the best approach for this. I've taken the approach that we should announce all our internal...
[15:27:33] SRE-tools, DC-Ops, Infrastructure-Foundations: sre.hosts.reimage: fails to get uptime in debian installer - https://phabricator.wikimedia.org/T342345 (Volans) Open→Resolved a:Volans Resolving for now, feel free to re-open in case it happens again.
[15:38:44] netops, Infrastructure-Foundations, SRE, ops-codfw: Connect two hosts in codfw row A/B for switch migration testing - https://phabricator.wikimedia.org/T345803 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1001 for host sretest2004.codfw.wmnet with OS bullseye...
[15:41:52] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (cmooney) FWIW in my original config for this I had terms to match routes redistributed into BGP locally and announced in IBGP, or between c...
[15:47:37] netbox, Infrastructure-Foundations, observability: Flapping Prometheus metrics for netbox_device_statistics - https://phabricator.wikimedia.org/T276749 (Volans) Open→Resolved a:Volans This hasn't happened in a long time. Resolving.
[16:19:21] netops, Infrastructure-Foundations, SRE, ops-codfw: Migrate mr1-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348164 (Papaul) @cmooney for the cross rack link it does make sense to use copper with 1000BaseT since we have those already on site. On the other hand sin...
[22:25:13] (DiskSpace) firing: Disk space puppetmaster1001:9100:/ 5.945% free - https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space - https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&viewPanel=12&var-server=puppetmaster1001 - https://alerts.wikimedia.org/?q=alertname%3DDiskSpace