[00:14:26] FIRING: [2x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:19:26] FIRING: [4x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:20:44] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:24:26] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:25:44] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:30:44] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:34:26] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:44:26] FIRING: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:45:44] RESOLVED: [5x] SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:54:26] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:49:26] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:44:31] netops, DBA, DC-Ops, Infrastructure-Foundations, and 2 others: Migrate codfw row C & D database hosts to new Leaf switches - https://phabricator.wikimedia.org/T370852#10083924 (ABran-WMF) after a quick chat with @cmooney, I've taken inventory of the 87 servers to handle: |**rack**|**node**|**clus...
[08:48:44] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Decommission CAS 6 hosts - https://phabricator.wikimedia.org/T372997#10083929 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1002 for hosts: `idp-test1002.wikimedia.org` - idp-test1002.wikimedia.org (**PASS**)...
[08:58:08] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Decommission CAS 6 hosts - https://phabricator.wikimedia.org/T372997#10083956 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1002 for hosts: `idp-test2002.wikimedia.org` - idp-test2002.wikimedia.org (**PASS**)...
[10:29:35] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10084199 (ayounsi) Disk is gone : `name=show vmhost hardware re0 re0: [...] Item Capacity Part number...
[10:31:45] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10084200 (ayounsi) a:ayounsi
[11:19:20] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084337 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[11:36:28] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 2 others: lvs2011: Move existing row C & D vlans to primary uplink and add new ones - https://phabricator.wikimedia.org/T370891#10084369 (cmooney) a:cmooney→None
[11:41:09] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084371 (cmooney)
[11:50:51] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Login attempts from bd808 get 500 on Debmonitor - https://phabricator.wikimedia.org/T369205#10084395 (SLyngshede-WMF) My attempt at simply not shipping groups failed, the group membership test is done on the "client" and not on the IDP host. We n...
[12:07:55] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095 (cmooney) NEW p:Triage→Medium
[12:08:51] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084514 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[12:09:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10084515 (cmooney)
[12:09:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084516 (cmooney)
[12:13:04] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096 (cmooney) NEW p:Triage→Medium
[12:13:20] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10084536 (cmooney)
[12:13:21] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084537 (cmooney)
[12:16:43] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097 (cmooney) NEW p:Triage→Medium
[12:16:50] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084559 (cmooney)
[12:16:51] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10084558 (cmooney)
[12:22:21] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084589 (Clement_Goubert)
[12:25:22] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101 (cmooney) NEW p:Triage→Medium
[12:25:51] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10084619 (cmooney)
[12:25:53] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084620 (cmooney)
[12:26:01] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084622 (Clement_Goubert)
[12:26:11] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10084623 (Clement_Goubert)
[12:26:44] netbox, Infrastructure-Foundations: taavi's netbox-next account is stuck - https://phabricator.wikimedia.org/T351950#10084626 (ayounsi) @taavi are you still having issues here?
[12:28:18] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102 (cmooney) NEW p:Triage→Medium
[12:28:28] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10084644 (cmooney)
[12:30:35] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103 (cmooney) NEW p:Triage→Medium
[12:31:52] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10084667 (cmooney)
[12:31:54] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084668 (cmooney)
[12:36:44] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104 (cmooney) NEW p:Triage→Medium
[12:37:06] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10084691 (cmooney)
[12:37:09] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084692 (cmooney)
[12:39:57] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105 (cmooney) NEW p:Triage→Medium
[12:40:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10084727 (cmooney)
[12:40:49] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10084726 (cmooney)
[12:50:47] If idp-test seems a little unstable it's because I'm attempting to do some filter magic on LDAP groups in the CAS config
[13:29:20] netbox, DC-Ops, Infrastructure-Foundations, SRE, Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10084880 (ayounsi) a:ayounsi Taking the task to create the validator
[13:34:33] netbox, Infrastructure-Foundations: Netbox logs filling up disk, netbox1002 - https://phabricator.wikimedia.org/T371036#10084901 (ayounsi) Open→Resolved a:ayounsi netbox1002 is gone :) Netbox 4 servers have bigger disks and getstats (which was generating them) has been replaced by a plugin.
[13:34:48] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw rack C1 from asw-c1-codfw to lsw1-c1-codfw - https://phabricator.wikimedia.org/T373095#10084905 (cmooney)
[13:35:04] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10084907 (cmooney)
[13:35:45] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C2 & C3 from asw to lsw - https://phabricator.wikimedia.org/T373096#10084911 (cmooney)
[13:36:10] netbox, Infrastructure-Foundations: Netbox: PuppetDB import script error with VMs - https://phabricator.wikimedia.org/T340190#10084915 (ayounsi) Open→Resolved a:ayounsi Probably safe to close as it has been more than a year.
[13:36:12] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C4 & C5 from asw to lsw - https://phabricator.wikimedia.org/T373097#10084920 (cmooney)
[13:36:34] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks C6 & C7 from asw to lsw - https://phabricator.wikimedia.org/T373101#10084921 (cmooney)
[13:36:41] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D1 & D2 from asw to lsw - https://phabricator.wikimedia.org/T373102#10084925 (cmooney)
[13:36:59] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D3 & D4 from asw to lsw - https://phabricator.wikimedia.org/T373103#10084926 (cmooney)
[13:37:15] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10084928 (cmooney)
[13:37:36] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10084929 (cmooney)
[14:09:14] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10085064 (cmooney)
[14:47:25] netbox, SRE-tools, Infrastructure-Foundations, Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10085226 (ayounsi) a:ayounsi
[14:48:21] netbox, SRE-tools, Infrastructure-Foundations, Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10085225 (ayounsi) Added a relevant check in the IP validator. I used the following nbshell code on Netbox next to confirm that it wo...
[14:56:47] netbox, Infrastructure-Foundations: Netbox: import from PuppetDB script creates VIP also if exists - https://phabricator.wikimedia.org/T278936#10085270 (ayounsi) In progress→Declined The changelog links have expired. I tried to reproduce the issue with other similar hosts (gitlab, gerrit, etc...
[15:00:10] netbox, Infrastructure-Foundations: Netbox: manage VRRP priorities - https://phabricator.wikimedia.org/T319301#10085284 (ayounsi) Open→Declined
[15:29:19] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085381 (cmooney)
[15:30:04] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085386 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4888a1d9-ee36-415c-a204-98c84040effe) set...
[15:30:14] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10085351 (Papaul) Open→Resolved a:Papaul
[15:30:25] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085387 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=028cbb12-db86-4824-9084-463287cc8911) set...
[15:50:49] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10085491 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host lvs2...
[16:15:57] cdanis: Renil asking in the WME slack channel if I "know if the above IPs have been added to the allow list?"
[16:16:21] I'm not sure there really is such a thing, or how exactly to respond without being incorrect
[16:16:32] there is such a thing
[16:16:38] I guess they might hit the permanent public cloud limiters?
[16:16:39] ah ok
[16:16:46] it's at the very top of one of the varnish VCLs
[16:16:46] TIL
[16:17:03] need Traffic to take care of that?
[16:17:06] (I can do it)
[16:17:16] https://gerrit.wikimedia.org/g/operations/puppet/+/435e8dd23177eb2d7173d30d1f606f7d531e7f56/modules/varnish/templates/wikimedia-frontend.vcl.erb#70
[16:18:16] cdanis: ah ok cool thanks
[16:18:26] the more recent IPs they gave us aren't there
[16:18:33] cdanis, sukhe: I can prep a patch if that makes sense?
[16:18:47] topranks: go for it; happy to roll it out for you
[16:18:51] <3
[16:18:54] (we have another patch in progress, I will do it after that)
[16:19:02] ok I'll give it a shot
[16:32:44] sukhe: thanks for the review
[16:32:57] topranks: merging shortly, will let you know. another change is rolling out currently :>
[16:33:02] I upped another patchset just now - just changing the comment so it doesn't refer to "two addresses"
[16:33:08] there is no rush I think
[16:33:09] thanks!
[16:33:16] lvs2013 looks good btw
[16:33:22] good catch!
[16:33:24] thanks <3
[16:33:30] back in service?
[16:34:27] not yet will do so shortly
[16:34:38] no worries
[16:36:08] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2014: move uplink to lsw1-d2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370897#10085669 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host lvs2013....
[17:19:26] FIRING: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:23:02] ehm
[17:23:07] oh that's old
[17:24:00] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, and 3 others: lvs2013: move uplink to lsw1-c2-codfw and connect to per-rack vlan - https://phabricator.wikimedia.org/T370927#10085767 (cmooney) a:cmooney→None All work completed, no issues to report. @Jhancock.wm @Papaul these two cr...
[17:24:26] RESOLVED: SystemdUnitFailed: sync-puppet-volatile.service on puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed