[00:02:13] (SystemdUnitFailed) firing: (3) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:03:44] (SystemdUnitFailed) firing: (14) dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:07:15] (SystemdUnitFailed) firing: (14) dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:45] (SystemdUnitFailed) firing: (14) dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:26:00] netops, Infrastructure-Foundations, SRE, ops-codfw: cr2-codfw:xe-1/0/1:1 down - https://phabricator.wikimedia.org/T353256 (ayounsi) Open→Resolved a:ayounsi > Dear Customer, > A patch that was incorrectly connected/labelled and the tech fixed it.
[08:45:26] netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (ops-monitoring-bot) Draining ganeti5007.eqsin.wmnet of running VMs
[09:28:33] netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (ops-monitoring-bot) Draining ganeti5006.eqsin.wmnet of running VMs
[09:45:47] netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (ops-monitoring-bot) Draining ganeti5005.eqsin.wmnet of running VMs
[10:48:01] SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (taavi)
[10:52:44] netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (ops-monitoring-bot) Draining ganeti5004.eqsin.wmnet of running VMs
[11:31:15] netops, Ganeti, Infrastructure-Foundations: prometheus5002 unable to ping ipv6 ganeti500[74] eqsin - https://phabricator.wikimedia.org/T353254 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This has been fixed, please reopen if you run into other network issues with the eqsin Ganet...
[12:08:51] (SystemdUnitFailed) firing: (14) dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:17:46] I've forced a run, it failed tonight because of a failed connection
[12:22:14] (SystemdUnitFailed) firing: (14) dump_cloud_ip_ranges.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:02:54] any idea what's up with https://alerts.wikimedia.org/?q=alertname%3DPuppetPendingCertificateRequest&q=cluster%3Dpuppet&q=team%3Dsre&q=%40receiver%3Ddefault ? 10.194.62.1 is in https://netbox.wikimedia.org/ipam/prefixes/534/ which shouldn't have any real servers
[14:04:09] moritzm: not sure if known but this alerts about not being healthy https://alerts.wikimedia.org/?q=alertname%3DSmartNotHealthy&q=cluster%3Dganeti&q=team%3Dsre&q=%40receiver%3Ddefault
[14:18:11] I'll open a DC ops ticket, it's fairly new and still under warranty
[14:21:13] the cert could have been some sort of misconfiguration and the CSR was never signed maybe
[16:22:15] (SystemdUnitFailed) firing: (13) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:05:16] I'm getting PuppetFailed alerts on a host that's insetup (elastic1107)...does anyone know if this is expected?
[18:07:57] Puppet, Wikimedia Meet: Puppetize the jitsi instance - https://phabricator.wikimedia.org/T251040 (Ladsgroup) Open→Declined Wikimedia Meet has been retired
[18:09:07] inflatador: AFAIK the insetup roles set hieradata profile::monitoring::notifications_enabled that in turn sets the notifications enabled bit in Icinga to 0 and hence disables them on icinga, but I'm not sure if alertmanager has the same mechanism or is able to read the same thing
[18:09:24] as for puppet on that host I can see it is failing multiple times: https://puppetboard.wikimedia.org/node/elastic1107.eqiad.wmnet
[18:10:31] volans interesting. We have several hosts (110[4-7] or something like that) that are still in DC Ops' care...not sure why that one is the only one complaining. None of them were accessible via SSH last time I tried
[18:11:11] that said hosts in insetup should work fine and not have this kind of failure
[18:11:19] so if it's failing something is not right
[18:11:48] I'll defer to o11y for the support of profile::monitoring::notifications_enabled for alertmanager
[18:11:59] and to dcops for the status of these specific hosts
[18:12:08] sorry, gotta go right now
[18:13:04] No worries, have a good one ;)
[19:14:16] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (cmooney) p:Triage→Low
[19:19:42] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (cmooney)
[19:23:42] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (cmooney) Also - we should change the regexp to also catch "et-" prefixes for 25G interfaces ` REGEXP "^(g|x)e-[0-9]+/[0-9]+/[0-9]+" `
[20:07:51] v-olans FYI, I asked in #observability and they said that AM is not affected by insetup or profile::monitoring::notifications_enabled
[20:16:39] Puppet, Instrument-ClientError: Google Translate and other translate services triggering client error alert - https://phabricator.wikimedia.org/T351738 (colewhite) Patch is merged. I see a corresponding [[ https://grafana-rw.wikimedia.org/explore?orgId=1&left=%7B%22datasource%22:%22000000026%22,%22queri...
[20:23:46] (SystemdUnitFailed) firing: (13) geoip_update_main.service Failed on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:38:09] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (cmooney)
[20:45:40] SRE-tools, Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (Dzahn)
[20:46:33] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (Dzahn)
[21:16:19] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (ayounsi) More or less a duplicate of {T306007}
[21:20:41] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (cmooney)
[21:20:53] netbox, netops, DC-Ops, Infrastructure-Foundations, SRE: Avoid ghost hosts on the network - https://phabricator.wikimedia.org/T306007 (cmooney)
[21:22:12] netops, Infrastructure-Foundations, SRE: Adjust "port with no description on access switch" alert - https://phabricator.wikimedia.org/T353364 (cmooney) Ah yeah I'd forgotten about that one. What do you think about changing the alert text? I'm sure after investigating today I'll remember the details...
[21:50:02] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.ganeti.GanetiError: Error while performing request to RAPI - https://phabricator.wikimedia.org/T353379 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff You need to run Ganeti-related cookbooks from cumin2002 until cumin1001...
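
Note on the T353364 regexp comment at 19:23:42: the quoted pattern only matches "ge-" and "xe-" interface names, and the suggestion was to also catch "et-". The log does not give the final pattern, so the following is only a rough sketch of one way the extension might look. It uses Python's `re` as a stand-in for whatever REGEXP engine the alert really runs on (an assumption). `PROPOSED` and the sample names "ge-0/0/1", "et-0/0/48" and "ae0" are hypothetical; "xe-1/0/1:1" is taken from T353256 above.

```python
import re

# Pattern quoted in the T353364 comment: matches "ge-" and "xe-" names only.
CURRENT = re.compile(r"^(g|x)e-[0-9]+/[0-9]+/[0-9]+")

# Hypothetical extension (not from the log) that also accepts the "et-" prefix.
PROPOSED = re.compile(r"^((g|x)e|et)-[0-9]+/[0-9]+/[0-9]+")


def matches(pattern: re.Pattern, name: str) -> bool:
    """Return True if the interface name matches the pattern at its start."""
    return pattern.search(name) is not None


if __name__ == "__main__":
    # Sample interface names; all but "xe-1/0/1:1" are made up for illustration.
    for name in ("ge-0/0/1", "xe-1/0/1:1", "et-0/0/48", "ae0"):
        print(name, matches(CURRENT, name), matches(PROPOSED, name))
```

Both patterns are anchored only at the start, so a channelized name such as "xe-1/0/1:1" still matches on its prefix.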