[02:04:19] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:05:32] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:47:34] is anyone using sretest1003, I'd like to use it to investigate https://phabricator.wikimedia.org/T304483 a bit more
[08:55:07] not me :)
[08:58:15] me neither
[09:05:10] hmmm
[09:06:44] SRE-tools, DC-Ops, Infrastructure-Foundations: sre.hardware.upgrade-firmware fails with "unable to extract version" - https://phabricator.wikimedia.org/T355649 (ayounsi)
[09:06:48] https://phabricator.wikimedia.org/T355649
[09:12:07] XioNoX: the _02 part I think is not "allowed" by the current regex
[09:13:07] yeah, looking into it
[09:13:12] no reason not to accept it
[09:13:35] sure but you need to know which version you should extract
[09:14:00] what do you mean?
[09:14:14] that regex extracts a version that is then used
[09:14:21] used how?
[09:14:27] dunno
[09:17:25] it compares it with what's currently running
[09:18:12] so you need to extract something comparable
[09:19:22] the BIOS regex excludes it
[09:19:36] 'BIOS': r'(?P<version>(\d{1,2}\.){2}\d{1,2})(?:_\d+)?$',
[09:20:07] that includes it
[09:20:11] 22.71.3_02
[09:20:17] not in $version
[09:21:43] right
[09:21:59] so then how to compare 22.71.3_02 with 22.71.3_03?
[09:23:40] if we decided to not compare it for bios, I guess it's fine to do the same for NICs?
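The point made above, that the `_02` suffix is matched by the pattern but left out of the captured version, can be illustrated with a minimal sketch. It assumes the capture group is named `version` (inferred from the "$version" remark) and that the pattern is applied with `re.search`; the helper names `extract_version` and `version_key` are hypothetical and are not taken from the cookbook.

```python
import re

# Assumption: the named group is "version" and the suffix (_02, _03, ...)
# sits in a non-capturing group, so it is matched but not extracted.
BIOS_PATTERN = re.compile(r'(?P<version>(\d{1,2}\.){2}\d{1,2})(?:_\d+)?$')


def extract_version(text):
    """Return the dotted version without its _NN suffix, or None if no match."""
    match = BIOS_PATTERN.search(text)
    if match is None:
        return None
    return match.group('version')


def version_key(version):
    """Turn '22.71.3' into (22, 71, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


if __name__ == '__main__':
    for raw in ('22.71.3', '22.71.3_02', '22.71.3_03'):
        print(raw, '->', extract_version(raw))
```

Under these assumptions `22.71.3_02` and `22.71.3_03` both reduce to the same extracted version, so a comparison against the running firmware would treat them as equal. That is the trade-off being weighed in the discussion.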
[09:24:39] Ideally I'd like to just fix that bug, and not re-investigate the whole thing
[09:26:25] :D
[09:26:40] looking at version history there has never been a _01, _02 with the same version
[09:26:51] ok
[09:27:16] for example there is Network_Firmware_3JHHP_WN64_22.71.11.13_01.EXE for NetXtreme-E and Network_Firmware_4JJW6_WN64_22.71.3_02.EXE for NetXtreme
[09:27:34] ack
[09:31:15] SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: sre.hardware.upgrade-firmware fails with "unable to extract version" - https://phabricator.wikimedia.org/T355649 (ayounsi)
[09:31:31] volans: https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/992365
[09:33:11] +1ed
[09:33:15] thx!
[09:34:49] XioNoX: given you're on the topic, I wonder if we should move existing cached files from cumin1001 to 1002
[09:35:39] volans: nononononon, I'm just fixing a tiny bug, I'm not the owner of the thing
[09:35:42] :)
[09:35:56] too late
[09:36:19] if it's just a cache I'd say we don't have to care
[09:36:36] yes but, older versions are not shown automatically
[09:36:50] and we might need them because we do downgrade 10G nics for example
[09:36:57] and it's an scp away
[09:37:07] then yes :)
[09:37:12] or at least just that one
[09:37:53] can't harm to move them I'd say, I don't think we'd have any there we haven't had a legit need for?
[09:39:07] but they can be unnecessary now and just eating disk space
[09:39:34] 8.2G /srv/firmware/
[09:39:50] /dev/vda1 ext4 78G 13G 61G 18% /
[09:43:26] volans https://www.irccloud.com/pastebin/cC8lanDo/
[09:44:50] SRE-tools, DC-Ops, Infrastructure-Foundations, Patch-For-Review: sre.hardware.upgrade-firmware fails with "unable to extract version" - https://phabricator.wikimedia.org/T355649 (ayounsi) Open→Resolved a:ayounsi
[09:45:49] XioNoX: cumin1002 can't connect to sretest1003
[09:45:54] I said go just in case, let's see what happens
[09:46:10] $ telnet sretest1003.eqiad.wmnet 22
[09:46:10] Trying 2620:0:861:101:10:64:0:13...
[09:48:33] XioNoX, topranks: FYI capirca script in netbox is timing out: Task exceeded maximum timeout value (300 seconds)
[09:48:50] hmm
[09:48:58] did anything change?
[09:48:58] yeah we discussed it earlier today
[09:49:10] it worked for you when you ran it XioNoX?
[09:49:26] yeah, but failed twice for taavi
[09:49:37] and the daily check
[09:49:53] volans: for the daily check it's just bad timing
[09:50:24] but not sure why the capirca script is in the picture right now
[09:50:44] doesn't impact connectivity between cumin1002 and sretest1003 on their prod interfaces
[09:50:58] nothing, totally unrelated thing
[09:51:20] volans: should I just reboot sretest1003 then to pick up the firmware upgrade?
[09:52:01] check the cookbook code, not sure what it does next and if a cold reboot (from idrac) is needed (as opposed to OS reboot)
[09:52:23] but the cold reboot from idrac would have worked
[09:53:45] looking at nftables on sretest1003 it's supposed to allow ssh from cumin1002
[10:05:18] The manual reboot fixed the SSH issue
[10:05:32] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:08:33] https://konecipv4.cz/en/ "On the basis of this decision, the Czech state administration will stop providing its services over IPv4 on 6 June 2032. Thus, the Czech Republic knows its IPv4 shutdown date."
[10:18:13] huh
[10:18:29] that's kind of cool actually, like it will 100% force ISPs in the country to provide IPv6
[10:18:52] this site or that site could go v6 only - and ISPs could ignore - but govt. services access is a must
[10:20:49] yeah, the only downside is for people abroad (and no v6) trying to reach gov services
[10:20:57] but it's indeed great that they do it
[10:28:25] sorry was in a meeting
[10:29:06] XioNoX: weird for the ssh and reboot
[10:29:09] upgrade didn't work...
[10:29:14] I left a comment for DCops https://phabricator.wikimedia.org/T304483#9480247
[10:29:45] saw now
[10:29:47] hmm yeah that last point is a thing - stuck in a life threatening situation in some far-flung country and you can't email the embassy about your passport replacement ;)
[10:30:06] XioNoX, volans: so I was getting this error trying to run Homer for the CRs on cumin1002
[10:30:30] https://www.irccloud.com/pastebin/onsdCBpQ/
[10:30:47] The "last run status" of the script in Netbox was OK when I checked
[10:31:01] But I clicked to run anyway to see if it made a difference, and it timed out on me :(
[10:31:07] topranks: I ran it recently, so maybe it's just a race condition?
[10:31:52] XioNoX: yeah the homer error possibly I ran while you were
[10:31:56] let me try running it again
[10:32:23] RQ seems to be easily overwhelmed
[10:32:34] RQ?
[10:33:17] oops, I mean Redis
[10:33:29] ok yep you mentioned that earlier
[10:33:43] yeah it ran clean just now (script in netbox)
[10:33:45] let me try homer
[10:35:28] Homer worked fine that time - I assume I just ran while you were re-running script the last time
[12:26:49] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=757944ae-dc8f-4433-9c0c-e68dc04b371b) set by kamila@cumin1002 for 4:00:...
[12:29:32] SRE-tools, Infrastructure-Foundations, Patch-For-Review: Reimage cookbook fails to downtime hosts when run concurrently - https://phabricator.wikimedia.org/T355187 (kamila) Open→Resolved I believe the above patch fixed it, so I'm closing this. I will reopen in case I see the race again.
[14:05:32] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:19:20] (SystemdUnitFailed) firing: (2) prometheus-ganeti-exporter.service Failed on ganeti1037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:34:00] Puppet, SRE-tools, Infrastructure-Foundations, SRE, and 4 others: Forward port Python2 files to Python3 in Puppet Repository - https://phabricator.wikimedia.org/T247364 (SLyngshede-WMF)
[14:39:19] (SystemdUnitFailed) firing: (2) prometheus-ganeti-exporter.service Failed on ganeti1037:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:04:19] (SystemdUnitFailed) firing: (2) update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:18:57] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui)
[15:19:57] netops, Data-Persistence, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack B5 from asw-b5-codfw to lsw1-b5-codfw - https://phabricator.wikimedia.org/T355549 (Marostegui)
[16:41:55] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (Gehel)
[17:04:20] (SystemdUnitFailed) firing: (2) update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:05:32] (SystemdUnitFailed) firing: (2) update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[17:26:53] looking at the tails issue ^
[21:05:32] (SystemdUnitFailed) firing: (2) update-tails-mirror.service Failed on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed