[08:29:27] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:49:52] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Decommission CAS 6 hosts - https://phabricator.wikimedia.org/T372997#10090802 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1002 for hosts: `idp2003.wikimedia.org` - idp2003.wikimedia.org (**PASS**) - Downtim...
[09:08:09] netbox, Infrastructure-Foundations: Netbox: basic change rollback - https://phabricator.wikimedia.org/T310589#10090992 (ayounsi) I had a try at this. See attached screenshot for using the "offline device" script, then the "revert" script using the request ID. {F57294424} {F57294423} The "before/after...
[09:27:19] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Decommission CAS 6 hosts - https://phabricator.wikimedia.org/T372997#10091074 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1002 for hosts: `idp1003.wikimedia.org` - idp1003.wikimedia.org (**PASS**) - Downtim...
[09:29:27] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:37:15] CAS-SSO, Infrastructure-Foundations, Patch-For-Review: Decommission CAS 6 hosts - https://phabricator.wikimedia.org/T372997#10091123 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by slyngshede@cumin1002 for hosts: `idp-test1003.wikimedia.org` - idp-test1003.wikimedia.org (**PASS**)...
[09:57:38] Hi.
Some of the packages present in the Docker registry are not visible in Debmonitor: a search for docker-registry.wikimedia.org/repos or docker-registry.wikimedia.org/releng comes up blank. Is this deliberate or do those prefixes need to be opted-in somehow?
[09:59:18] Probably not deliberate, elukey should be around in a few hours, so maybe we can trick him into taking a look.
[10:00:08] XioNoX: that 'revert' script looks interesting!
[10:00:12] ambitious!
[10:00:32] overall if it can help even in some scenarios seems like a good idea, I'll take a closer look later on
[10:00:47] yeah it started very ambitious, and was quite fun to write
[10:02:41] but then I started to discover the various limitations and drawbacks based on the changelog data itself
[10:24:27] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:27] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx-in2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:31:18] Mail, collaboration-services, Infrastructure-Foundations, vrts: generate_vrts_aliases failing on mx-in1001 - https://phabricator.wikimedia.org/T368257#10091458 (ayounsi) It would be useful to know why it failed (maybe on the server's logs?), but +1 to adding a retry logic regardless.
[11:52:36] netbox, DC-Ops, Infrastructure-Foundations, SRE, Patch-For-Review: sre.hardware.upgrade-firmware cookbook: product slug parsing - https://phabricator.wikimedia.org/T348036#10091553 (ayounsi) Deployed on netbox-next and tests seem all good.
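On the retry suggestion in T368257 (11:31:18): a minimal sketch of what retry with exponential backoff around a flaky step could look like. This log does not show how generate_vrts_aliases gets its data or why it fails, so `fetch_aliases`, the attempt count and the delays below are all hypothetical stand-ins, not the actual service code.

```python
import time


def retry(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on any exception wait and try again, up to `attempts` times.

    The wait doubles after each failure (base_delay, 2x, 4x, ...).
    If the final attempt also fails, its exception is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Hypothetical stand-in for the step that fails intermittently:
# it fails twice, then succeeds.
calls = {"n": 0}


def fetch_aliases():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return ["alias-a", "alias-b"]


result = retry(fetch_aliases, attempts=3, base_delay=0.01)
```

A retry like this only hides transient failures, so logging each failed attempt would still be needed to answer the "why did it fail" part of the comment.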
[12:15:18] netbox, SRE-tools, Infrastructure-Foundations, Patch-For-Review: netbox: decided how to deal with blank mgmt dns_names - https://phabricator.wikimedia.org/T339121#10091611 (ayounsi) Open→Resolved Validator deployed.
[12:31:32] sobanski: o/
[12:34:09] so debmonitor uses a special client on build2001 that is called docker-report. It fetches the catalog from the registry, applies some filters, and then reports the last tag of every docker image that passes the filters
[12:36:39] we don't really exclude releng's images
[12:36:46] but there are a few caveats
[12:37:02] 1) At the moment we support only debian bullseye+ images
[12:37:28] 2) (this is the part that I still don't know very well) from time to time we do purge the reports of old images
[12:38:40] we can open a task in case
[12:42:01] Thanks for the explanation, I'll open a task
[12:43:20] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091725 (ABran-WMF) preparation job with the first few critical instances on the path is done for now. I'll have a few host to mo...
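The flow described at 12:34:09 (fetch the catalog, apply filters, report the last tag per image) can be sketched roughly as below. This is not docker-report's code: the catalog is a hypothetical in-memory dict, the image names and the `not_scratch` filter are invented for illustration, and "last tag" is assumed to mean the final entry of an oldest-first tag list.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for a registry catalog:
# image name -> list of tags, oldest first (an assumption).
Catalog = Dict[str, List[str]]


def select_reports(
    catalog: Catalog, filters: Iterable[Callable[[str], bool]]
) -> Dict[str, str]:
    """Return {image: last tag} for every image accepted by all filters."""
    filters = list(filters)
    reports: Dict[str, str] = {}
    for image, tags in sorted(catalog.items()):
        if not tags:
            continue  # nothing to report for an image without tags
        if all(accept(image) for accept in filters):
            reports[image] = tags[-1]
    return reports


# Invented example data; none of these names come from the real registry.
catalog = {
    "releng/example-ci": ["1.0", "1.1"],
    "repos/example/app": ["2024-08-01"],
    "scratch/tmp": ["latest"],
    "empty/image": [],
}


def not_scratch(name: str) -> bool:
    """Hypothetical filter: skip anything under scratch/."""
    return not name.startswith("scratch/")


reports = select_reports(catalog, [not_scratch])
```

In a model like this, an image is missing from the reports either because a filter rejected it or because it has no tags, which is the kind of question the task mentioned at 12:42:01 would need to answer for the repos and releng prefixes.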
[12:56:17] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091780 (ABran-WMF) this task depends on: T373175
[12:57:18] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D7 & D8 from asw to lsw - https://phabricator.wikimedia.org/T373105#10091785 (ABran-WMF)
[12:57:26] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091786 (ABran-WMF)
[12:59:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate servers in codfw racks D5 & D6 from asw to lsw - https://phabricator.wikimedia.org/T373104#10091794 (ABran-WMF)
[12:59:31] netops, DC-Ops, Infrastructure-Foundations, ops-codfw, SRE: Migrate codfw servers in rows C & D from legacy ASW to LSW - https://phabricator.wikimedia.org/T370630#10091795 (ABran-WMF)
[13:21:50] someone around for an easy +1? https://gerrit.wikimedia.org/r/c/operations/puppet/+/1066770
[13:22:02] * topranks looking
[13:22:29] +1
[13:22:30] thx!
[13:22:36] not sure why some of them are so specific good call
[13:36:09] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10091977 (Clement_Goubert) Open→In progress
[13:53:22] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092081 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:02:52] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092094 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:08:11] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092103 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:17:00] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092168 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:24:08] elukey: `python3-pynetbox is already the newest version (7.4.0).` so I guess I just had to wait for "something" to refresh? :)
[14:24:35] Mail, Infrastructure-Foundations, MediaWiki-Email, SRE: Old "Email this user" email is repeatedly resent - https://phabricator.wikimedia.org/T361860#10092204 (jhathaway) Open→Resolved a:jhathaway @Xover, I am going to assume this is no longer occurring, please reopen, if it occurs...
[14:25:36] slyngs: https://wikitech.wikimedia.org/wiki/Debian_packaging_with_dgit_and_CI
[14:27:58] netops, DC-Ops, Infrastructure-Foundations, ops-eqiad: cr1-eqiad: disk failure - https://phabricator.wikimedia.org/T372781#10092235 (ayounsi)
[14:39:56] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092313 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:44:39] Puppet, Infrastructure-Foundations, Release-Engineering-Team: Puppet git::clone should default mode to 0644 (read-only) instead of 0755 - https://phabricator.wikimedia.org/T371980#10092329 (joanna_borun) p:Triage→Low
[14:44:39] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092328 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[14:46:19] netbox, Infrastructure-Foundations, Patch-For-Review: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890#10092334 (ayounsi) Open→Resolved a:ayounsi It's all good now, I guess we just had to wait a little bit. I updated the Netbox doc so we know...
[14:46:35] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092348 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:48:35] Mail, Infrastructure-Foundations, SRE, Wikimedia-Mailing-lists, Security: Sender email spoofing - https://phabricator.wikimedia.org/T160529#10092353 (joanna_borun) p:High→Medium
[14:53:58] Mail, Infrastructure-Foundations, SRE: exim should log the reason for defer with disconnect after HELO/EHLO - https://phabricator.wikimedia.org/T265142#10092368 (jhathaway) Open→Declined We have moved to Postfix for ingress and egress, so declining.
[14:54:19] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092370 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[14:55:04] CAS-SSO, Infrastructure-Foundations, Security-Team, SRE: SSO kill switch for crucial services - https://phabricator.wikimedia.org/T233938#10092374 (joanna_borun) p:Medium→Low
[14:55:14] Mail, Infrastructure-Foundations, SRE, Wikimedia-Mailing-lists: Email to WikimediaUA mailing list from base-w[at]yandex.ru does not get delivered - https://phabricator.wikimedia.org/T247603#10092376 (jhathaway) @Base is this issue still ongoing?
[14:56:00] CAS-SSO, Infrastructure-Foundations, Security-Team, SRE: CAS Single Logout Flow - https://phabricator.wikimedia.org/T233941#10092382 (SLyngshede-WMF) a:SLyngshede-WMF
[14:57:16] CAS-SSO, Infrastructure-Foundations, Security-Team, SRE: Maintain session history / audit log - https://phabricator.wikimedia.org/T233942#10092389 (SLyngshede-WMF) p:Medium→Low a:SLyngshede-WMF
[14:58:48] Mail, Infrastructure-Foundations, Observability-Alerting, SRE: Fix paniclog alert to only sent mails once - https://phabricator.wikimedia.org/T257016#10092392 (jhathaway) Open→Declined Since we have migrated to Postfix, and Postfix doesn't have a panic log, declining.
[14:58:50] SRE-tools, Infrastructure-Foundations, SRE, Python3-Porting: Puppet tox: properly lint both Py2 and Py3 files - https://phabricator.wikimedia.org/T184435#10092397 (elukey)
[15:00:24] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092408 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:05:39] SRE-tools, Infrastructure-Foundations: Better detection for "reboot into PXE failed" conditions in wmf-auto-reimage - https://phabricator.wikimedia.org/T261956#10092436 (joanna_borun) Open→Declined
[15:11:13] SRE-tools, Cloud-VPS, Infrastructure-Foundations: Update offboard-user script to use Keystone API - https://phabricator.wikimedia.org/T306788#10092464 (SLyngshede-WMF) a:SLyngshede-WMF
[15:15:17] SRE-tools, Icinga, Infrastructure-Foundations, observability, SRE: Icinga paged for a host that should have been downtimed - https://phabricator.wikimedia.org/T309447#10092518 (joanna_borun) Open→Resolved
[15:21:24] XioNoX: where did you get that message?
[15:21:57] elukey: that's trying sudo apt install python3-pynetbox on cumin
[15:22:14] last week it was asking me if I wanted to upgrade to 6.6.0 from 7.4.0
[15:22:19] anyway, problem solved :)
[15:22:44] after installing the 7.4.0 deb manually?
[15:23:09] (trying to understand where the inconsistency was)
[15:23:39] you can always use `apt-cache policy $packagename` to see what apt suggests
[15:23:53] yeah exactly
[15:23:54] new versions, where they are taken from etc..
[15:24:09] and after adding 7.4.0 to apt
[15:24:15] thx I didn't know about that command!
[15:24:16] in theory it shouldn't have done it, maybe an apt-get update was needed?
[15:24:52] it is really useful, especially if you want to know where the package is taken from (like apt components etc., and their priorities)
[15:32:52] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092677 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[15:37:31] slyngs: that sounds too easy https://gerrit.wikimedia.org/r/1066799 :)
[15:40:49] jayme: ^ you might be interested too.
Do you have an example dashboard that uses the drbd metrics? Ideally something to duplicate for Ganeti
[15:52:03] XioNoX: wong J.aime? 🤔
[15:52:08] *wrong
[16:01:33] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092863 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[16:01:54] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092864 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
[16:01:58] XioNoX: Do we know if it works? But seems straightforward
[16:05:21] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10092871 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host w...
[16:36:33] jayme: I think it is you, we were reviewing old tasks and one came up with you asking for drbd metrics on ganeti nodes :)
[16:37:03] ah, I recall. Probably because of k8s etcd
[16:37:43] 🤦 didn't look at the task, sorry
[16:49:14] so no, unfortunately I don't have a DRBD dashboard at hand. I think I just came across missing data while digging into some issue
[16:52:55] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093070 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host wikik...
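On the `apt-cache policy` tip from 15:23:39: a small sketch of reading the installed and candidate versions out of that command's output, which is one way to spot the kind of mismatch discussed there. The `SAMPLE` text is hand-written in the usual layout, with a placeholder repository URL and made-up pin priorities; it was not captured from cumin. `parse_policy` and `policy` are illustrative helper names, not an existing tool.

```python
import re
import subprocess


def parse_policy(text):
    """Extract the Installed and Candidate versions from `apt-cache policy` output."""
    fields = {}
    for key in ("Installed", "Candidate"):
        match = re.search(rf"^[ \t]*{key}:[ \t]*(\S+)", text, flags=re.MULTILINE)
        fields[key.lower()] = match.group(1) if match else None
    return fields


def policy(package):
    """Run `apt-cache policy <package>` and parse it (needs a Debian-like host)."""
    out = subprocess.run(
        ["apt-cache", "policy", package],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_policy(out)


# Hand-written sample: placeholder URL and priorities, not real cumin output.
SAMPLE = """\
python3-pynetbox:
  Installed: 7.4.0
  Candidate: 7.4.0
  Version table:
 *** 7.4.0 500
        500 http://apt.example.org/debian example/main amd64 Packages
        100 /var/lib/dpkg/status
"""

info = parse_policy(SAMPLE)
needs_change = info["installed"] != info["candidate"]
```

When the two versions differ, the candidate is what apt would move to, and the priorities listed in the version table show which source won, so the full output is still worth reading by hand.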
[16:57:14] hey folks, for awareness - there are ongoing alerts for multiple puppetmasters related to port 8141, I opened https://phabricator.wikimedia.org/T373369
[16:57:22] it is an alerting issue, nothing exploding afaics
[19:13:43] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093775 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by kamila@cumin1002 from kubernetes20...
[19:14:48] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093792 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wik...
[19:59:55] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10093998 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikub...
[21:09:53] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094187 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.rename started by swfrench@cumin2002 from kubernetes...
[21:17:25] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094193 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by swfrench@cumin2002 for host w...
[22:05:56] netops, Infrastructure-Foundations, serviceops, SRE, Patch-For-Review: Re-IP wikikube servers in codfw row A/B moving to per-rack subnets - https://phabricator.wikimedia.org/T372878#10094339 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by swfrench@cumin2002 for host wikik...
[22:33:40] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:31:10] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed