[00:03:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:04:23] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:04:23] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:32] <wikibugs>	 10netops, 06Infrastructure-Foundations: cr2-codfw - Host 0 ECC single bit parity error - https://phabricator.wikimedia.org/T371868 (10ayounsi) 03NEW p:05Triage→03Low
[07:19:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:42:50] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Test Netbox-More-Metrics plugin on Netbox 4.0 - https://phabricator.wikimedia.org/T365989#10044332 (10ayounsi) 05Open→03Resolved testing done!
[10:03:30] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879 (10cmooney) 03NEW p:05Triage→03High
[10:14:00] <btullis>	 Hello. Can anyone tell me what these Juniper alerts from last night might represent, please? https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631?_g=h@f500dc3&_a=h@262a4cd - I'm investigating an incident ( T371877 ) that may be network related, so I wonder if this could be related. The host in question (an-db1001) is in eqiad A6. Thanks.
[10:14:00] <stashbot>	 T371877: Investigate interruption to postgresql services on an-db1001 - affecting multiple Airflow instances - https://phabricator.wikimedia.org/T371877
[10:20:22] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10044547 (10cmooney)
[10:29:43] <XioNoX>	 btullis: I think that link doesn't work?
[10:33:19] <btullis>	 XioNoX: Ah, sorry. Try this one: https://logstash.wikimedia.org/goto/063568595723743e078fb7feb2c653e6
[10:36:06] <topranks>	 btullis: they are generic alerts firing - the system has N number of "yellow" alarms 
[10:36:24] <topranks>	 we need to dig deeper to see what the "yellow alarms" that generated the alerts were 
[10:38:20] <XioNoX>	 I had a quick look at the cr1-eqiad one but it's not on the router's logs
[10:40:24] <XioNoX>	 only one alert triggered in codfw - https://phabricator.wikimedia.org/T371868
[10:40:47] <XioNoX>	 but I'm a bit puzzled on why icinga reporter any yellow alarms
[10:41:25] <XioNoX>	 btullis: but realistically, no, the yellow alarms have no risk of impacting production traffic
[10:41:29] <btullis>	 OK, thanks. Anything I can do to help? I started to look at LibreNMS, but I don't know my way around it too well. We had about 40 minutes where access to the postgres services on ab-db1001 seemed shaky from multiple clients, but I can't find any evidence of a problem on the host, which is why I started looking at the network.
[10:42:49] <XioNoX>	 you can use this dashboard to filter network devices logs if needed https://logstash.wikimedia.org/app/dashboards#/view/5aec0930-6c94-11eb-b024-07c11958a85f
[10:43:40] <topranks>	 I was looking at cr2-codfw, and similar there is nothing in the logs at around that time that indicates an issue 
[10:44:11] <XioNoX>	 btullis: unfortunately that host is still on the old switches, so we have less data on the network side
[10:44:44] <XioNoX>	 we can see the drop of traffic on its network port, https://librenms.wikimedia.org/device/device=160/tab=port/port=30781/
[10:45:11] <XioNoX>	 but no signs of saturation for example
[10:47:41] <btullis>	 OK, many thanks both. That drop in traffic matches what I saw from the node_exporter as well. I'll go back to looking at postgres logs. 
[11:19:23] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:25] <wikibugs>	 10Mail, 06Infrastructure-Foundations: Updating forwarding rules for Jimmy@wikipedia.org. - https://phabricator.wikimedia.org/T371884#10044693 (10Ladsgroup) a:03Ladsgroup Hi, let me come and help you.
[12:03:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:17:56] <elukey>	 slyngs: o/
[12:18:06] <slyngs>	 \o
[12:18:10] <elukey>	 I tried to make the same sretest1001 with debmonitor-server on debmonitor1003 stopped, same result..
[12:18:36] <slyngs>	 But why?
[12:19:11] <elukey>	 my theory was that debmonitor1003 was holding some locks on the db, causing the test to fail
[12:20:09] <elukey>	 after adding some prints() to the code, it seems that it fails on some specific package,  tried to delete it from sretest1001 but then it failed on another one
[12:20:27] <slyngs>	 Django doesn't really do that, not locks at least, it could be a race condition writing to the database, but that should not happen when 1003 is down
[12:20:41] <slyngs>	 Does any packages work?
[12:20:53] <elukey>	 yeah exactly race as well, didn't mention it
[12:21:10] <elukey>	 some packages do get through afaics, most of them
[12:21:51] <slyngs>	 Okay, what happens if the client runs twice?
[12:22:14] <slyngs>	 So we delete all packages, run the client, some packages fail, then rerun... do they then work?
[12:23:29] <elukey>	 sometimes yes, but I got also two failures in a row IIRC
[12:23:43] <slyngs>	 The thinking being: We have one package installed, but an upgrade is available
[12:24:42] <slyngs>	 So the package object is created, but in the belly of the get_or_create thingy, when then show up asking for the same package, but a new version and now the package info isn't actually written to the database
[12:25:06] <elukey>	 but only on 2003? 
[12:25:10] <elukey>	 this is the part that doesn't make sense
[12:25:24] <slyngs>	 Oh no, yeah, not that's wrong :-)
[12:25:56] <elukey>	 also I was thinking earlier on that there must have been a reason to not have debmonitor.discovery.wmnet active/active
[12:26:43] <elukey>	 I wanted to try https://gerrit.wikimedia.org/r/c/operations/dns/+/1060094 but at this point I am not sure if it can ever work
[12:27:28] <slyngs>	 It would be interesting to try, but yeah I don't see why it would work
[12:29:23] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Upgrade Nebox to 4.1 - https://phabricator.wikimedia.org/T371889 (10ayounsi) 03NEW
[12:29:44] <slyngs>	 elukey: https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/1051298/3/debmonitor/hosts/views.py <- Maybe this broke it?
[12:29:51] <slyngs>	 It's just a guess
[12:30:22] <slyngs>	 elukey: I just have to do a school run, I'll be back in a bit
[12:30:22] <elukey>	 that one is not yet deployed :(
[12:30:27] <elukey>	 ack! Thanks :)
[12:30:44] <slyngs>	 It's not in 0.5?
[12:31:17] <elukey>	 it is, but we have 0.4 on 2003 and 1003
[12:32:16] <slyngs>	 Ah
[12:51:32] <wikibugs>	 10netbox, 06Infrastructure-Foundations: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890 (10ayounsi) 03NEW p:05Triage→03High
[12:56:35] <slyngs>	 est4
[12:56:45] <slyngs>	 ?
[12:56:57] <slyngs>	 Anyway, back :-)
[12:57:44] <elukey>	 :)
[12:59:04] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10044889 (10ayounsi)
[12:59:15] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#10044887 (10ayounsi) →14Duplicate dup:03T341843
[13:00:27] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Netbox rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10044912 (10ayounsi) 05Stalled→03Open p:05Medium→03High
[13:07:43] <slyngs>	 elukey: At this point I'm tempted to just switch DNS and see what happens
[13:11:57] <wikibugs>	 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10044950 (10ayounsi) 05Open→03Resolved
[13:13:59] <elukey>	 slyngs: I had the same idea but I think it would have the same failure mode no?
[13:14:30] <slyngs>	 I would vew much assume so
[13:16:18] <slyngs>	 Have you noticed that the 2003 seems much much slower
[13:16:31] <slyngs>	 I mean I get that it also fails, but even before that
[13:22:57] <elukey>	 this is a very good point
[13:23:16] <elukey>	 in theory it makes a cross-DC connection everytime
[13:23:21] <elukey>	 to get to the db
[13:23:34] <elukey>	 debmonitor2003 -> m2-master eqiad
[13:23:37] <slyngs>	 ping time for 1003 is 0.4 ms
[13:23:49] <slyngs>	 2003 is 30ms
[13:24:06] <elukey>	 yeah that explains the slowness
[13:24:40] <slyngs>	 But does it explain the eaky breakyn-ess
[13:26:03] <elukey>	 IIUC the client gets a HTTP 500, that seems to be unrelated to the speed
[13:26:16] <elukey>	 but it is indeed the only big difference that we found so far
[13:27:29] <elukey>	 I am reading https://docs.djangoproject.com/en/5.0/ref/databases/#persistent-database-connections
[13:27:48] <elukey>	 lemme test it
[13:29:03] <slyngs>	 Worth a test :-)
[13:30:35] <elukey>	 mmm not entirely sure where to put it in config.json
[13:30:53] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Netbox: Remove leftovers of CAS auth - https://phabricator.wikimedia.org/T371892 (10ayounsi) 03NEW p:05Triage→03Low
[13:30:54] <elukey>	 should I add it under MYSQL or using a DATABASE entry?
[13:31:54] <slyngs>	 Probably under the MYSQL
[13:32:36] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Netbox: Remove leftovers of CAS auth - https://phabricator.wikimedia.org/T371892#10045012 (10SLyngshede-WMF) a:03SLyngshede-WMF
[13:34:27] <slyngs>	 If the config used the python syntax, the MYSQL would be under a "DATABASE" entry anyway
[13:35:49] <elukey>	 I have a horrible suspicion
[13:35:52] <elukey>	 WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='debmonitor.discovery.wmnet', port=443): Read timed out. (read timeout=30)")': /hosts/sretest1001.eqiad.wmnet/update
[13:36:06] <slyngs>	 ooooh, the query takes to long
[13:36:15] <elukey>	 this is on the client
[13:36:31] <elukey>	 so now I think that the client gives up and causes the exception on the server side
[13:37:31] <slyngs>	 REQUEST_TIMEOUT = (3.05, 30)  # (connect, read) see https://docs.python-requests.org/en/master/user/advanced/#timeouts
[13:37:41] <slyngs>	 Bump it to 60 and try?
[13:38:58] <elukey>	 yeah I was reading the same
[13:39:49] <slyngs>	 I'm bumping it and trying
[13:40:35] <elukey>	 slyngs: lemme drop the host data first
[13:40:48] <elukey>	 okdone
[13:41:06] <elukey>	 I think we can modify the client's python directly
[13:41:11] <elukey>	 on sretest1001
[13:41:15] <slyngs>	 That's what I did
[13:41:30] <elukey>	 super
[13:41:33] <elukey>	 lemme know how it goes
[13:44:12] <slyngs>	 60 sek isn't enough, but I think you're right in the sequence of things
[13:44:43] <slyngs>	 It didn't break until the timeout
[13:44:55] <elukey>	 if you think about it, ~500 packages and 30ms each time takes a ton of time
[13:45:07] <elukey>	 and I think it is all due to the CONN_MAX_AGE
[13:45:14] <elukey>	 or, part of it
[13:45:38] <elukey>	 okok now things make more sense :D
[13:45:46] <slyngs>	 I'll just set it to 120 and try again
[13:45:51] <elukey>	 ack
[13:46:06] <slyngs>	 And start a stop watch, just to see
[13:48:02] <elukey>	 https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/debmonitor/+/refs/heads/master/debmonitor/debmonitor/settings/base.py#97
[13:48:17] <elukey>	 so MYSQL in the config is a wrapper for DATABASES
[13:48:47] <elukey>	 as second test we can modify debmonitor-server's base.py on 2003 and restart
[13:50:09] <slyngs>	 I'll just bump the timeout back down
[13:50:19] <elukey>	 did it work?
[13:50:21] <slyngs>	 It took two attempts at 120s to complete the run
[13:50:39] <slyngs>	 INFO:debmonitor:Successfully sent the full update to the DebMonitor server
[13:50:56] <slyngs>	 But two attempts and  the 120s timeout
[13:51:20] <slyngs>	 An then it can just run normally
[13:51:22] <elukey>	 ack wow
[13:51:44] <slyngs>	 Timeout is back at 30
[13:52:41] <elukey>	 testing it now
[13:52:52] <slyngs>	 self.close_at = None if max_age is None else time.monotonic() + max_age
[13:53:36] <slyngs>	 TypeError: unsupported operand type(s) for +: 'float' and 'str'
[13:53:58] <slyngs>	 Maybe try 120.0 ?
[13:54:07] <elukey>	 yes yes fixed
[13:54:20] <elukey>	 even if I am not confident it will work
[13:54:33] <elukey>	 yes exactly
[13:54:35] <elukey>	 so from https://docs.djangoproject.com/en/5.0/ref/databases/#persistent-database-connections
[13:54:50] <elukey>	 IIUC it keeps the connection held open between http requests, and we make one
[13:55:58] <elukey>	 in our code we loop through all the packages, so even if the db conn is open it may take some tens of ms to update the db
[13:56:06] <elukey>	 summing all those up gets to the 30
[13:56:11] <elukey>	 *30s quickly
[13:56:29] <elukey>	 so basically our failover host is not really good atm
[13:57:26] * elukey sigh
[13:58:16] <wikibugs>	 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002#10045152 (10SLyngshede-WMF) 05In progress→03Resolved
[13:59:15] <elukey>	 at this point we have two roads:
[13:59:42] <slyngs>	 And both leads to Rome?
[14:00:16] <elukey>	 1) we rollout debmonitor 0.5 and basically test on 1003, and rollback if anything looks wrong. Not great but debmonitor isn't that state critical, we could recover if there are errors.
[14:00:44] <elukey>	 2) we find a way to make the 2003 node more performant, even if it may take some more code (so a newer release etc..)
[14:00:48] <elukey>	 slyngs: :D
[14:01:13] <elukey>	 if we do 1), then we have to find a solution for the failover anyway, but later
[14:01:45] <slyngs>	 3: Spin up 1004 and upgrade that
[14:02:51] <slyngs>	 But do 2 anyway, because the failover isn't really useful right now
[14:02:52] <elukey>	 right we could do it, but it may be overkill (time wise) just to test the new release.. at this point 2) could be a better investment of time
[14:02:59] <slyngs>	 True
[14:04:16] <slyngs>	 If we upgrade 1003 by manually installing the deb package, not using the repo, then if it does't work we just uninstall the package and apt install debmonitor-server again
[14:05:33] <elukey>	 so 0.5 is already in apt, on 2003 I copied the 0.4 deb from the apt cache and used it to rollback
[14:05:48] <elukey>	 that worked nicely anyway, the rollback is just a dpkg -i
[14:05:53] <elukey>	 so pretty quick
[14:05:53] <slyngs>	 Question: what if we point 2003 to m2-master.codfw.wmnet, rather than m2-master.eqiad.wmnet
[14:06:11] <slyngs>	 Is that dangerous
[14:06:58] <elukey>	 in theory no, but there is no sync between eqiad and codfw afaics and we'd get into a split brain scenario
[14:07:13] <slyngs>	 Let's not do that then
[14:07:47] <elukey>	 the nature of debmonitor should, in theory, guarantee that at some point the codfw db would converge to eqiad (after say a couple of days of traffic)
[14:09:24] <elukey>	 and we'd be really dc independent, we are not now
[14:09:58] <elukey>	 for example, in a DC switchover we'd really switch DC for debmonitor too
[14:11:39] <elukey>	 kinda like this option
[14:17:05] <elukey>	 slyngs: I'd be in favor of upgrading 1003 and in case rollback, an architect a new 2003 config as second step. wdyt?
[14:18:10] <slyngs>	 What would the new config entail ?
[14:18:19] <slyngs>	 Otherwise: Yes
[14:19:13] <elukey>	 slyngs: still to be understood, I like your idea about using the local codfw db, but then the failover procedure would need to be investigated
[14:19:46] <slyngs>	 I feel it fairly safe to upgrade
[14:20:05] <elukey>	 ack doing so, I am not proud of it but this is blocking also the rollout of the new client etc..
[14:20:12] <elukey>	 proceeding, thanks a lot for the help
[14:20:21] <slyngs>	 Anytime
[14:21:04] <slyngs>	 I kinda wonder if the whole debmonitor-server processing should be a background process, rather than attempting to do it all in a single request
[14:25:31] <elukey>	 this is a good point
[14:25:47] <elukey>	 or at least, using the ORM to update all at once, not package-by-package
[14:26:07] <elukey>	 ok debmonitor-server rolled out, and our test worked fine
[14:26:13] <slyngs>	 Nice
[14:26:27] <elukey>	 I'll leave it running for today, and tomorrow I'll start to rollout the new debmonitor-client
[14:26:38] <elukey>	 and now I am going to open a task with our discoveries..
[14:26:48] <elukey>	 slyngs: I owe you a big one!
[14:27:22] <slyngs>	 Well you figured it out :-)
[14:27:52] <elukey>	 couldn't do it without you, I'd still be banging my head against the wall :)
[15:06:42] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10045418 (10elukey) The issue is described in T371899. I proceeded anyway to upgrade both debmonitor server hosts, all good so far.  Next step: upgrade the debm...
[15:06:43] <elukey>	 created https://phabricator.wikimedia.org/T371899 to summarize what happened
[15:14:02] <wikibugs>	 10netbox, 06Infrastructure-Foundations: Upgrade Netbox to 4.1 - https://phabricator.wikimedia.org/T371889#10045464 (10Aklapper)
[15:19:23] <jinxer-wm>	 FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:57:09] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10045702 (10cmooney) Just to update on the situation things remain stable since the changes earlier on. ` cmooney@cloudsw1-d5-eqiad> show bgp summary | match "^[0-9]"     10.64....
[16:03:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:58:20] <wikibugs>	 10netops, 06Infrastructure-Foundations, 06SRE, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10046062 (10ssingh) 05Open→03Resolved We have upgraded all DNS boxes, Wikimedia DNS and durum hosts to the latest version of anycast-he...
[17:54:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:54:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:48] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:55:41] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:43] <jinxer-wm>	 FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:17:12] <wikibugs>	 10CAS-SSO, 10Beta-Cluster-Infrastructure, 10Bitu, 06cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10046623 (10bd808) Here is the developer account record: ` $ ldap uid=jenkins-deploy dn: uid=jenkins-deploy,ou=people,dc=wik...
[20:41:31] <wikibugs>	 10CAS-SSO, 10Beta-Cluster-Infrastructure, 10Bitu, 06cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10046726 (10Dzahn) Given that users are always supposed to use different keys for prod vs cloud, should the system user also...
[20:54:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:23] <jinxer-wm>	 FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:03] <jinxer-wm>	 FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange