[00:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[00:04:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[04:04:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:24:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:47:32] 10netops, 06Infrastructure-Foundations: cr2-codfw - Host 0 ECC single bit parity error - https://phabricator.wikimedia.org/T371868 (10ayounsi) 03NEW p:05Triage→03Low
[07:19:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[08:42:50] 10netbox, 06Infrastructure-Foundations: Test Netbox-More-Metrics plugin on Netbox 4.0 - https://phabricator.wikimedia.org/T365989#10044332 (10ayounsi) 05Open→03Resolved testing done!
[10:03:30] 10netops, 06Infrastructure-Foundations, 06SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879 (10cmooney) 03NEW p:05Triage→03High
[10:14:00] Hello. Can anyone tell me what these Juniper alerts from last night might represent, please? https://logstash.wikimedia.org/app/dashboards#/view/8b1907c0-2062-11ec-85b7-9d1831ce7631?_g=h@f500dc3&_a=h@262a4cd - I'm investigating an incident ( T371877 ) that may be network related, so I wonder if this could be related. The host in question (an-db1001) is in eqiad A6. Thanks.
[10:14:00] T371877: Investigate interruption to postgresql services on an-db1001 - affecting multiple Airflow instances - https://phabricator.wikimedia.org/T371877
[10:20:22] 10netops, 06Infrastructure-Foundations, 06SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10044547 (10cmooney)
[10:29:43] btullis: I think that link doesn't work?
[10:33:19] XioNoX: Ah, sorry. Try this one: https://logstash.wikimedia.org/goto/063568595723743e078fb7feb2c653e6
[10:36:06] btullis: they are generic alerts firing - the system has N number of "yellow" alarms
[10:36:24] we need to dig deeper to see what the "yellow alarms" that generated the alerts were
[10:38:20] I had a quick look at the cr1-eqiad one but it's not in the router's logs
[10:40:24] only one alert triggered in codfw - https://phabricator.wikimedia.org/T371868
[10:40:47] but I'm a bit puzzled as to why icinga reported any yellow alarms
[10:41:25] btullis: but realistically, no, the yellow alarms have no risk of impacting production traffic
[10:41:29] OK, thanks. Anything I can do to help? I started to look at LibreNMS, but I don't know my way around it too well. We had about 40 minutes where access to the postgres services on an-db1001 seemed shaky from multiple clients, but I can't find any evidence of a problem on the host, which is why I started looking at the network.
[10:42:49] you can use this dashboard to filter network device logs if needed https://logstash.wikimedia.org/app/dashboards#/view/5aec0930-6c94-11eb-b024-07c11958a85f
[10:43:40] I was looking at cr2-codfw, and similarly there is nothing in the logs at around that time that indicates an issue
[10:44:11] btullis: unfortunately that host is still on the old switches, so we have less data on the network side
[10:44:44] we can see the drop of traffic on its network port, https://librenms.wikimedia.org/device/device=160/tab=port/port=30781/
[10:45:11] but no signs of saturation for example
[10:47:41] OK, many thanks both. That drop in traffic matches what I saw from the node_exporter as well. I'll go back to looking at postgres logs.
[11:19:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:26:25] 10Mail, 06Infrastructure-Foundations: Updating forwarding rules for Jimmy@wikipedia.org. - https://phabricator.wikimedia.org/T371884#10044693 (10Ladsgroup) a:03Ladsgroup Hi, let me come and help you.
[12:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[12:17:56] slyngs: o/
[12:18:06] \o
[12:18:10] I tried the same sretest1001 test with debmonitor-server on debmonitor1003 stopped, same result..
[12:18:36] But why?
[12:19:11] my theory was that debmonitor1003 was holding some locks on the db, causing the test to fail
[12:20:09] after adding some prints() to the code, it seems that it fails on some specific package, tried to delete it from sretest1001 but then it failed on another one
[12:20:27] Django doesn't really do that, not locks at least, it could be a race condition writing to the database, but that should not happen when 1003 is down
[12:20:41] Do any packages work?
[12:20:53] yeah exactly, I thought of a race as well, just didn't mention it
[12:21:10] some packages do get through afaics, most of them
[12:21:51] Okay, what happens if the client runs twice?
[12:22:14] So we delete all packages, run the client, some packages fail, then rerun... do they then work?
[12:23:29] sometimes yes, but I also got two failures in a row IIRC
[12:23:43] The thinking being: We have one package installed, but an upgrade is available
[12:24:42] So the package object is created, but in the belly of the get_or_create thingy, we then show up asking for the same package but a new version, and now the package info isn't actually written to the database
[12:25:06] but only on 2003?
[12:25:10] this is the part that doesn't make sense
[12:25:24] Oh no, yeah, no that's wrong :-)
[12:25:56] also I was thinking earlier on that there must have been a reason to not have debmonitor.discovery.wmnet active/active
[12:26:43] I wanted to try https://gerrit.wikimedia.org/r/c/operations/dns/+/1060094 but at this point I am not sure if it can ever work
[12:27:28] It would be interesting to try, but yeah I don't see why it would work
[12:29:23] 10netbox, 06Infrastructure-Foundations: Upgrade Netbox to 4.1 - https://phabricator.wikimedia.org/T371889 (10ayounsi) 03NEW
[12:29:44] elukey: https://gerrit.wikimedia.org/r/c/operations/software/debmonitor/+/1051298/3/debmonitor/hosts/views.py <- Maybe this broke it?
[12:29:51] It's just a guess
[12:30:22] elukey: I just have to do a school run, I'll be back in a bit
[12:30:22] that one is not yet deployed :(
[12:30:27] ack! Thanks :)
[12:30:44] It's not in 0.5?
[12:31:17] it is, but we have 0.4 on 2003 and 1003
[12:32:16] Ah
[12:51:32] 10netbox, 06Infrastructure-Foundations: pynetbox incompatibility with Netbox >= 4.0.6 - https://phabricator.wikimedia.org/T371890 (10ayounsi) 03NEW p:05Triage→03High
[12:56:35] est4
[12:56:45] ?
[12:56:57] Anyway, back :-)
[12:57:44] :)
[12:59:04] 10netbox, 06Infrastructure-Foundations: Netbox report test_mgmt_dns_hostname - rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10044889 (10ayounsi)
[12:59:15] 10netbox, 06Infrastructure-Foundations: Netbox: capirca.getHosts script runs into timeout - https://phabricator.wikimedia.org/T358339#10044887 (10ayounsi) →14Duplicate dup:03T341843
[13:00:27] 10netbox, 06Infrastructure-Foundations: Netbox rq.timeouts.JobTimeoutException - https://phabricator.wikimedia.org/T341843#10044912 (10ayounsi) 05Stalled→03Open p:05Medium→03High
[13:07:43] elukey: At this point I'm tempted to just switch DNS and see what happens
[13:11:57] 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: New hosts with "Netbox status: unknown" - https://phabricator.wikimedia.org/T371653#10044950 (10ayounsi) 05Open→03Resolved
[13:13:59] slyngs: I had the same idea but I think it would have the same failure mode, no?
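A minimal sketch of the race theorized at 12:24 above, assuming the Django ORM; the Package model, its fields, and the unique constraint are illustrative assumptions, not debmonitor's actual schema:

```python
# A sketch of the suspected get_or_create race; the model here is an
# illustrative assumption, not debmonitor's actual schema.
from django.db import models


class Package(models.Model):
    """Hypothetical model: one row per (name, version) pair."""

    name = models.CharField(max_length=255)
    version = models.CharField(max_length=255)

    class Meta:
        app_label = "hosts"  # assumed app name, needed for a standalone sketch
        constraints = [
            models.UniqueConstraint(fields=["name", "version"], name="uniq_pkg")
        ]


def upsert_package(name: str, version: str) -> Package:
    # get_or_create() first tries get(); two concurrent requests can both
    # miss it and both attempt the INSERT. Django retries the get() after
    # the loser's IntegrityError, but only with these same lookup kwargs:
    # if the conflicting row doesn't match them (e.g. same name, different
    # version under a name-only constraint), the error propagates and the
    # view answers with the HTTP 500 the client was seeing.
    pkg, _created = Package.objects.get_or_create(name=name, version=version)
    return pkg
```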
[13:14:30] I would very much assume so
[13:16:18] Have you noticed that the 2003 seems much, much slower?
[13:16:31] I mean I get that it also fails, but even before that
[13:22:57] this is a very good point
[13:23:16] in theory it makes a cross-DC connection every time
[13:23:21] to get to the db
[13:23:34] debmonitor2003 -> m2-master eqiad
[13:23:37] ping time for 1003 is 0.4 ms
[13:23:49] 2003 is 30ms
[13:24:06] yeah that explains the slowness
[13:24:40] But does it explain the breaky-ness?
[13:26:03] IIUC the client gets an HTTP 500, that seems to be unrelated to the speed
[13:26:16] but it is indeed the only big difference that we've found so far
[13:27:29] I am reading https://docs.djangoproject.com/en/5.0/ref/databases/#persistent-database-connections
[13:27:48] lemme test it
[13:29:03] Worth a test :-)
[13:30:35] mmm not entirely sure where to put it in config.json
[13:30:53] 10netbox, 06Infrastructure-Foundations: Netbox: Remove leftovers of CAS auth - https://phabricator.wikimedia.org/T371892 (10ayounsi) 03NEW p:05Triage→03Low
[13:30:54] should I add it under MYSQL or using a DATABASES entry?
[13:31:54] Probably under the MYSQL
[13:32:36] 10netbox, 06Infrastructure-Foundations: Netbox: Remove leftovers of CAS auth - https://phabricator.wikimedia.org/T371892#10045012 (10SLyngshede-WMF) a:03SLyngshede-WMF
[13:34:27] If the config used the Python syntax, the MYSQL would be under a "DATABASES" entry anyway
[13:35:49] I have a horrible suspicion
[13:35:52] WARNING:urllib3.connectionpool:Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='debmonitor.discovery.wmnet', port=443): Read timed out. (read timeout=30)")': /hosts/sretest1001.eqiad.wmnet/update
[13:36:06] ooooh, the query takes too long
[13:36:15] this is on the client
[13:36:31] so now I think that the client gives up and causes the exception on the server side
[13:37:31] REQUEST_TIMEOUT = (3.05, 30)  # (connect, read) see https://docs.python-requests.org/en/master/user/advanced/#timeouts
[13:37:41] Bump it to 60 and try?
[13:38:58] yeah I was reading the same
[13:39:49] I'm bumping it and trying
[13:40:35] slyngs: lemme drop the host data first
[13:40:48] ok done
[13:41:06] I think we can modify the client's python directly
[13:41:11] on sretest1001
[13:41:15] That's what I did
[13:41:30] super
[13:41:33] lemme know how it goes
[13:44:12] 60 sec isn't enough, but I think you're right about the sequence of things
[13:44:43] It didn't break until the timeout
[13:44:55] if you think about it, ~500 packages and 30ms each time takes a ton of time
[13:45:07] and I think it is all due to the CONN_MAX_AGE
[13:45:14] or, part of it
[13:45:38] okok now things make more sense :D
[13:45:46] I'll just set it to 120 and try again
[13:45:51] ack
[13:46:06] And start a stopwatch, just to see
[13:48:02] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/debmonitor/+/refs/heads/master/debmonitor/debmonitor/settings/base.py#97
[13:48:17] so MYSQL in the config is a wrapper for DATABASES
[13:48:47] as a second test we can modify debmonitor-server's base.py on 2003 and restart
[13:50:09] I'll just bump the timeout back down
[13:50:19] did it work?
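Back-of-the-envelope math for the "~500 packages and 30ms each" point, plus the client knob being bumped; a sketch where the queries-per-package count, URL, and payload are assumptions, while the other numbers come from the chat:

```python
import requests

# Rough numbers for the cross-DC theory.
packages = 500           # ~size of a full host update
rtt = 0.030              # debmonitor2003 -> m2-master eqiad, 30 ms
queries_per_package = 2  # assumed: at least a SELECT plus an INSERT/UPDATE
print(f"~{packages * rtt * queries_per_package:.0f}s of DB round trips")
# -> ~30s, i.e. right at the client's 30s read timeout

# The client-side knob: python-requests accepts a (connect, read) timeout
# tuple. URL and payload are placeholders, not the real client call.
REQUEST_TIMEOUT = (3.05, 30)  # raised to (3.05, 120) for the test
requests.post(
    "https://debmonitor.discovery.wmnet/hosts/sretest1001.eqiad.wmnet/update",
    json={"packages": []},  # placeholder body
    timeout=REQUEST_TIMEOUT,
)
```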
[13:50:21] It took two attempts at 120s to complete the run
[13:50:39] INFO:debmonitor:Successfully sent the full update to the DebMonitor server
[13:50:56] But two attempts and the 120s timeout
[13:51:20] And then it can just run normally
[13:51:22] ack wow
[13:51:44] Timeout is back at 30
[13:52:41] testing it now
[13:52:52] self.close_at = None if max_age is None else time.monotonic() + max_age
[13:53:36] TypeError: unsupported operand type(s) for +: 'float' and 'str'
[13:53:58] Maybe try 120.0?
[13:54:07] yes yes, fixed
[13:54:20] even if I am not confident it will work
[13:54:33] yes exactly
[13:54:35] so from https://docs.djangoproject.com/en/5.0/ref/databases/#persistent-database-connections
[13:54:50] IIUC it keeps the connection open between HTTP requests, and we make one
[13:55:58] in our code we loop through all the packages, so even if the db conn is open it may take some tens of ms to update the db
[13:56:06] summing all those up gets to the 30
[13:56:11] *30s quickly
[13:56:29] so basically our failover host is not really good atm
[13:57:26] * elukey sigh
[13:58:16] 10CAS-SSO, 10netbox, 06Infrastructure-Foundations, 13Patch-For-Review: Move Netbox authentication to python-social-auth - https://phabricator.wikimedia.org/T308002#10045152 (10SLyngshede-WMF) 05In progress→03Resolved
[13:59:15] at this point we have two roads:
[13:59:42] And both lead to Rome?
[14:00:16] 1) we roll out debmonitor 0.5 and basically test on 1003, and roll back if anything looks wrong. Not great, but debmonitor isn't that state-critical, we could recover if there are errors.
[14:00:44] 2) we find a way to make the 2003 node more performant, even if it may take some more code (so a newer release etc..)
[14:00:48] slyngs: :D
[14:01:13] if we do 1), then we have to find a solution for the failover anyway, but later
[14:01:45] 3: Spin up 1004 and upgrade that
[14:02:51] But do 2 anyway, because the failover isn't really useful right now
[14:02:52] right, we could do it, but it may be overkill (time wise) just to test the new release.. at this point 2) could be a better investment of time
[14:02:59] True
[14:04:16] If we upgrade 1003 by manually installing the deb package, not using the repo, then if it doesn't work we just uninstall the package and apt install debmonitor-server again
[14:05:33] so 0.5 is already in apt, on 2003 I copied the 0.4 deb from the apt cache and used it to roll back
[14:05:48] that worked nicely anyway, the rollback is just a dpkg -i
[14:05:53] so pretty quick
[14:05:53] Question: what if we point 2003 to m2-master.codfw.wmnet, rather than m2-master.eqiad.wmnet
[14:06:11] Is that dangerous?
[14:06:58] in theory no, but there is no sync between eqiad and codfw afaics and we'd get into a split-brain scenario
[14:07:13] Let's not do that then
[14:07:47] the nature of debmonitor should, in theory, guarantee that at some point the codfw db would converge to eqiad (after say a couple of days of traffic)
[14:09:24] and we'd be really DC independent, which we are not now
[14:09:58] for example, in a DC switchover we'd really switch DC for debmonitor too
[14:11:39] I kinda like this option
[14:17:05] slyngs: I'd be in favor of upgrading 1003 (and rolling back if needed), and architecting a new 2003 config as a second step. wdyt?
[14:18:10] What would the new config entail?
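For reference, a sketch of the setting being tested at 13:52, assuming the MYSQL block in config.json is mapped onto Django's DATABASES as settings/base.py does; the host and database name are illustrative:

```python
# Sketch of Django's persistent-connection setting. CONN_MAX_AGE must be
# a number: Django computes
#     self.close_at = time.monotonic() + max_age
# so a string "120" (e.g. read verbatim from config.json) raises the
# TypeError pasted above, while 120 or 120.0 works.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.mysql",
        "HOST": "m2-master.eqiad.wmnet",  # illustrative
        "NAME": "debmonitor",             # illustrative
        "CONN_MAX_AGE": 120,  # seconds; 0 (the default) closes per request,
                              # None keeps connections open indefinitely
    }
}
```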
[14:18:19] Otherwise: Yes
[14:19:13] slyngs: still to be understood, I like your idea about using the local codfw db, but then the failover procedure would need to be investigated
[14:19:46] I feel it's fairly safe to upgrade
[14:20:05] ack, doing so, I am not proud of it but this is also blocking the rollout of the new client etc..
[14:20:12] proceeding, thanks a lot for the help
[14:20:21] Anytime
[14:21:04] I kinda wonder if the whole debmonitor-server processing should be a background process, rather than attempting to do it all in a single request
[14:25:31] this is a good point
[14:25:47] or at least, using the ORM to update all at once, not package-by-package
[14:26:07] ok debmonitor-server rolled out, and our test worked fine
[14:26:13] Nice
[14:26:27] I'll leave it running for today, and tomorrow I'll start to roll out the new debmonitor-client
[14:26:38] and now I am going to open a task with our discoveries..
[14:26:48] slyngs: I owe you a big one!
[14:27:22] Well, you figured it out :-)
[14:27:52] couldn't have done it without you, I'd still be banging my head against the wall :)
[15:06:42] 10SRE-tools, 06Infrastructure-Foundations: Allow debmonitor to store the Debian version-id in the OS field - https://phabricator.wikimedia.org/T368744#10045418 (10elukey) The issue is described in T371899. I proceeded anyway to upgrade both debmonitor server hosts, all good so far. Next step: upgrade the debm...
[15:06:43] created https://phabricator.wikimedia.org/T371899 to summarize what happened
[15:14:02] 10netbox, 06Infrastructure-Foundations: Upgrade Netbox to 4.1 - https://phabricator.wikimedia.org/T371889#10045464 (10Aklapper)
[15:19:23] FIRING: [3x] SystemdUnitFailed: netbox_report_coherence_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:57:09] 10netops, 06Infrastructure-Foundations, 06SRE: cloudsw1-d5-eqiad instability Aug 6 2024 - https://phabricator.wikimedia.org/T371879#10045702 (10cmooney) Just to update on the situation: things remain stable since the changes earlier on. ` cmooney@cloudsw1-d5-eqiad> show bgp summary | match "^[0-9]" 10.64....
[16:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[16:58:20] 10netops, 06Infrastructure-Foundations, 06SRE, 06Traffic: Upgrade anycast-healthchecker to 0.9.8 (from 0.9.1-1+wmf12u1) - https://phabricator.wikimedia.org/T370068#10046062 (10ssingh) 05Open→03Resolved We have upgraded all DNS boxes, Wikimedia DNS and durum hosts to the latest version of anycast-he...
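A hedged sketch of the batching idea from 14:25, reusing the illustrative Package model from the earlier sketch; a real change would also have to handle upgraded and removed packages, which this ignores:

```python
def store_packages(host_packages):
    """Batch the per-package writes into one INSERT instead of ~500
    sequential round trips, each paying the cross-DC latency."""
    rows = [Package(name=name, version=version) for name, version in host_packages]
    # ignore_conflicts (Django >= 2.2) skips rows that already exist
    # instead of raising IntegrityError on duplicates; the whole batch
    # goes over in one (or a few, with batch_size) round trips.
    Package.objects.bulk_create(rows, ignore_conflicts=True)
```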
[17:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[18:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:35:48] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange
[19:55:41] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:03:43] FIRING: [2x] SystemdUnitCrashLoop: rq-netbox.service crashloop on netbox1002:9100 - TODO - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitCrashLoop
[20:17:12] 10CAS-SSO, 10Beta-Cluster-Infrastructure, 10Bitu, 06cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10046623 (10bd808) Here is the developer account record: ` $ ldap uid=jenkins-deploy dn: uid=jenkins-deploy,ou=people,dc=wik...
[20:41:31] 10CAS-SSO, 10Beta-Cluster-Infrastructure, 10Bitu, 06cloud-services-team, and 2 others: Wikitech system account and SUL for Jenkins agents? - https://phabricator.wikimedia.org/T371930#10046726 (10Dzahn) Given that users are always supposed to use different keys for prod vs cloud, should the system user also...
[20:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:59:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:54:23] FIRING: [4x] SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:36:03] FIRING: PuppetConstantChange: Puppet performing a change on every puppet run on netbox1002:9100 - https://puppetboard.wikimedia.org/nodes?status=changed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetConstantChange