[01:30:38] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:30:38] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:22:33] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (Marostegui)
[06:36:20] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui)
[06:47:31] netops, DBA, Infrastructure-Foundations, SRE: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (Marostegui)
[06:48:37] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui) pc2011 is no longer a master, this can be done anytime as the host isn't used.
[06:49:54] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A6 from asw-a6-codfw to lsw1-a6-codfw - https://phabricator.wikimedia.org/T355866 (Marostegui) es2024 is no longer a master.
[07:24:42] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui) db2104 is no longer a master
[07:24:54] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui)
[07:25:05] netops, DBA, Infrastructure-Foundations, SRE, ops-codfw: Migrate servers in codfw rack A5 from asw-a5-codfw to lsw1-a5-codfw - https://phabricator.wikimedia.org/T355864 (Marostegui)
[09:19:33] netops, Ganeti, Infrastructure-Foundations, SRE, Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (ayounsi) Manually configuring IPv6 is straightforward as well once we know a couple points : When enabling forwarding on an interface (for example...
[09:30:38] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:32:01] slyngs, moritzm https://debmonitor.wikimedia.org/hosts/ is a 500
[09:32:22] same for all the internal pages I'm trying
[09:32:37] yep, all of them
[09:33:37] moritzm saw the same thing, it's related to the session cookie or the mod_auth_cas cookie
[09:35:00] We haven't found a better way to deal with it than going in with the browser developer tool (inspector in Firefox) and removing those two cookies
[09:36:07] why does the homepage work?
[09:36:45] That I also don't know, from the django perspective everything returns a 200 (or 302)
[09:39:13] Yeah, all your requests are 200 as well from debmonitor
[09:41:48] I did a pass with incognito and all seems to work fine indeed
[09:42:30] but I guess most SREs will have those cookies, can't we solve it any other way? overwriting them?
[09:42:58] with a force refresh now I got it working on my existing session
[09:43:04] no manual change on cookies
[09:43:18] but some pages still fail, weird
[09:43:39] is it something time-based?
[09:44:17] I believe so, because mine just worked, but I also didn't sign in to CAS before after the switchover
[09:45:32] my 500 request doesn't get to debmonitor's main.log at all
[09:45:38] proxy-server/500
[09:45:56] in debmonitor.wikimedia.org-access.log
[09:48:44] That would make the 500 an Apache thing
[09:49:04] it doesn't get proxied at all
[09:49:10] but I don't see any error log with more details
[09:49:48] I'm fail certain at this point that it's mod_auth_cas, but that's a hard sell without any evidence
[09:49:54] fairly
[09:51:21] What happens if you sign out of CAS on idp.wikimedia.org
[09:53:38] I tried that before, didn't make a difference
[09:56:53] I don't want to lose my repro :D
[09:57:00] and have a meeting in 3m
[09:57:09] Sign out of debmonitor first, and the IDP?
[09:58:58] I also don't exactly understand why you have to sign in to Debmonitor with both CAS and LDAP.
Medium term I suggest just switching to OIDC and removing LDAP and CAS
[09:59:49] debmonitor has no support for cas/oidc, it predates all of that and had direct support for LDAP login
[10:00:04] but if we want to upload it to debian we need to keep multiple login options available
[10:00:15] Sure, but I have to login with both each time
[10:01:07] yep, missing feature
[10:01:51] Fair :-) I think we can add python-social-auth fairly easily, but that's a separate issue from your current problem
[10:06:42] If you have time after your meeting we can try enabling CAS debugging
[10:06:57] ack
[10:08:23] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:18:57] SRE-tools, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[11:49:25] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[12:10:23] slyngs: FYI refreshing the page opened the page now, then I clicked on Images and it worked, then back to Source Packages and I got the 500 again
[12:10:33] sorry the meeting went longer than expected
[12:10:51] and I'm about to go for lunch
[12:11:25] Enjoy, we can debug when you get back.
I really don't understand what triggers the 500 error, if not CAS
[12:12:39] k, later
[12:15:38] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service Failed on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:26] slyngs: I'm around whenever you are
[13:13:50] 2 minutes, just fixing some puppet code I broke
[13:15:46] Okay, ready
[13:21:49] volans: Okay, debug mode is one
[13:21:52] on
[13:24:33] ok
[13:24:52] got a 500 now
[13:24:56] (3rd request)
[13:25:42] https://debmonitor.wikimedia.org/images/ <- This one
[13:26:11] no https://debmonitor.wikimedia.org/source-packages/
[13:26:15] image worked
[13:26:24] Got it: [Thu Feb 01 13:24:48.390893 2024] [proxy_http:debug] [pid 183496:tid 139798752712384] mod_proxy_http.c(1901): [client 127.0.0.1:43998] AH01113: HTTP: declining URL uwsgi://127.0.0.1:8001/source-packages/, referer: https://debmonitor.wikimedia.org/images/
[13:27:31] correct, source-packages coming from images
[13:29:31] It looks like it is declined, but then the request is sent to uwsgi anyway
[13:29:54] uh?
[13:33:20] https://phabricator.wikimedia.org/P56063
[13:34:24] line 31 - 34
[13:37:53] mmmmh
[13:39:07] The old configuration of the uwsgi has both socket and http-socket
[13:39:13] I wonder if that makes a difference
[13:40:05] I don't know what changed with the new setup
[13:40:16] so hard for me to guess what could be the culprit
[13:40:33] do you know if when it says declining it replies 500?
[13:41:14] No, the log doesn't say
[13:41:23] Oh, the configuration
[13:41:29] source code?
[13:41:30] :D
[13:44:22] I'll look it up. Could you give it a try again ? I added the http-socket in, but I can't see why that should make a difference
[13:44:50] sure
[13:45:09] done
[13:45:16] Error right?
[13:45:28] usual 3 requests (refresh works, click on images works, click back on source-packages fails)
[13:50:50] Okay, I feel like I'm getting closer :-)
[13:51:17] AH01113: HTTP: declining URL uwsgi://127.0.0.1:8001/source-packages/ <- It should be using the socket
[13:58:40] Just for the fun of it, would you try again?
[14:00:40] sure :D
[14:00:50] ready?
[14:00:54] done
[14:01:22] WHYYYY!
[14:01:24] fwiw if I open a new tab and paste https://debmonitor.wikimedia.org/source-packages/ it works
[14:01:33] then I click on images and works
[14:01:37] But not if you reload in that tab?
[14:01:38] then I click on source-packages and fails
[14:02:15] I did find someone online who got the same error code: https://phabricator.wikimedia.org/T278612 <- which is a little funny
[14:03:23] lol
[14:05:38] I also have zero idea as to why it worked for moritzm when he deleted those two cookies
[14:05:38] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service Failed on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:06:56] XioNoX: ^ is that you?
[14:07:08] yep, fixe
[14:07:09] d
[14:07:51] I can't seem to repro in incognito though
[14:08:49] No, neither could Moritz
[14:09:25] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service Failed on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[14:13:03] volans: I've updated the uwsgi configuration, minus a few paths it is now EXACTLY like the old one, same for the apache config
[14:13:38] Nope
[14:13:44] ready for repro?
[14:13:52] Yes
[14:14:11] oooh no repro, let me insist
[14:14:46] Still AH01113: HTTP: declining URL uwsgi://127.0.0.1:8001/source-packages/, referer: https://debmonitor.wikimedia.org/images/
[14:14:51] yeah, it worked fine in a new private tab right away and without any glitches
[14:14:56] it's working
[14:15:04] WHY!
[14:15:13] lol no ida :D
[14:15:15] *idea
[14:15:19] but is working
[14:15:35] THE MAIN DIFFERENCE: buffer-size=8192
[14:15:36] no more 500s for me
[14:15:41] I'm clocking around as much as I can
[14:15:44] *clicking
[14:16:23] volans: Have you checked your buffers this morning, might they be too large good sir?
[14:16:36] lol
[14:16:38] that rings a bell
[14:16:52] we bumped that for the env var used by mod_cas: https://phabricator.wikimedia.org/T275599
[14:17:11] Oh...
[14:17:14] so if we lost that setting in the new bookworm config this would explain
[14:17:37] but wouldn't that fail on all requests?
[14:17:42] https://phabricator.wikimedia.org/T275599
[14:18:11] I would expect,
[14:18:23] sorry, wrong paste: https://gerrit.wikimedia.org/r/c/operations/puppet/+/667132
[14:21:48] nice find!
[14:21:55] and good memory :D
[14:22:35] I am a little concerned with the speed at which moritz can find stuff in Phabricator. I think he may have an offline copy and just use grep
[14:22:45] https://gerrit.wikimedia.org/r/c/operations/puppet/+/995043
[14:23:01] If I had good memory, I would have thought about it this morning :-)
[14:23:34] maybe he is running his own LLM against Phabricator
[14:23:41] PhabGPT
[14:23:45] Does Emacs have an LLM?
[14:23:57] It probably does, doesn't it
[14:24:27] PhabGPT :-)
[14:24:40] * sukhe gets the domain name
[14:25:00] .org or .ai
[14:25:16] slyngs: Wikimedia style, I will get all and park them :)
[14:28:22] volans: I've rolled all the debugging off and we're left with the original config, plus the increased buffer. Would you just give it a test?
[14:31:37] sure
[14:31:58] so far so good
[14:32:07] I'll let you know if i get any 500
[14:36:30] nice!
[18:10:38] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:10:38] (SystemdUnitFailed) firing: generate_os_reports.service Failed on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
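
Note on the 14:15-14:18 exchange about buffer-size=8192: the question "but wouldn't that fail on all requests?" is left open in the log. One possible reading, offered here only as an assumption, is that the block of request variables Apache forwards to uWSGI varies in size per request (cookies, referer, CAS attributes), so only some requests exceed uWSGI's documented default buffer-size of 4096 bytes. The Python sketch below estimates that block size using the uwsgi protocol layout (each key and each value preceded by a 2-byte length). The function names and the sample variables are hypothetical, not taken from debmonitor or the puppet config, and the exact limit uWSGI enforces may differ slightly from this estimate.

```python
UWSGI_DEFAULT_BUFFER_SIZE = 4096  # uWSGI's documented default for buffer-size
BUMPED_BUFFER_SIZE = 8192         # the value mentioned in the log


def uwsgi_vars_size(variables: dict[str, str]) -> int:
    """Estimate the size in bytes of a uwsgi-protocol variable block.

    Each entry is encoded as: 2-byte key length, key bytes,
    2-byte value length, value bytes.
    """
    size = 0
    for key, value in variables.items():
        size += 2 + len(key.encode("utf-8"))
        size += 2 + len(value.encode("utf-8"))
    return size


def fits_in_buffer(variables: dict[str, str], buffer_size: int) -> bool:
    """Return True if the estimated block fits in the given buffer size."""
    return uwsgi_vars_size(variables) <= buffer_size


# Hypothetical request variables: a small request, and the same request
# carrying a large cookie (for instance accumulated session/CAS cookies).
small_request = {"REQUEST_URI": "/images/"}
large_request = {**small_request, "HTTP_COOKIE": "x" * 5000}

for name, req in (("small", small_request), ("large", large_request)):
    print(
        name,
        uwsgi_vars_size(req),
        fits_in_buffer(req, UWSGI_DEFAULT_BUFFER_SIZE),
        fits_in_buffer(req, BUMPED_BUFFER_SIZE),
    )
```

Under this assumption, a request would fit in the default buffer until its cookies or forwarded attributes grow past the limit, which would be consistent with fresh private tabs working while long-lived sessions fail. The log itself does not confirm this mechanism.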