[07:10:15] denisse|m: It depends on the context, but for example:
[07:10:15] https://github.com/wikimedia/puppet/blob/production/modules/httpd/manifests/init.pp#L98
[07:10:15] then $modules is set https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/netmon/httpd.pp#L2
[07:10:15] or https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/miscweb/httpd.pp#L7
[08:07:16] Just a heads up: nginx on apt.wikimedia.org has been replaced by Apache2. Everything seems to work, but let me know if you find something that doesn't
[08:09:46] slyngs: out of curiosity.. what triggered that change?
[08:13:24] slyngs: from -operations alert.. a quick check with openssl s_client shows that apt.wm.o isn't sending OCSP stapled data: "OCSP response: no response sent"
[08:16:39] vgutierrez: We needed to add the private apt repo, and moritzm suggested moving to Apache at the same time, I think to reduce the number of different webservers
[08:16:51] I'll take a look at the OCSP
[08:19:03] vgutierrez: Yeah, okay, I didn't get that enabled. I'll create a patch for that. Sorry
[08:25:35] slyngs: AFAIK we don't do OCSP stapling with Apache2
[08:25:58] No, it's not in the default SSL options
[08:26:20] Is there some reason why we won't?
[08:26:26] or don't?
[08:31:22] cause Apache2 doesn't provide support for pre-fetched OCSP stapling data AFAIK
[08:38:00] I'll just try enabling it manually on the standby. I can't find anything that says that prefetch isn't supported
[08:40:24] so you can point apache2 to the OCSP response provided by acme-chief?
[08:41:03] otherwise apache2 will read the OCSP endpoint info from the X509 certificate and try to fetch that at runtime
[08:42:08] Ah, that way, that I did not check
[08:45:18] (IIRC mod_md should provide better support for OCSP stapling, but we don't use it in here afaik - https://httpd.apache.org/docs/2.4/mod/mod_md.html)
[08:49:01] Are we concerned about speed, or is it due to certificate renewal that Apache's stapling support isn't sufficient?
[08:51:22] slyngs: consistency and performance
[08:51:42] performance meaning that we don't want requests being blocked while an OCSP response status is fetched
[08:52:10] and consistency meaning that we aim for being able to provide OCSP response on every request that hits the https service
[08:52:43] Then it would have made more sense to add the private repo to nginx :-)
[08:53:06] I just want to test something, and then I might have to roll the patch back
[08:53:16] we don't have LDAP support for nginx AFAIK
[08:53:54] so in this case we should disable OCSP stapling checks on apt.wm.o and assume that we cannot provide OCSP stapling for apt.wm.o
[08:55:22] Or just move the entire thing to nginx... I just want to check if SSLStaplingForceURL accepts local files, it says it takes a URI, file:// is a URI
[08:56:06] hmm that's an OCSP responder URI
[08:56:27] so Apache will try to establish a TCP connection against that
[08:58:29] slyngs: https://github.com/apache/httpd/blob/ea2c84a0e33d7f2564e77633f8a6c56046fa1618/modules/ssl/ssl_util_stapling.c#L533
[08:58:47] https://www.irccloud.com/pastebin/ku3f9m33/
[08:59:00] http is the only supported scheme for that URI
[08:59:31] I could make Apache serve its own ocsp file... but that's kinda weird
[08:59:41] :?
[09:01:28] So apache configuration validator doesn't see the problem with the file:/// but you're right, it won't work
[09:02:30] sure.. it's a valid URI
[09:02:39] but not a supported one for OCSP responder endpoints
[09:05:35] I'll check with Moritz and figure out if we want OCSP or Apache the most. It's pretty trivial to just move the private repo to nginx instead
[09:09:16] meanwhile please ack the icinga alert
[09:13:12] Will do
[09:26:55] So the plan is to rollback, and then move the private repo to nginx as well.
[09:37:40] moritzm: ok to merge your ganeti changes?
[09:38:15] yes, please!
[09:39:15] done!
[14:23:08] <_joe_> is there a way to get systemd to restart multiple units at once using wildcards?
[14:24:32] something like `systemctl restart nova-*` generally works at least for me, as long as that wildcard doesn't match any files
[14:24:43] <_joe_> that's solved by escaping it
[14:25:05] <_joe_> but apparently it only accepts final wildcards
[14:25:16] <_joe_> so nova-* works but nova-*.service does not
[15:25:07] Hi SREs! Can someone restart `keyholder-proxy.service` on deploy1002 ?
[15:25:31] (the config was updated but it doesn't automatically re-read it)
[15:25:51] dancy: done
[15:25:55] Thanks!
[15:26:12] And new access confirmed. Much appreciated!
[15:26:34] np <3
[15:50:29] XioNoX: Thanks Arzhel!
[16:08:19] mutante: rzl: did you see latest issue on -operations? Was similar to the one last week, but in another rack
[16:09:21] I think I will copy and paste from https://wikitech.wikimedia.org/wiki/Incidents/2022-06-21_asw-a2-codfw_accidental_power_cycle unless you disagree
[16:09:59] sounds right to me
[16:10:58] even if impact was very low, it could be useful e.g. to inform whether to do the same thing on eqiad (as it would have been more impactful there)
[17:40:24] Done here: https://wikitech.wikimedia.org/wiki/Incidents/2022-06-30_asw-a4-codfw_accidental_power_cycle
[20:34:29] dancy: thcipriani: _joe_ I've merged today's opcache-like issue with T254209 and added some analysis. Looks like we may not be out of the woods yet. Maybe my unproven theory of "the issue is in how opcache populates itself" has some truth after all, with it happening after resets merely being a statistical likelihood due to more writes happening when it's clear. Although it remains an even bigger mystery to me how it can be that the
[20:34:29] server even recovers on its own after e.g. an hour of failing requests with no manual or scheduled restart nor opcache threshold being reached.
[20:34:30] T254209: Spike of impossible "Cannot declare class" fatal errors - https://phabricator.wikimedia.org/T254209
[20:35:38] <_joe_> Krinkle: the servers did not recover in the past, IIRC they always needed manual intervention
[20:35:55] <_joe_> having said that, I've not seen today's issue in any depth
[20:36:35] <_joe_> uhm this is an issue I've encountered in the past
[20:36:52] I agree, the ones that look like off-by-one always required a restart, e.g. undefined variable or undefined class
[20:38:12] <_joe_> yeah this issue I've already seen and indeed it went away by itself before I could pinpoint the problem
[20:38:34] <_joe_> it's a different type of problem that is also opcache-related probably
[20:38:37] <_joe_> I agree with that
[20:39:02] <_joe_> but I don't think that it's one of those "abruptly a class name changes when it should not"
[20:41:46] Indeed. It's reporting its name correctly, and it's reporting that not only are we correctly trying to declare it, it's claiming it has in fact already been declared before this point.
[20:42:32] but it seems like a problem that just generally breaks PHP completely as we only see this error for the first class in a process.
[20:42:52] The first class we try to declare in a web req process used to be XWikimediaDebug via autoprepend->profiler.
[20:43:05] since I've moved the Profiler code into a class last month, the first class we define is now Profiler.
[20:43:12] and it now happens for that file. There is literally no code before it.