[00:08:23] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (Krinkle) I don't think we need per-jobtype control here. From my perspective, the important points here are: 1. we can switch PHP versions by cookie for appservers in...
[06:20:11] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) In hiera there is profile::mediawiki::php::php_versions, which is an array of PHP versions like [ "7.2", "7.4" ]. The default is the first element in the ar...
[06:53:50] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) I tested swapping the versions in horizon hiera. This caused puppet to break PHP-FPM by swapping the port numbers -- php7.4-fpm failed to start because port...
[08:09:16] image-suggestion has started 404'ing on /, is that known/expected ?
[09:19:35] hnowlan: ^^
[09:25:34] godog: I think it has more or less always done that, is it causing issues?
[09:27:14] hnowlan: I noticed because its network probes have started failing yesterday at ~13 UTC, no issues per-se
[09:27:36] however if 404 is expected then the 'probes' section in service::catalog needs to be adjusted
[09:27:57] I'll find an example
[09:28:53] e.g. 'inference' has valid_status_codes: - 404
[09:29:35] aha, I think I deployed a version of it around then but that shouldn't have changed afaik
[09:29:40] godog: ahh good to know
[09:29:55] I could also set `path: /healthz` for something a bit more representative
[09:30:19] hnowlan: yeah that'd be even better I think
[09:31:43] I was checking sal earlier but found no recent deploys searching 'suggestion', is that expected ?
[09:31:49] latest I can see is may 25th
[09:31:53] https://sal.toolforge.org/production?p=0&q=suggestion&d=
[09:33:10] That's pretty weird... it's a standard helmfile pattern, just used it yesterday: hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: sync
[09:33:21] (although weirdly that was at 1500 UTC, not 1300)
[09:34:18] curious indeed, looks like those messages didn't make it to sal
[09:34:28] I can see them on my irc backlog alright
[09:34:55] stashbot isn't acking them though
[09:35:12] I'll slowly back away from that rabbit hole for now, can see recent entries for 'apply'
[09:35:45] anyways hnowlan feel free to send reviews my way re: service catalog probes! happy to take a look
[09:35:53] godog: will do, thanks!
[10:15:24] oh, oops - ~1300 is almost certainly when I changed the service catalogue state so that makes sense.
[10:18:53] serviceops, Generated Data Platform, Image-Suggestions, SRE, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (hnowlan) This is pretty much done. We currently only have two main metrics for the service so there's a very ba...
[10:20:42] serviceops, Generated Data Platform, Image-Suggestions, SRE, and 3 others: New Service Request Generated Datasets: Image Suggestions Service - https://phabricator.wikimedia.org/T304891 (hnowlan) Open→Resolved a:hnowlan
[10:30:31] serviceops: PendingDeprecationWarning on update_version.py - https://phabricator.wikimedia.org/T310133 (TheresNoTime)
[12:03:43] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) A workaround for this is to just increment the port number each time the version set is changed. ` profile::mediawiki::php::php_versions: - '7.4' - '7.2' p...
[12:05:23] serviceops, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Jelto) Resolved→Open puppet runs on the test instance `gitlab-prod-1001` fail with ` Error: /File[/var/lib/puppet/facts.d]: Failed to generate additional...
[12:43:26] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (tstarling) >>! In T295578#7987463, @Krinkle wrote: > presumably doesn't have a built-in way to randomly set a header for a % of traffic Right, I think that would be a...
[12:53:45] serviceops, Patch-For-Review: Test running php7.2 and php7.4 in parallel on the beta cluster - https://phabricator.wikimedia.org/T295578 (JMeybohm) Yes, the idea was to have jobqueue send the header so it can be set per job type and would not require reconfiguration of the servers (as the PHP_VERSION coo...
[15:18:48] serviceops, GitLab (Infrastructure), Patch-For-Review: Migrate gitlab-test instance to puppet - https://phabricator.wikimedia.org/T297411 (Dzahn) Does this only affect this instance or maybe all users who have a local puppetmaster in their VPS project? It seems like we haven't touched anything and it...
[15:39:08] serviceops, Beta-Cluster-Infrastructure, SRE, Scap, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (dancy) Noting the following settings from the deployment-prep horizon project puppet config page: ` profile:...
[15:54:54] serviceops, Beta-Cluster-Infrastructure, SRE, Scap, Release-Engineering-Team (Seen): Scap can't clear opcache on mw servers in Beta Cluster - https://phabricator.wikimedia.org/T237033 (dancy) I'm going to change profile::mediawiki::php::restarts::ensure to true and see how things go.
[20:22:42] serviceops, Deployments, Wikimedia-production-error: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225 (Krinkle) https://sal.toolforge.org/production?p=0&q=mw1415&d= > 2022-05-09: > * serviceops, SRE, ops-eqiad: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Dzahn) 21:13 < mutante> !log mw1415 - scap pull, restart apache, /usr/local/sbin/restart-php7.2-fpm (INFO: The server is depooled from all services. Restarting the service directly)
[21:45:29] serviceops, SRE, ops-eqiad: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Dzahn) This caused T310225 because setting it to pooled=inactive does not mean monitoring will stop checking it and when this came back unexpectedly it caused new alerts for 500s on...
[21:46:19] serviceops, SRE, ops-eqiad: mw1415 (canary appserver) is down, incl. mgmt - https://phabricator.wikimedia.org/T307755 (Dzahn) In progress→Resolved a:Dzahn
[21:46:27] serviceops, Deployments, Wikimedia-production-error: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225 (Dzahn) mw1415 does not service 500s anymore. T307755#7990623
[21:50:51] serviceops, Deployments, Wikimedia-production-error: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225 (Dzahn) What happened here is: The machine died on May 5th. Ticket was opened with dcops to...
[22:40:06] serviceops, Deployments, Wikimedia-production-error: mw1415 fatals due to serving responses from 1.39.0-wmf.10 (was DBQueryError: Unknown column page_restrictions) - https://phabricator.wikimedia.org/T310225 (dancy) Open→Resolved a:dancy Thanks for the summary @Dzahn.