[15:36:07] bd808: swfrench-wmf: I'm revising the PHP upgrade checklist further to also cover the Patchdemo and Beta scap issue we encountered after dropping PHP 8.1.
[15:36:19] looking at T411235, I'm not sure what actually had to change.
[15:36:20] T411235: Beta cluster scap using php8.1 container; php8.2 is now required - https://phabricator.wikimedia.org/T411235
[15:36:30] From what I can tell, the "fix" was already in Scap upstream but not yet deployed.
[15:36:34] But what was the fix?
[15:37:09] and more specifically, what does it depend on / where can I stick it in the checklist?
[15:37:37] Krinkle: the issue would have been that Beta was using an old version of scap, which defaulted to using a PHP 8.1-based image to run maintenance scripts as part of the deployment process.
[15:37:58] whereas the latest scap has already switched to using an 8.3-based image
[15:38:11] catching up Beta to a more recent version of scap was the fix
[15:39:21] swfrench-wmf: hm.. what kind of image? is this an image from the production docker registry?
[15:39:25] so, the process change would be: once scap switches to using the new PHP version for those use cases, scap in Beta needs to be updated as well to reflect that change *before* the old PHP version is no longer used
[15:39:39] I assume that image does not contain MediaWiki itself, but is something more generic?
[15:40:28] correct, yeah - it's scap's `mediawiki_runtime_image`: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/f51f4ab7c93c48100b37546ebcb519931d3c885f/scap/config.py#L120
[15:41:37] i.e., it serves as an appropriate-PHP-version "base" image from which scap can run MediaWiki maintenance scripts with /srv/mediawiki mounted into the container
[15:42:04] Is this a Beta-only thing because it has to re-use part of Scap whilst also syncing to bare metal? Or is this step also used as-is in prod when building images?
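[Editor's note: as a hedged illustration of the setting discussed above — scap reads INI-style configuration, and the image used for containerized maintenance runs is controlled by the `mediawiki_runtime_image` key linked from config.py. The section name and image reference below are assumptions for illustration, not the actual production values:]

```ini
; Hypothetical scap configuration override (image name illustrative only)
[global]
; Image scap uses to run MediaWiki maintenance scripts with
; /srv/mediawiki mounted in; bumping the PHP version here is the
; "switch" discussed above.
mediawiki_runtime_image: docker-registry.wikimedia.org/php8.3-runtime:latest
```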
[15:42:27] I would have assumed that prod runs these prep steps inside the Dockerfile, when building the image, using the code/versions in that image.
[15:44:07] so, the same functionality is used in production: updating `mediawiki_runtime_image` and deploying a new scap is how we shift these use cases to the new PHP version
[15:44:34] ah, I see where the potentially confusing bit is: these steps are performed *outside* the MediaWiki image build process, as a pre-step
[15:44:42] they used to use the on-host PHP installation
[15:44:58] but that changed in order to support wmf/next images built off the deployment host IIRC
[15:45:23] this pre-step is basically preparation that happens on the host's /srv/mediawiki-staging
[15:45:48] I see. OK. I'll set aside my confusion about why we don't (or can't/shouldn't) do this inside the build, but fair enough, we got there incrementally and maybe there's a reason it's better this way.
[15:46:10] So I'm thinking of perhaps moving this step earlier in the process, so that there isn't a late "oh, and update Beta" at the end.
[15:46:11] then various MediaWiki images are built by rsync'ing that path on the host into the image, and committing that layer
[15:46:19] right now, I think you did it after the rollout completed in prod.
[15:46:51] would it be okay to do this as part of prep, once the next image is available, after beta/CI/dev have switched, shortly before the first % rollout?
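[Editor's note: a rough sketch of what the containerized pre-step described above amounts to. Everything concrete here (image name, script path, flags) is an assumption for illustration; the command is echoed rather than executed, since the registry image and paths only exist on WMF deployment hosts:]

```shell
#!/bin/sh
# Hypothetical sketch: run a MediaWiki maintenance script inside the
# configured runtime image, with the host's staged MediaWiki tree
# mounted into the container. Names below are illustrative.
IMAGE="docker-registry.wikimedia.org/php8.3-runtime:latest"  # assumed name

# Echo (not run) the kind of invocation scap would perform:
echo docker run --rm \
    --volume /srv/mediawiki:/srv/mediawiki \
    "$IMAGE" \
    php /srv/mediawiki/multiversion/MWScript.php rebuildLocalisationCache.php
```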
[15:51:55] so, the motivation behind putting it later in the migration is that we want high confidence that the scripts (e.g., rebuildLocalisationCache.php) will work as expected on the new version, since producing incorrect output would have a wide blast radius (the output is shared across PHP versions)
[15:52:39] if that's not an issue we're likely to encounter in practice, it could certainly shift earlier
[15:53:15] Yeah, so there's some osmosis there in terms of general confidence, but the time passing does not give exposure to rebuildLocalisationCache.php and its unique code paths; afaik we didn't run that anywhere on PHP 8.3 until the switch.
[15:54:50] actually, no, I'm wrong. We use rebuildLocalisationCache.php (or rather, the underlying LocalisationCache methods) in dev env and CI all the time on web requests. It's disabled in prod in favor of this script (manualRecache)
[15:55:21] I think switching that in Scap, after dev and CI are on PHP 8.3 for a few weeks and passing, and after proactive testing on WikimediaDebug/next and Beta Cluster is completed, should suffice.
[15:55:52] If it isn't, then we should probably have some kind of test for it that we do trust, because I don't think the rollout does that for us.
[15:56:05] got it - yeah, if you don't think the delay is useful, then shifting it earlier in the process sounds good. FWIW, the only testing I'm aware of is the ad-hoc testing done as part of preparing the scap change - see, e.g., https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/1044#note_175429.
[15:56:07] I agree it's a big exposed switch
[15:57:29] in any case, the key process improvement w.r.t. Beta is that someone needs to update scap there :)
[15:58:00] i.e., when `mediawiki_runtime_image` is updated to switch to the new PHP version
[15:58:16] Yeah.
If we can place the step where I'm proposing it, that can be a single bullet point (change Scap to use PHP X.Y, and update the Scap package in prod and Beta Cluster)
[16:00:51] So the assurances we will have: 1) CI passing on PHP 8.3, which exercises LocalisationCache, fallback/inheritance between languages, CDB reads, etc., 2) dev and beta generally operating on PHP 8.3, 3) completed manual testing on WikimediaDebug and Beta Cluster, 4) Scap logstash checks for MW/PHP errors on the new l10n cache, 5) Scap swagger health checks for various endpoints to return HTTP 200, including a non-English test.
[16:01:23] I haven't checked numbers 4 and 5 since before the move to k8s, so these are assumptions on my part that we haven't lost those.
[16:03:30] https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/219 "swagger checks only happen for bare metal canaries at this time. A ticket will be filed to deal with that for mw-on-k8s."
[16:03:42] https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/230 "Removed the code to perform swagger checks on canaries, which has been deemed to be of low value these days."
[16:03:45] I guess that answers that.
[16:05:01] So that means when logstash is lagged/delayed, scap will be oblivious to increases in warnings/errors.
[16:06:23] so, what we currently have are the httpbb checks that run in the testserver phase, and then the logstash checks that run in the canary phase. I doubt the former has much sensitivity to l10n issues (though I may be mistaken).
[16:06:54] Ah right, we didn't use to run httpbb but now we do; that's quite similar indeed.
[16:07:15] that can do HTTP status code checks as well as response body checks, right?
[16:07:32] yes
[16:07:40] yes, one can assert that the response body content matches (e.g., a regexp)
[16:08:32] ok, well, I was wrong about swagger checks covering non-English.
[16:08:32] https://github.com/wikimedia/operations-mediawiki-config/blob/1e6f68a2dabbe7802db142f9421b15f90381ad2c/docroot/wikipedia.org/spec.yaml#L4
[16:08:45] we host that in a shared docroot, so it is limited to stuff that works on all domains equally.
[16:09:39] but https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/files/httpbb/appserver/test_main.yaml covers several non-English projects
[16:10:06] not any interface messages, but that should be okay. Asserting those would be prone to failures anyway when the messages change and/or are modified on-wiki.
[16:10:21] MW should fail when that breaks, and the rest is covered by PHPUnit in CI.
[16:11:41] ah, cool - this was the part I wasn't sure about, i.e., whether simply exercising some non-English projects would be sufficient (without explicitly testing interface messages, which as you note would be brittle).
[16:12:37] alright, combined with the fact that we tie merging the scap change to manual testing in train-dev anyway, I think I'm happy with shifting this earlier.
[16:14:28] thanks. Is a Scap release a separate step from a Scap upgrade in practice? Or is this semi-automated, such that you generally build the next semver package and deploy it right away?
[16:16:30] once the MR is merged, it's usually a coordinated effort between SRE and RelEng (typically d.ancy). actually upgrading scap on the deployment hosts is a manual process.
[16:17:35] (building the semver release is an automated workflow kicked off manually, IIRC)
[16:18:13] which is to say, it's the normal scap release / upgrade process, just with a bit more coordination
[16:19:56] okay
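[Editor's note: the httpbb suites discussed above (like the test_main.yaml linked) are YAML files keyed by target host, with per-path assertions on status and body. A minimal sketch of what a non-English project check could look like; the host, path, expected string, and field names here are assumptions based on the existing suites in operations-puppet, not an exact copy:]

```yaml
# Hypothetical httpbb suite fragment (host, path, and expected string
# are illustrative; assertion field names assumed from existing suites)
https://de.wikipedia.org:
- path: /wiki/Wikipedia:Hauptseite
  assert_status: 200
  assert_body_contains: Hauptseite
```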