[00:09:18] Krinkle: ooh I forgot about that
[00:10:21] * ori found https://wikitech.wikimedia.org/wiki/Performance/Runbook/Puppet_patches#Beta_Cluster_testing
[00:11:02] That's it :)
[00:12:14] thank you!
[01:11:04] when I try to ssh to the beta cluster, I'm getting warnings about possible DNS spoofing - how do I check if I am getting the right key?
[01:14:01] https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints should have it
[01:43:41] https://github.com/grafana/loki ooooooooh
[02:16:55] AntiComposite thanks. But primary.bastion.wmflabs.org isn't listed there
[02:17:18] wait, that's the same as just .wmcloud.org
[02:17:29] (in terms of the ECDSA key I'm getting)
[02:20:42] but then it asks about the fingerprint for deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud (which is what I'm trying to connect to) with SHA256:52RYyM81OIrUEot/L2i9FtkFxoEyhikIMRwSLXL7+N8 but I don't see that host listed on the page
[03:24:25] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:25] Continuous-Integration-Infrastructure, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (DannyS712)
[04:17:35] Continuous-Integration-Infrastructure, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime) p:Triage→Unbreak! Looks like {T308943} again..?
Raising to UBN 😣
[04:18:14] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime)
[04:18:29] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:09] re T309371, need someone to restart zuul per https://phabricator.wikimedia.org/T308943#7947453
[04:25:09] T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371
[05:53:46] DannyS712: you can get the deployment-prep fingerprints from https://config-master.wikimedia.beta.wmflabs.org/
[06:15:59] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (elukey) On contint1001 I see the following in `/var/log/zuul/merger-debug.log`: ` 2022-05-27 04:36:23,233 DEBUG zuul...
[06:24:47] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (Majavah) Indeed looks like the same issue as last time. The [[ https://logstash.wikimedia.org/app/dashboards#/view/AW...
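For the host-key question raised at 01:11 and answered at 05:53, a minimal sketch of how a `SHA256:` fingerprint such as the one ssh printed can be recomputed from a public key line obtained from a trusted source. The function names are illustrative, and the format in which config-master publishes keys is not shown in this log, so the input format (a standard OpenSSH public key or known_hosts line) is an assumption.

```python
import base64
import hashlib


def sha256_fingerprint(public_key_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint of one public key line.

    Accepts "<type> <base64-blob> [comment]" or a known_hosts-style
    "<host> <type> <base64-blob>" line (assumed input formats).
    """
    fields = public_key_line.split()
    # The key blob is the field right after the key type.
    for i, field in enumerate(fields[:-1]):
        if field.startswith(("ssh-", "ecdsa-", "sk-")):
            blob = base64.b64decode(fields[i + 1])
            break
    else:
        raise ValueError("no key type found in line")
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 with the '=' padding removed.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def matches(public_key_line: str, offered_fingerprint: str) -> bool:
    """True if the trusted key line hashes to the fingerprint ssh offered."""
    return sha256_fingerprint(public_key_line) == offered_fingerprint
```

Calling `matches(trusted_line, "SHA256:52RYyM81OIrUEot/L2i9FtkFxoEyhikIMRwSLXL7+N8")` with the key line for deployment-deploy03 would confirm or reject the prompt; `ssh-keygen -lf <file>` does the same comparison from the command line.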
[06:41:24] Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime)
[06:41:40] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime)
[06:41:43] Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime)
[06:41:59] Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime)
[06:42:03] Continuous-Integration-Infrastructure, Release-Engineering-Team: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos - https://phabricator.wikimedia.org/T308943 (TheresNoTime)
[07:27:30] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (SLyngshede-WMF) I've restarted Zuul on contint2001, and that seems to have helped a bit. The Zuul service on contin...
[07:34:20] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime) p:Unbreak!→Triage Thanks @SLyngshede-WMF!
That seems to have sorted it 😄 (//no longer UBN//)
[07:43:39] Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (SLyngshede-WMF)
[07:43:44] Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[08:50:57] hi folks!
[08:51:12] is there a quick way to force https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/800025 to publish the docker image to the registry?
[08:51:21] Or should I run the jenkins job manually?
[08:56:09] elukey: ah that was affected by the gerrit bug? maybe try a "rebuild" on https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-articlequality/87/console ?
[09:02:45] (that is the sum total of my suggestions :-P)
[09:08:02] Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime) I did a bit of digging into what metrics are available in [[ https://wikitech.wikimedia.org/wiki/Prometheus | Prometheus ]] for this, so an [[ h...
[09:12:07] TheresNoTime: o/ I already tried but that is not the job that publishes to the docker registry, I'll try to see if I can kick off the right one
[09:18:16] good luck!
^^
[10:11:53] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[10:13:32] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[11:28:07] <_joe_> elukey: just publish a null patch
[11:29:18] <_joe_> the alternative is to log into jenkins, and re-run that jobs
[11:29:20] <_joe_> *job
[11:29:37] <_joe_> if it even made it to jenkins
[11:39:29] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[11:45:56] elukey: the publish pipeline apparently failed due to a merge conflict
[11:47:19] Even though the patch got merged by gerrit. It is probably an issue with the zuul merger that handled the request
[11:47:58] I am not there today to investigate, but maybe i will remember about it tonight :)
[12:46:20] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[12:55:02] _joe_ yeah I wanted to do it but I was hoping to have something to quickly re-run, rather than gathering parameters for the jenkins job :)
[12:55:35] hasharAway: thanks! yeah there was an issue with zuul earlier on
[13:26:39] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[13:39:02] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[13:45:48] Project-Admins: Create project tag for <#DSE-K8S> - https://phabricator.wikimedia.org/T309095 (JArguello-WMF) Hi @Aklapper !
Is there any other information we need to provide for the project tag? Thank you very much for your help.
[13:58:22] Project-Admins: Create project tag for <#DSE-K8S> - https://phabricator.wikimedia.org/T309095 (Aklapper) Open→Resolved a:Aklapper Hi, requested public project #DSE-Kubernetes-Cluster has been created: https://phabricator.wikimedia.org/project/view/5959/ (In case you need to edit the project or p...
[13:58:39] Project-Admins: Create project tag for DSE-Kubernetes-Cluster (DSE-K8S) - https://phabricator.wikimedia.org/T309095 (Aklapper)
[14:03:33] Project-Admins: Create project tag for DSE-Kubernetes-Cluster (DSE-K8S) - https://phabricator.wikimedia.org/T309095 (JArguello-WMF) Thank you so much for your help @Aklapper !
[15:34:31] Continuous-Integration-Config, Release-Engineering-Team (Seen), Wikidata, Wikidata Query UI, and 2 others: Update wikidata-query-gui-build job from Node 12 to Node 14 - https://phabricator.wikimedia.org/T308579 (Lucas_Werkmeister_WMDE) Indeed, [build #45](https://integration.wikimedia.org/ci/job/...
[15:39:47] Gerrit, Wikidata, Wikidata Query UI, wdwb-tech: wikidata-query-gui-build doesn’t work when latest commit is by dependabot (commit-msg hook adds Change-Id in wrong place) - https://phabricator.wikimedia.org/T295601 (Lucas_Werkmeister_WMDE) >>! In T295601#7500334, @Lucas_Werkmeister_WMDE wrote: > T...
[15:42:07] Release-Engineering-Team (Priority Backlog 📥), Patch-For-Review, Release, Train Deployments: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 (dancy) Open→Resolved
[16:05:30] Release-Engineering-Team (🌱 Spring Cleaning — April 2022), MW-on-K8s, serviceops, Patch-For-Review: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (dancy) >>! In T299648#7907880, @dancy wrote: > @Joe Regarding https://gerrit.wikimedia.o...
[16:19:29] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 (Jdlrobson)
[16:35:46] Release-Engineering-Team (🌱 Spring Cleaning — April 2022): Delete wmf branches from Gerrit repositories - https://phabricator.wikimedia.org/T303828 (Krinkle) I currently have the following aliases ([dotfiles repo](https://github.com/Krinkle/dotfiles/blob/v2022.05/gitconfig#L51-L70)): ` # Wildcard deletion...
[16:42:40] hasharAway: want to collab next week and finish T247653 ?
[16:42:41] T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653
[16:42:52] (to unbreak OOUI demos which need php 7.2+)
[16:46:14] Continuous-Integration-Infrastructure, OOUI: Demos page for OOUI in php is broken - https://phabricator.wikimedia.org/T297035 (Krinkle) a:Krinkle
[17:25:58] Krinkle: that one is overdue indeed. Thursday would work for me
[18:00:50] Okay!
[18:22:06] Release-Engineering-Team, Gerrit-Privilege-Requests: Request for Gerrit Managers permissions for karapayneWMDE - https://phabricator.wikimedia.org/T302262 (Majavah) Pinging @QChris who's been taking care of most repository requests. I believe the diffusion/github mirrors need to be created manually and t...
[18:34:44] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 (Jdlrobson)
[18:50:16] (CR) Krinkle: [C: +2] Look for mw:moduleStyles meta tag in Parsoid output as well [integration/visualdiff] - https://gerrit.wikimedia.org/r/794774 (owner: Subramanya Sastry)
[18:51:24] (Merged) jenkins-bot: Look for mw:moduleStyles meta tag in Parsoid output as well [integration/visualdiff] - https://gerrit.wikimedia.org/r/794774 (owner: Subramanya Sastry)
[18:51:45] (Merged) jenkins-bot: Fix arwiki Cite CSS [integration/visualdiff] - https://gerrit.wikimedia.org/r/795721 (owner: Subramanya Sastry)
[20:32:07] Project beta-update-databases-eqiad build #58881: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58881/
[20:32:08] Project beta-code-update-eqiad build #393492: FAILURE in 9 min 7 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393492/
[20:42:58] Beta-Cluster-Infrastructure: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
[20:43:54] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime) p:Triage→High
[20:48:01] TheresNoTime: I can reboot it.
[20:48:13] dancy: if you wouldn't mind :)
[20:48:37] I'm only a "user" (though I'm going to log a task to get that changed now)
[20:49:28] !log Initiated hard reboot of deployment-deploy03.deployment-prep
[20:49:29] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:50:06] ok it's back
[20:50:18] Beta-Cluster-Infrastructure: Grant `Samtar` admin access to the deployment-prep project - https://phabricator.wikimedia.org/T309415 (TheresNoTime)
[20:50:37] dancy: thank you!
:D
[20:50:55] TheresNoTime, I think you could have rebooted it yourself through cumin
[20:51:11] ssh in wasn't working
[20:51:27] ^ :((
[20:51:32] ssh to cumin was working, and pinging deploy03 was also working
[20:51:37] unplug-and-replug may also work :P
[20:51:58] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (dancy) I rebooted using the horizon UI.
[20:51:59] presumably when cumin then tries to ssh to deploy03 it would hang
[20:52:11] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime) Open→Resolved a:dancy @dancy rebooted `deployment-deploy03` and it is now accessible
[20:52:23] maybe, doing it through horizon definitely doesn't hurt
[20:53:00] !log `sudo wmcs-openstack role add --user samtar --project deployment-prep projectadmin` (T309415)
[20:53:02] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:53:02] T309415: Grant `Samtar` admin access to the deployment-prep project - https://phabricator.wikimedia.org/T309415
[20:53:05] hey at least https://github.com/theresnotime/jenkins-watch got a real world test \o/
[20:54:06] https://usercontent.irccloud-cdn.com/file/buJ7cVlg/image.png
[20:54:23] Fancy
[20:54:25] Yippee, build fixed!
[20:54:25] Project beta-code-update-eqiad build #393493: FIXED in 4 min 3 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393493/
[20:54:28] oh and thanks for doing that bd808 :)
[20:55:05] TheresNoTime: now you are obligated to fix all the problems ;)
[20:55:43] have fun with this mess called beta cluster :p
[20:55:47] Beta-Cluster-Infrastructure, User-bd808: Grant `Samtar` admin access to the deployment-prep project - https://phabricator.wikimedia.org/T309415 (TheresNoTime) Open→Resolved a:bd808
[20:55:56] Project beta-scap-sync-world build #52918: FAILURE in 1 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52918/
[20:57:41] zabe: I am cursed with finding everything "quite interesting", so spend my time jumping between {things}
[20:57:45] :P
[20:58:40] Project beta-scap-sync-world build #52919: STILL FAILING in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52919/
[20:59:42] aw heck, "Load key "/etc/keyholder.d/mwdeploy.pub": invalid format" --> "mwdeploy@deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud: Permission denied (publickey)."
[20:59:57] ^ for the sync :/
[21:00:56] Hmm.. looks like someone needs to arm the keyholder.
[21:01:06] I don't know who usually does that.
[21:01:46] I tried running `keyholder arm` but it wants a passphrase
[21:05:29] dancy: hmmm... I thought the password was in ~root, but I'm not seeing it there.
[21:05:49] Project beta-scap-sync-world build #52920: STILL FAILING in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52920/
[21:06:59] the password is in puppetmaster, let me run keyholder arm
[21:08:13] zabe: have done!
[21:08:42] (or followed https://wikitech.wikimedia.org/wiki/Keyholder at least)
[21:09:33] I ran 'sudo keyholder arm' and typed in all keys, maybe we have done it twice now
[21:10:24] Yippee, build fixed!
[21:10:24] Project beta-scap-sync-world build #52921: FIXED in 1 min 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52921/
[21:10:50] !log zabe@deployment-deploy03:~$ sudo keyholder arm
[21:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:11:00] well no harm no foul, at least it's working \o/
[21:11:01] sigh. Honestly deployment-prep's keyholder passwords should all be on that wiki page directly. and they should all be the same.
[21:11:53] in the long ago they were all passphrase free, but I think something was changed in keyholder that made it barf on that (insecure private keys)
[21:14:16] as said, it's a mess ¯\_(ツ)_/¯
[21:15:49] I will wander back to 'real' work before I end up trying to fix all the busted windows and really just making things worse ;)
[21:17:39] I once spent some time on making a bunch of them have the same passphrase .. in production.
[21:18:09] seems like you found they are not the same in beta but somewhere in /var/lib/git/labs/private/files/ssh/tin/ on the beta puppetmaster
[21:18:29] also that "tin" in there is not a thing anymore. that was the name of the prod deployment server many years ago
[21:19:13] I think we had 'deployment-tin' in beta for about 3 years after tin was decommed in prod :)
[21:19:58] heh, yea. we shouldn't even use the numbers in the hostnames I guess. so more like "deployment-deploy" :p
[21:20:16] ^^ yes, that was where I got the passphrases from
[21:20:28] ok, good, at least it armed it
[21:20:39] there is some "keyholder status" as well
[21:21:49] see that table on the wiki page? how a lot of them are "deployment-key-passphrase"? once every single line had a different passphrase :P
[21:22:02] imagine that.. 20 different passwords until you got them all armed
[21:22:09] o_o
[21:22:28] just saying ..
prod unified them..so beta can too
[21:22:41] well..mostly unified
[21:24:22] Project beta-update-databases-eqiad build #58882: STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58882/
[21:24:23] Project beta-code-update-eqiad build #393497: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393497/
[21:25:09] huh
[21:25:38] same issue
[21:26:01] zabe: all yours
[21:28:08] I tried cumin out of curiosity, but it is actually not working
[21:29:46] o.o it *seems* to just be `deploy03`? Is it worth trashing & rebuilding it?
[21:29:59] it's "just" a deployment host right..?
[21:30:06] yes
[21:30:29] there will probably be puppet errors when you apply the role to a fresh instance
[21:30:42] but only one way to find out
[21:31:20] if quota doesn't get in your way.. I would first make the new one before touching an old one
[21:31:42] that way you have something to compare to
[21:33:16] Beta-Cluster-Infrastructure: Grant Zabe admin access to deployment-prep - https://phabricator.wikimedia.org/T309419 (Zabe)
[21:34:19] Oh I thought you already were a member zabe !
[21:34:34] I don't have access to horizon
[21:34:57] so yeah, I am a member, but not an admin
[21:37:53] I'm trying to connect to deploy03 but after I entered my ssh key nothing is happening, and when I tried to manually ping the server I got 3 time outs and a destination net unreachable
[21:38:03] is this being worked on?
[21:38:16] I am now looking at it
[21:38:44] it will probably work if you reboot the instance
[21:38:58] try 'soft reboot'
[21:40:41] zabe: you should have horizon access with just your wikitech user.
you may have to select the deployment-prep project from a dropdown though to switch context
[21:42:32] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime) Resolved→Open a:dancy→TheresNoTime Issue repeated, looking at it now
[21:42:44] ah, yes
[21:43:13] but it's in a read-only mode if I see it correctly
[21:44:19] !log `sudo wmcs-openstack role add --user zabe --project deployment-prep projectadmin` (T309419)
[21:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:44:22] T309419: Grant Zabe admin access to deployment-prep - https://phabricator.wikimedia.org/T309419
[21:44:35] bd808, thanks :)
[21:44:45] !log hard rebooted deployment-deploy03 as soft reboot unresponsive
[21:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:45:07] zabe: np. I think you will need to log out/login to horizon for it to see your new powers
[21:45:28] Beta-Cluster-Infrastructure, User-bd808: Grant Zabe admin access to deployment-prep - https://phabricator.wikimedia.org/T309419 (bd808) Open→Resolved a:bd808
[21:45:39] yep
[21:47:23] Still unable to SSH into it - going to try a "rebuild", agree?
[21:47:25] Project beta-update-databases-eqiad build #58883: STILL FAILING in 3 min 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58883/
[21:48:02] or would shutting it down, and spinning up a fresh VM with the puppet roles be smarter, to keep that old VM there?
[21:48:33] Yippee, build fixed!
[21:48:33] Project beta-code-update-eqiad build #393498: FIXED in 4 min 31 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393498/
[21:48:44] ...
[21:48:52] I can ssh now
[21:50:00] Project beta-scap-sync-world build #52923: FAILURE in 1 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52923/
[21:50:06] afair it was "admins can create instances" and members can't. but members can still ssh to instances?
[21:50:41] ah yep, can SSH in now \o/ still, twice this has happened, probably still worth rebuilding/spinning up a new instance ...?
[21:50:45] mutante, yes, basically members have full root access to the hosts but can't manage them through horizon
[21:50:52] ^ that
[21:52:21] TheresNoTime, if it is not happening again (let's hope), I would leave it. If it is happening again, we can try that, but then please create a task for that ;)
[21:52:45] sounds good :)
[21:55:00] "May 27 21:47:21 deployment-deploy03 php: PHP Fatal error: Out of memory (allocated 7487094784) (tried to allocate 20480 bytes) in /srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/ZObjectFactory.php on line 158" -- Not sure what's going on there on the deploy server but maybe related to it locking up/getting slow
[21:56:32] maybe try restarting the php-fpm service
[21:57:15] `root ttyS0 Fri May 27 20:49 - crash (00:53)` is in the `last -5 reboot shutdown root`
[21:59:10] Project beta-scap-sync-world build #52924: STILL FAILING in 3 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52924/
[21:59:56] I see both php-fpm7.2 and php-fpm7.4 processes running and have no idea why that would be. Or honestly why a deployment server is running php-fpm at all.
[22:01:32] there is no php-fpm at all running on deploy1002.eqiad.wmnet
[22:02:29] eh, yea, one with the version number but I wasn't sure what version it runs
[22:02:37] both at the same time..sounds wrong
[22:03:57] true, prod deploy server does not run it..
but does have php packages installed
[22:04:29] https://gerrit.wikimedia.org/g/operations/puppet/+/fddf4a9a7d104eb05edf65fccf0de3c5b5ec700c/hieradata/cloud/eqiad1/deployment-prep/common.yaml#115
[22:04:36] ^ it's explicitly enabled there
[22:05:30] and I think it's not explicitly disabled for the deployment host in deployment-prep
[22:05:32] the problem seems to be that it's in common
[22:05:34] yeah, php needs to be installed to run some parts of scap but I don't know why there would be a web php service on a deployment box. But this is back in the rabbit hole I said I would not dive down
[22:05:39] which enables it for every instance in the project
[22:05:45] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (TheresNoTime) p:High→Triage a:TheresNoTime→None
[22:05:46] Project beta-scap-sync-world build #52925: STILL FAILING in 53 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52925/
[22:06:25] my guess is at some point in the past deployment servers included the same base as appservers and then it changed
[22:07:34] or it never did but then this should have never been put in the "common.yaml". hiera should be role based..and if that doesn't work in beta then prefix based
[22:07:56] TheresNoTime, you wanna run keyholder arm, or should I?
[22:08:07] zabe: just started doing it
[22:08:12] ok :)
[22:09:18] !log samtar@deployment-deploy03:~$ sudo keyholder arm
[22:09:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:10:35] yeah, deployment-prep puppet configuration needs some love (like the rest of the deployment-prep infrastructure)
[22:21:00] Yippee, build fixed!
[22:21:00] Project beta-scap-sync-world build #52926: FIXED in 6 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52926/
[22:23:19] Project beta-update-databases-eqiad build #58884: STILL FAILING in 3 min 19 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58884/
[22:32:13] zabe: OOMing, trying to cancel that database job
[22:32:45] Project beta-update-databases-eqiad build #58885: ABORTED in 4 min 48 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58885/
[22:32:49] PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
[22:32:49] 43 root 20 0 0 0 0 S 100.0 0.0 0:21.99 kswapd0
[22:32:49] 32551 www-data 20 0 7422512 6.8g 0 D 16.7 87.5 0:23.69 php
[22:36:39] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (TheresNoTime) While running a step of `beta-update-databases-eqiad`, we go OOM and unresponsive: ` PID USER PR NI VIRT RES SHR S %CPU %MEM...
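The out-of-memory diagnosis above rests on two pieces of evidence: the `top` rows pasted at 22:32:49 and the PHP fatal error quoted at 21:55:00. As a rough sketch of how both can be read programmatically (the helper names are made up for illustration, and the column layout is assumed to be the default `top` process view shown in the log):

```python
import re

TOP_COLUMNS = ["PID", "USER", "PR", "NI", "VIRT", "RES", "SHR", "S",
               "%CPU", "%MEM", "TIME+", "COMMAND"]
TIMESTAMP_RE = re.compile(r"^\[\d\d:\d\d:\d\d\]\s*")
OOM_RE = re.compile(r"Out of memory \(allocated (\d+)\)")


def parse_top_rows(lines):
    """Turn top(1) process rows (optionally IRC-timestamped) into dicts."""
    rows = []
    for line in lines:
        line = TIMESTAMP_RE.sub("", line)
        fields = line.split(None, len(TOP_COLUMNS) - 1)
        if len(fields) != len(TOP_COLUMNS) or not fields[0].isdigit():
            continue  # header line or something that is not a process row
        row = dict(zip(TOP_COLUMNS, fields))
        row["%CPU"] = float(row["%CPU"])
        row["%MEM"] = float(row["%MEM"])
        rows.append(row)
    return rows


def memory_hogs(rows, threshold=50.0):
    """Processes holding at least `threshold` percent of physical memory."""
    return [row for row in rows if row["%MEM"] >= threshold]


def php_oom_allocated_gib(log_line):
    """Bytes reported by a PHP 'Out of memory' fatal error, in GiB."""
    match = OOM_RE.search(log_line)
    return int(match.group(1)) / 2**30 if match else None
```

Fed the three pasted lines, `memory_hogs` singles out the `php` process owned by `www-data`, and the allocation in the fatal error works out to just under 7 GiB, which is in line with the 6.8g resident size `top` reports. The `kswapd0` row at 100% CPU is what a host looks like while the kernel is busy trying to reclaim memory.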
[22:38:33] Project beta-code-update-eqiad build #393503: ABORTED in 5 min 33 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393503/
[22:42:50] There is this wikilambda patch, which might cause the database update to lag: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/798987
[22:42:51] but not sure
[22:49:02] !log manually running database update script: samtar@deployment-deploy03:~$ /usr/local/bin/wmf-beta-update-databases.py
[22:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:49:21] Project beta-code-update-eqiad build #393504: ABORTED in 37 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393504/
[22:53:10] Project beta-code-update-eqiad build #393505: ABORTED in 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393505/
[22:55:52] !log zabe@deployment-mwmaint02:~$ mwscript extensions/WikiLambda/maintenance/updateTypedLists.php --wiki=wikifunctionswiki --db # started ~20 min ago
[22:55:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:58:08] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (TheresNoTime) a:TheresNoTime
[23:01:00] Okay that database update worked, but took a long time
[23:01:49] yeah I manually ran that migration script
[23:02:16] you wanna try kicking beta-update-databases-eqiad
[23:02:17] ?
[23:02:50] going to let https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/393506/console go through first
[23:07:02] running https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/58886/console
[23:13:20] If it is not going to work, we probably need to create a task, because somehow wikilambda seems to always check all z objects, whether they are migrated, and if not migrate them. That takes ages.
[23:15:56] Yippee, build fixed!
[23:15:57] Project beta-update-databases-eqiad build #58886: FIXED in 9 min 9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58886/
[23:15:58] well I don't want to jinx it, but it appears to have gotten further than it did before
[23:16:00] oooh
[23:17:16] nice, the wikilambda thing "only" needed ~8 min to check all items, which seems to be just fine. And it does it without running out of memory.
[23:21:49] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (Zabe) FTR, it seems like beta-update-databases-eqiad was running out of memory while trying to perform the migration added in https://gerrit.wikimedia.org/r/c/med...
[23:36:10] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
[23:38:12] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
[23:40:05] Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
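The Jenkins notifications throughout this log follow one shape: `Project <job> build #<n>: <STATUS> in <duration>: <url>`. A small sketch of reducing such a log to the last known state of each job is given below. It is not how jenkins-watch (mentioned at 20:53:05) works, which this log does not describe; the function names are illustrative, and the optional one or two digits before the status are assumed to be leftover IRC color codes.

```python
import re

NOTIFICATION_RE = re.compile(
    r"Project (?P<job>\S+) build #(?P<build>\d+): "
    r"\d{0,2}(?P<status>[A-Z ]+?) in (?P<duration>.+?): (?P<url>https?://\S+)"
)


def latest_status(lines):
    """Map each Jenkins job to the status of its most recent notification."""
    state = {}
    for line in lines:
        match = NOTIFICATION_RE.search(line)
        if match:
            state[match.group("job")] = match.group("status")
    return state


def not_fixed(lines):
    """Jobs whose most recent notification was anything other than FIXED."""
    return sorted(job for job, status in latest_status(lines).items()
                  if status != "FIXED")
```

Run over the whole day, `not_fixed` would come back empty, since every job that failed ends on a FIXED notification, the last one being beta-update-databases-eqiad at 23:15:57.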