[00:09:18] <ori>	 Krinkle: ooh I forgot about that
[00:10:21] * ori found https://wikitech.wikimedia.org/wiki/Performance/Runbook/Puppet_patches#Beta_Cluster_testing
[00:11:02] <Krinkle>	 That's it :)
[00:12:14] <ori>	 thank you!
[01:11:04] <DannyS712>	 when I try to ssh to the beta cluster, I'm getting warnings about possible DNS spoofing - how do I check if I am getting the right key?
[01:14:01] <AntiComposite>	 https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints should have it
[01:43:41] <TheresNoTime>	 https://github.com/grafana/loki ooooooooh
[02:16:55] <DannyS712>	 AntiComposite thanks. But primary.bastion.wmflabs.org isn't listed there
[02:17:18] <DannyS712>	 wait that's the same as just .wmcloud.org
[02:17:29] <DannyS712>	 (in terms of the ECDSA key I'm getting)
[02:20:42] <DannyS712>	 but then it asks about the fingerprint for deployment-deploy03.deployment-prep.eqiad1.wikimedia.cloud (which is what I'm trying to connect to) with  SHA256:52RYyM81OIrUEot/L2i9FtkFxoEyhikIMRwSLXL7+N8 but I don't see that host listed on the page
[03:24:25] <icinga-wm>	 PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:58:25] <wikibugs>	 Continuous-Integration-Infrastructure, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (DannyS712)
[04:17:35] <wikibugs>	 Continuous-Integration-Infrastructure, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime) p:Triage→Unbreak! Looks like {T308943} again..? Raising to UBN 😣
[04:18:14] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime)
[04:18:29] <icinga-wm>	 RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:25:09] <TheresNoTime>	 re T309371, need someone to restart zuul per https://phabricator.wikimedia.org/T308943#7947453
[04:25:09] <stashbot>	 T309371: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371
[05:53:46] <taavi>	 DannyS712: you can get the deployment-prep fingerprints from https://config-master.wikimedia.beta.wmflabs.org/
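For reference, the `SHA256:...` string that ssh shows in its prompt can be recomputed from a host's public key line, which makes it possible to compare against whatever a trusted page publishes. A minimal sketch, assuming the key is in the usual OpenSSH `type base64-blob [comment]` form; the function name `sha256_fingerprint` and the sample key blob are made up for illustration and are not the real deployment-deploy03 key.

```python
import base64
import hashlib

def sha256_fingerprint(pubkey_line: str) -> str:
    """Return the OpenSSH-style SHA256 fingerprint for one public key line."""
    # Line format: "<key-type> <base64 blob> [comment]"
    blob_b64 = pubkey_line.split()[1]
    blob = base64.b64decode(blob_b64)
    digest = hashlib.sha256(blob).digest()
    # OpenSSH prints the digest as base64 with the trailing '=' padding removed
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")

# Hypothetical key blob, for illustration only (not a real host key)
example = "ssh-ed25519 " + base64.b64encode(b"not-a-real-key").decode("ascii") + " example"
print(sha256_fingerprint(example))
```

The result always has 43 characters after the `SHA256:` prefix, the same length as the fingerprint quoted above for deployment-deploy03.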
[06:15:59] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (elukey) On contint1001 I see the following in `/var/log/zuul/merger-debug.log`:  ` 2022-05-27 04:36:23,233 DEBUG zuul...
[06:24:47] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (Majavah) Indeed looks like the same issue as last time. The [[ https://logstash.wikimedia.org/app/dashboards#/view/AW...
[06:41:24] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime)
[06:41:40] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime)
[06:41:43] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime)
[06:41:59] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime)
[06:42:03] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team: CI fails with 'This change or one of its cross-repo dependencies was unable to be automatically merged' for a lot of repos - https://phabricator.wikimedia.org/T308943 (TheresNoTime)
[07:27:30] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (SLyngshede-WMF) I've restarted Zuul on contint2001, and that seems to have helped a bit.   The Zuul service on contin...
[07:34:20] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (TheresNoTime) p:Unbreak!→Triage Thanks @SLyngshede-WMF! That seems to have sorted it 😄 (//no longer UBN//)
[07:43:39] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (SLyngshede-WMF)
[07:43:44] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team, User-DannyS712: Gerrit: all patches are being reported as merge conflicts - https://phabricator.wikimedia.org/T309371 (SLyngshede-WMF) Open→Resolved a:SLyngshede-WMF
[08:50:57] <elukey>	 hi folks!
[08:51:12] <elukey>	 is there a quick way to force https://gerrit.wikimedia.org/r/c/machinelearning/liftwing/inference-services/+/800025 to publish the docker image to the registry?
[08:51:21] <elukey>	 Or should I manually run the jenkins job?
[08:56:09] <TheresNoTime>	 elukey: ah that was affected by the gerrit bug? maybe try a "rebuild" on https://integration.wikimedia.org/ci/job/trigger-inference-services-pipeline-articlequality/87/console ?
[09:02:45] <TheresNoTime>	 (that is the sum total of my suggestions :-P)
[09:08:02] <wikibugs>	 Continuous-Integration-Infrastructure, Release-Engineering-Team: gerrit-bot holding open SSH sessions - https://phabricator.wikimedia.org/T309376 (TheresNoTime) I did a bit of digging into what metrics are available in [[ https://wikitech.wikimedia.org/wiki/Prometheus | Prometheus ]] for this, so an [[ h...
[09:12:07] <elukey>	 TheresNoTime: o/ I already tried but that is not the job that publishes to the docker registry, I'll try to see if I can kick off the right one
[09:18:16] <TheresNoTime>	 good luck! ^^
[10:11:53] <wikibugs>	 GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[10:13:32] <wikibugs>	 GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[11:28:07] <_joe_>	 elukey: just publish a null patch 
[11:29:18] <_joe_>	 the alternative is to log into jenkins, and re-run that jobs
[11:29:20] <_joe_>	 *job
[11:29:37] <_joe_>	 if it even made it to jenkins
[11:39:29] <wikibugs>	 GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[11:45:56] <hasharAway>	 elukey: the publish pipeline apparently failed due to a merge conflict
[11:47:19] <hasharAway>	 Even though the patch got merged by gerrit. It is probably an issue with the zuul merger that handled the request
[11:47:58] <hasharAway>	 I am not there today to investigate, but maybe i will remember about it tonight :)
[12:46:20] <wikibugs>	 GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[12:55:02] <elukey>	 _joe_ yeah I wanted to do it but I was hoping to have something to quickly re-run, rather than gathering parameters for the jenkins job :)
[12:55:35] <elukey>	 hasharAway: thanks! yeah there was an issue with zuul earlier on
[13:26:39] <wikibugs>	 GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[13:39:02] <wikibugs>	 GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Jelto)
[13:45:48] <wikibugs>	 Project-Admins: Create project tag for <#DSE-K8S> - https://phabricator.wikimedia.org/T309095 (JArguello-WMF) Hi @Aklapper ! Is there any other information we need to provide for the project tag? Thank you very much for your help.
[13:58:22] <wikibugs>	 Project-Admins: Create project tag for <#DSE-K8S> - https://phabricator.wikimedia.org/T309095 (Aklapper) Open→Resolved a:Aklapper Hi, requested public project #DSE-Kubernetes-Cluster has been created: https://phabricator.wikimedia.org/project/view/5959/  (In case you need to edit the project or p...
[13:58:39] <wikibugs>	 Project-Admins: Create project tag for DSE-Kubernetes-Cluster (DSE-K8S) - https://phabricator.wikimedia.org/T309095 (Aklapper)
[14:03:33] <wikibugs>	 Project-Admins: Create project tag for DSE-Kubernetes-Cluster (DSE-K8S) - https://phabricator.wikimedia.org/T309095 (JArguello-WMF) Thank you so much for your help @Aklapper !
[15:34:31] <wikibugs>	 Continuous-Integration-Config, Release-Engineering-Team (Seen), Wikidata, Wikidata Query UI, and 2 others: Update wikidata-query-gui-build job from Node 12 to Node 14 - https://phabricator.wikimedia.org/T308579 (Lucas_Werkmeister_WMDE) Indeed, [build #45](https://integration.wikimedia.org/ci/job/...
[15:39:47] <wikibugs>	 Gerrit, Wikidata, Wikidata Query UI, wdwb-tech: wikidata-query-gui-build doesn’t work when latest commit is by dependabot (commit-msg hook adds Change-Id in wrong place) - https://phabricator.wikimedia.org/T295601 (Lucas_Werkmeister_WMDE) >>! In T295601#7500334, @Lucas_Werkmeister_WMDE wrote: > T...
[15:42:07] <wikibugs>	 Release-Engineering-Team (Priority Backlog 📥), Patch-For-Review, Release, Train Deployments: 1.39.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T305219 (dancy) Open→Resolved
[16:05:30] <wikibugs>	 Release-Engineering-Team (🌱 Spring Cleaning — April 2022), MW-on-K8s, serviceops, Patch-For-Review: Make scap deploy to kubernetes together with the legacy systems - https://phabricator.wikimedia.org/T299648 (dancy) >>! In T299648#7907880, @dancy wrote: > @Joe Regarding https://gerrit.wikimedia.o...
[16:19:29] <wikibugs>	 Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 (Jdlrobson)
[16:35:46] <wikibugs>	 Release-Engineering-Team (🌱 Spring Cleaning — April 2022): Delete wmf branches from Gerrit repositories - https://phabricator.wikimedia.org/T303828 (Krinkle) I currently have the following aliases ([dotfiles repo](https://github.com/Krinkle/dotfiles/blob/v2022.05/gitconfig#L51-L70)):  `  # Wildcard deletion...
[16:42:40] <Krinkle>	 hasharAway: want to collab next week and finish T247653 ?
[16:42:41] <stashbot>	 T247653: replace doc1001.eqiad.wmnet with a buster VM and create the codfw equivalent - https://phabricator.wikimedia.org/T247653
[16:42:52] <Krinkle>	 (to unbreak OOUI demos which need php 7.2+)
[16:46:14] <wikibugs>	 Continuous-Integration-Infrastructure, OOUI: Demos page for OOUI in php is broken - https://phabricator.wikimedia.org/T297035 (Krinkle) a:Krinkle
[17:25:58] <hasharAway>	 Krinkle: that one is overdue indeed. Thursday would work for me
[18:00:50] <Krinkle>	 Okay!
[18:22:06] <wikibugs>	 Release-Engineering-Team, Gerrit-Privilege-Requests: Request for Gerrit Managers permissions for karapayneWMDE - https://phabricator.wikimedia.org/T302262 (Majavah) Pinging @QChris who's been taking care of most repository requests. I believe the diffusion/github mirrors need to be created manually and t...
[18:34:44] <wikibugs>	 Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 (Jdlrobson)
[18:50:16] <wikibugs>	 (CR) Krinkle: [C: +2] Look for mw:moduleStyles meta tag in Parsoid output as well [integration/visualdiff] - https://gerrit.wikimedia.org/r/794774 (owner: Subramanya Sastry)
[18:51:24] <wikibugs>	 (Merged) jenkins-bot: Look for mw:moduleStyles meta tag in Parsoid output as well [integration/visualdiff] - https://gerrit.wikimedia.org/r/794774 (owner: Subramanya Sastry)
[18:51:45] <wikibugs>	 (Merged) jenkins-bot: Fix arwiki Cite CSS [integration/visualdiff] - https://gerrit.wikimedia.org/r/795721 (owner: Subramanya Sastry)
[20:32:07] <wmf-insecte>	 Project beta-update-databases-eqiad build #58881: FAILURE in 12 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58881/
[20:32:08] <wmf-insecte>	 Project beta-code-update-eqiad build #393492: FAILURE in 9 min 7 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393492/
[20:42:58] <wikibugs>	 Beta-Cluster-Infrastructure: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
[20:43:54] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime) p:Triage→High
[20:48:01] <dancy>	 TheresNoTime:  I can reboot it.
[20:48:13] <TheresNoTime>	 dancy: if you wouldn't mind :)
[20:48:37] <TheresNoTime>	 I'm only a "user" (though I'm going to log a task to get that changed now)
[20:49:28] <dancy>	 !log Initiated hard reboot of deployment-deploy03.deployment-prep
[20:49:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:50:06] <dancy>	 ok it's back
[20:50:18] <wikibugs>	 Beta-Cluster-Infrastructure: Grant `Samtar` admin access to the deployment-prep project - https://phabricator.wikimedia.org/T309415 (TheresNoTime)
[20:50:37] <TheresNoTime>	 dancy: thank you! :D
[20:50:55] <zabe>	 TheresNoTime, I think you could have rebooted it yourself through cumin
[20:51:11] <dancy>	 ssh in wasn't working
[20:51:27] <TheresNoTime>	 ^ :((
[20:51:32] <zabe>	 ssh to cumin was working, and pinging deploy03 was also working
[20:51:37] <hauskatze>	 unplug-and-replug may also work :P
[20:51:58] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (dancy) I rebooted using the horizon UI.
[20:51:59] <dancy>	 presumably when cumin then tries to ssh to deploy03 it would hang
[20:52:11] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime) Open→Resolved a:dancy @dancy rebooted `deployment-deploy03` and it is now accessible
[20:52:23] <zabe>	 maybe, doing it through horizon definitely doesn't hurt
[20:53:00] <bd808>	 !log `sudo wmcs-openstack role add --user samtar --project deployment-prep projectadmin` (T309415)
[20:53:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[20:53:02] <stashbot>	 T309415: Grant `Samtar` admin access to the deployment-prep project - https://phabricator.wikimedia.org/T309415
[20:53:05] <TheresNoTime>	 hey at least https://github.com/theresnotime/jenkins-watch got a real world test \o/
[20:54:06] <TheresNoTime>	 https://usercontent.irccloud-cdn.com/file/buJ7cVlg/image.png
[20:54:23] <dancy>	 Fancy
[20:54:25] <wmf-insecte>	 Yippee, build fixed!
[20:54:25] <wmf-insecte>	 Project beta-code-update-eqiad build #393493: FIXED in 4 min 3 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393493/
[20:54:28] <TheresNoTime>	 oh and thanks for doing that bd808 :)
[20:55:05] <bd808>	 TheresNoTime: now you are obligated to fix all the problems ;)
[20:55:43] <zabe>	 have fun with this mess called beta cluster :p
[20:55:47] <wikibugs>	 Beta-Cluster-Infrastructure, User-bd808: Grant `Samtar` admin access to the deployment-prep project - https://phabricator.wikimedia.org/T309415 (TheresNoTime) Open→Resolved a:bd808
[20:55:56] <wmf-insecte>	 Project beta-scap-sync-world build #52918: FAILURE in 1 min 15 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52918/
[20:57:41] <TheresNoTime>	 zabe: I am cursed with finding everything "quite interesting", so spend my time jumping between {things}
[20:57:45] <TheresNoTime>	 :P
[20:58:40] <wmf-insecte>	 Project beta-scap-sync-world build #52919: STILL FAILING in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52919/
[20:59:42] <TheresNoTime>	 aw heck, "Load key "/etc/keyholder.d/mwdeploy.pub": invalid format" --> "mwdeploy@deployment-jobrunner04.deployment-prep.eqiad1.wikimedia.cloud: Permission denied (publickey)."
[20:59:57] <TheresNoTime>	 ^ for the sync :/
[21:00:56] <dancy>	 Hmm.. looks like someone needs to arm the keyholder.  
[21:01:06] <dancy>	 I don't know who usually does that.
[21:01:46] <dancy>	 I tried running `keyholder arm` but it wants a passphrase
[21:05:29] <bd808>	 dancy: hmmm... I thought the password was in ~root, but I'm not seeing it there.
[21:05:49] <wmf-insecte>	 Project beta-scap-sync-world build #52920: STILL FAILING in 52 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52920/
[21:06:59] <zabe>	 the password is in puppetmaster, let me run keyholder arm
[21:08:13] <TheresNoTime>	 zabe: have done!
[21:08:42] <TheresNoTime>	 (or followed https://wikitech.wikimedia.org/wiki/Keyholder at least)
[21:09:33] <zabe>	 I ran 'sudo keyholder arm' and typed in all keys, maybe we have done it twice now 
[21:10:24] <wmf-insecte>	 Yippee, build fixed!
[21:10:24] <wmf-insecte>	 Project beta-scap-sync-world build #52921: FIXED in 1 min 18 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52921/
[21:10:50] <zabe>	 !log zabe@deployment-deploy03:~$ sudo keyholder arm
[21:10:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:11:00] <TheresNoTime>	 well no harm no foul, at least it's working \o/
[21:11:01] <bd808>	 sigh. Honestly deployment-prep's keyholder passwords should all be on that wiki page directly. and they should all be the same.
[21:11:53] <bd808>	 in the long ago they were all passphrase free, but I think something was changed in keyholder that made it barf on that (insecure private keys)
[21:14:16] <zabe>	 as said, it's a mess ¯\_(ツ)_/¯
[21:15:49] <bd808>	 I will wander back to 'real' work before I end up trying to fix all the busted windows and really just making things worse ;)
[21:17:39] <mutante>	 I once spent some time on making a bunch of them have the same passphrase .. in production.
[21:18:09] <mutante>	 seems like you found they are not the same in beta but somewhere in /var/lib/git/labs/private/files/ssh/tin/  on the beta puppetmaster
[21:18:29] <mutante>	 also that "tin" in there is not a thing anymore. that was the name of the prod deployment server many years ago
[21:19:13] <bd808>	 I think we had 'deployment-tin' in beta for about 3 years after tin was decommed in prod :)
[21:19:58] <mutante>	 heh, yea. we shouldn't even use the numbers in the hostnames I guess. so more like "deployment-deploy" :p
[21:20:16] <zabe>	 ^^ yes, that was where I got the passphrases from
[21:20:28] <mutante>	 ok, good, at least it armed it
[21:20:39] <mutante>	 there is some "keyholder status" as well
[21:21:49] <mutante>	 see that table on the wiki page? how a lot of them are "deployment-key-passphrase"? once every single line had a different passphrase :P
[21:22:02] <mutante>	 imagine that.. 20 different passwords until you got them all armed
[21:22:09] <TheresNoTime>	 o_o
[21:22:28] <mutante>	 just saying .. prod unified them..so beta can too
[21:22:41] <mutante>	 well..mostly unified
[21:24:22] <wmf-insecte>	 Project beta-update-databases-eqiad build #58882: STILL FAILING in 11 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58882/
[21:24:23] <wmf-insecte>	 Project beta-code-update-eqiad build #393497: FAILURE in 1 min 21 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393497/
[21:25:09] <zabe>	 huh
[21:25:38] <TheresNoTime>	 same issue
[21:26:01] <TheresNoTime>	 zabe: all yours
[21:28:08] <zabe>	 I tried cumin out of curiosity, but it is actually not working
[21:29:46] <TheresNoTime>	 o.o it *seems* to just be `deploy03`? Is it worth trashing & rebuilding it?
[21:29:59] <TheresNoTime>	 it's "just" a deployment host right..?
[21:30:06] <zabe>	 yes
[21:30:29] <mutante>	 there will probably be puppet errors when you apply the role to a fresh instance
[21:30:42] <mutante>	 but only one way to find out
[21:31:20] <mutante>	 if quota doesn't get in your way.. I would first make the new one before touching an old one
[21:31:42] <mutante>	 that way you have something to compare to
[21:33:16] <wikibugs>	 Beta-Cluster-Infrastructure: Grant Zabe admin access to deployment-prep - https://phabricator.wikimedia.org/T309419 (Zabe)
[21:34:19] <TheresNoTime>	 Oh I thought you already were a member zabe !
[21:34:34] <zabe>	 I don't have access to horizon
[21:34:57] <zabe>	 so yeah, I am a member, but not an admin
[21:37:53] <DannyS712>	 I'm trying to connect to deploy03 but after I entered my ssh key nothing is happening, and when I tried to manually ping the server I got 3 time outs and a destination net unreachable
[21:38:03] <DannyS712>	 is this being worked on?
[21:38:16] <TheresNoTime>	 I am now looking at it
[21:38:44] <mutante>	 it will probably work if you reboot the instance
[21:38:58] <mutante>	 try 'soft reboot'
[21:40:41] <mutante>	 zabe: you should have horizon access with just your wikitech user. you may have to select the deployment-prep project from a dropdown though to switch context
[21:42:32] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 unresponsive - https://phabricator.wikimedia.org/T309413 (TheresNoTime) Resolved→Open a:dancy→TheresNoTime Issue repeated, looking at it now
[21:42:44] <zabe>	 ah, yes
[21:43:13] <zabe>	 but it's in a read-only mode if I see it correctly
[21:44:19] <bd808>	 !log `sudo wmcs-openstack role add --user zabe --project deployment-prep projectadmin` (T309419)
[21:44:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:44:22] <stashbot>	 T309419: Grant Zabe admin access to deployment-prep - https://phabricator.wikimedia.org/T309419
[21:44:35] <zabe>	 bd808, thanks :)
[21:44:45] <TheresNoTime>	 !log hard rebooted deployment-deploy03 as soft reboot unresponsive
[21:44:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:45:07] <bd808>	 zabe: np. I think you will need to log out/login to horizon for it to see your new powers
[21:45:28] <wikibugs>	 Beta-Cluster-Infrastructure, User-bd808: Grant Zabe admin access to deployment-prep - https://phabricator.wikimedia.org/T309419 (bd808) Open→Resolved a:bd808
[21:45:39] <zabe>	 yep
[21:47:23] <TheresNoTime>	 Still unable to SSH into it - going to try a "rebuild", agree?
[21:47:25] <wmf-insecte>	 Project beta-update-databases-eqiad build #58883: STILL FAILING in 3 min 23 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58883/
[21:48:02] <TheresNoTime>	 or would shutting it down, and spinning up a fresh VM with the puppet roles be smarter, to keep that old VM there?
[21:48:33] <wmf-insecte>	 Yippee, build fixed!
[21:48:33] <wmf-insecte>	 Project beta-code-update-eqiad build #393498: FIXED in 4 min 31 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393498/
[21:48:44] <TheresNoTime>	 ...
[21:48:52] <zabe>	 I can ssh now
[21:50:00] <wmf-insecte>	 Project beta-scap-sync-world build #52923: FAILURE in 1 min 26 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52923/
[21:50:06] <mutante>	 afair it was "admins can create instances" and members can't. but members can still ssh to instances?
[21:50:41] <TheresNoTime>	 ah yep, can SSH in now \o/ still, twice this has happened, probably still worth rebuilding/spinning up a new instance ...?
[21:50:45] <zabe>	 mutante, yes, basically members have full root access to the hosts but can't manage them through horizon
[21:50:52] <bd808>	 ^ that
[21:52:21] <zabe>	 TheresNoTime, if it is not happening again (let's hope), I would leave it. If it is happening again, we can try that, but then please create a task for that ;)
[21:52:45] <TheresNoTime>	 sounds good :)
[21:55:00] <bd808>	 "May 27 21:47:21 deployment-deploy03 php: PHP Fatal error:  Out of memory (allocated 7487094784) (tried to allocate 20480 bytes) in /srv/mediawiki-staging/php-master/extensions/WikiLambda/includes/ZObjectFactory.php on line 158" -- Not sure what's going on there on the deploy server but maybe related to it locking up/getting slow
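To put the numbers in that fatal error into more readable units, the two byte counts can be pulled out of the message and converted. A small sketch; `parse_oom` is a helper name invented here, and the regular expression only assumes the wording quoted in the line above.

```python
import re
from typing import Tuple

LOG_LINE = (
    "PHP Fatal error:  Out of memory (allocated 7487094784) "
    "(tried to allocate 20480 bytes)"
)

def parse_oom(line: str) -> Tuple[int, int]:
    """Return (bytes already allocated, bytes requested) from a PHP OOM message."""
    match = re.search(r"allocated (\d+)\) \(tried to allocate (\d+) bytes\)", line)
    if match is None:
        raise ValueError("not a PHP out-of-memory message")
    return int(match.group(1)), int(match.group(2))

allocated, requested = parse_oom(LOG_LINE)
# 2**30 bytes per GiB, 1024 bytes per KiB
print(f"allocated: {allocated / 2**30:.2f} GiB, requested: {requested / 1024:.0f} KiB")
# prints "allocated: 6.97 GiB, requested: 20 KiB"
```

So the PHP process was already holding roughly 7 GiB when a 20 KiB allocation failed.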
[21:56:32] <mutante>	 maybe try restarting the php-fpm service
[21:57:15] <TheresNoTime>	 `root  ttyS0  Fri May 27 20:49 - crash  (00:53)` is in the `last -5 reboot shutdown root`
[21:59:10] <wmf-insecte>	 Project beta-scap-sync-world build #52924: 04STILL FAILING in 3 min 9 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52924/
[21:59:56] <bd808>	 I see both php-fpm7.2 and php-fpm7.4 processes running and have no idea why that would be. Or honestly why a deployment server is running php-fpm at all.
[22:01:32] <bd808>	 there is no php-fpm at all running on deploy1002.eqiad.wmnet
[22:02:29] <mutante>	 eh, yea, one with the version number but I wasn't sure what version it runs
[22:02:37] <mutante>	 both at the same time..sounds wrong
[22:03:57] <mutante>	 true, prod deploy server does not run it.. but does have php packages installed
[22:04:29] <zabe>	 https://gerrit.wikimedia.org/g/operations/puppet/+/fddf4a9a7d104eb05edf65fccf0de3c5b5ec700c/hieradata/cloud/eqiad1/deployment-prep/common.yaml#115
[22:04:36] <zabe>	 ^ it's explicitly enabled there
[22:05:30] <zabe>	 and I think it's not explicitly disabled for the deployment host in deployment-prep
[22:05:32] <mutante>	 the problem seems to be that it's in common 
[22:05:34] <bd808>	 yeah, php needs to be installed to run some parts of scap but I don't know why there would be a web php service on a deployment box. But this is back in the rabbit hole I said I would not dive down
[22:05:39] <mutante>	 which enables it for every instance in the project
[22:05:45] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (TheresNoTime) p:High→Triage a:TheresNoTime→None
[22:05:46] <wmf-insecte>	 Project beta-scap-sync-world build #52925: STILL FAILING in 53 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52925/
[22:06:25] <mutante>	 my guess is at some point in the past deployment servers included the same base as appservers and then it changed
[22:07:34] <mutante>	 or it never did but then this should have never been put in the "common.yaml". hiera should be role based..and if that doesn't work in beta then prefix based
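The point about `common.yaml` can be illustrated with a toy first-match lookup: a value set in the common layer reaches every host unless a more specific layer overrides it. This is only a sketch of the idea, not Puppet or hiera code; the key name `php_fpm_enabled` and the prefix layer are hypothetical and do not come from the linked file.

```python
# Hypothetical layers and key name, ordered from most to least specific.
LAYERS = [
    ("prefix: deployment-deploy", {"php_fpm_enabled": False}),
    ("common", {"php_fpm_enabled": True}),
]

def layers_for(hostname):
    """Keep the common layer plus any prefix layer matching the hostname."""
    selected = []
    for source, data in LAYERS:
        if source.startswith("prefix: "):
            # A prefix layer applies only to hosts whose name starts with it
            if not hostname.startswith(source[len("prefix: "):]):
                continue
        selected.append((source, data))
    return selected

def lookup(key, layers):
    """Return (value, source) from the first layer that defines key."""
    for source, data in layers:
        if key in data:
            return data[key], source
    raise KeyError(key)

for host in ("deployment-deploy03", "deployment-jobrunner04"):
    print(host, lookup("php_fpm_enabled", layers_for(host)))
```

With these made-up layers, deploy03 picks up the prefix override while jobrunner04 falls through to the common value.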
[22:07:56] <zabe>	 TheresNoTime, you wanna run keyholder arm, or should I?
[22:08:07] <TheresNoTime>	 zabe: just started doing it
[22:08:12] <zabe>	 ok :)
[22:09:18] <TheresNoTime>	 !log samtar@deployment-deploy03:~$ sudo keyholder arm
[22:09:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:10:35] <zabe>	 yeah, deployment-prep puppet configuration needs some love (like the rest of the deployment-prep infrastructure)
[22:21:00] <wmf-insecte>	 Yippee, build fixed!
[22:21:00] <wmf-insecte>	 Project beta-scap-sync-world build #52926: FIXED in 6 min 11 sec: https://integration.wikimedia.org/ci/job/beta-scap-sync-world/52926/
[22:23:19] <wmf-insecte>	 Project beta-update-databases-eqiad build #58884: STILL FAILING in 3 min 19 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58884/
[22:32:13] <TheresNoTime>	 zabe: OOMing, trying to cancel that database job
[22:32:45] <wmf-insecte>	 Project beta-update-databases-eqiad build #58885: ABORTED in 4 min 48 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58885/
[22:32:49] <TheresNoTime>	   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
[22:32:49] <TheresNoTime>	    43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
[22:32:49] <TheresNoTime>	 32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
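The `top` snippet above shows the pattern to look for: `kswapd0` (the kernel's memory reclaim thread) pinned at 100% CPU while one `php` process holds 87.5% of memory. A minimal sketch of picking out the biggest memory user from such a table; `parse_top` is a name invented here and it assumes the column layout pasted above.

```python
TOP_OUTPUT = """\
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   43 root      20   0       0      0      0 S 100.0   0.0   0:21.99 kswapd0
32551 www-data  20   0 7422512   6.8g      0 D  16.7  87.5   0:23.69 php
"""

def parse_top(text):
    """Turn top's process table into a list of dicts keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        # COMMAND is the last column and may contain spaces, so cap the split
        fields = ln.split(None, len(header) - 1)
        rows.append(dict(zip(header, fields)))
    return rows

rows = parse_top(TOP_OUTPUT)
top_mem = max(rows, key=lambda r: float(r["%MEM"]))
print(top_mem["PID"], top_mem["COMMAND"], top_mem["%MEM"])
# prints "32551 php 87.5"
```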
[22:36:39] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (TheresNoTime) While running a step of `beta-update-databases-eqiad`, we go OOM and unresponsive:  `   PID  USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM...
[22:38:33] <wmf-insecte>	 Project beta-code-update-eqiad build #393503: ABORTED in 5 min 33 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393503/
[22:42:50] <zabe>	 There is this wikilambda patch, which might cause the database update to lag: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikiLambda/+/798987
[22:42:51] <zabe>	 but not sure
[22:49:02] <TheresNoTime>	 !log manually running database update script: samtar@deployment-deploy03:~$ /usr/local/bin/wmf-beta-update-databases.py
[22:49:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:49:21] <wmf-insecte>	 Project beta-code-update-eqiad build #393504: ABORTED in 37 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393504/
[22:53:10] <wmf-insecte>	 Project beta-code-update-eqiad build #393505: ABORTED in 10 sec: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/393505/
[22:55:52] <zabe>	 !log zabe@deployment-mwmaint02:~$ mwscript extensions/WikiLambda/maintenance/updateTypedLists.php --wiki=wikifunctionswiki --db # started ~20 min ago
[22:55:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[22:58:08] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (TheresNoTime) a:TheresNoTime
[23:01:00] <TheresNoTime>	 Okay that database update worked, but took a long time
[23:01:49] <zabe>	 yeah I manually ran that migration script
[23:02:16] <zabe>	 you wanna try kicking beta-update-databases-eqiad
[23:02:17] <zabe>	 ?
[23:02:50] <TheresNoTime>	 going to let https://integration.wikimedia.org/ci/view/Beta/job/beta-code-update-eqiad/393506/console go through first
[23:07:02] <TheresNoTime>	 running https://integration.wikimedia.org/ci/view/Beta/job/beta-update-databases-eqiad/58886/console
[23:13:20] <zabe>	 If it is not going to work, we probably need to create a task, because somehow wikilambda seems to always check all z objects to see whether they are migrated, and migrates them if not. That takes ages.
[23:15:56] <wmf-insecte>	 Yippee, build fixed!
[23:15:57] <wmf-insecte>	 Project beta-update-databases-eqiad build #58886: FIXED in 9 min 9 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58886/
[23:15:58] <TheresNoTime>	 well I don't want to jinx it, but it appears to have gotten further than it did before
[23:16:00] <TheresNoTime>	 oooh
[23:17:16] <zabe>	 nice, the wikilambda thing "only" needed ~8 min to check all items, which seems to be just fine. And it does it without running out of memory.
[23:21:49] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 crashed twice - https://phabricator.wikimedia.org/T309413 (Zabe) FTR, it seems like beta-update-databases-eqiad was running out of memory while trying to perform the migration added in https://gerrit.wikimedia.org/r/c/med...
[23:36:10] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
[23:38:12] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (TheresNoTime)
[23:40:05] <wikibugs>	 Beta-Cluster-Infrastructure, Release-Engineering-Team, SRE: deployment-deploy03 ran out of memory twice while trying to perform a WikiLambda db migration - https://phabricator.wikimedia.org/T309413 (TheresNoTime)