[09:19:30] mutante: got it, thank you! but yeah as taavi was saying no additional fw access should be needed for prometheus hosts, if that's not the case we should look into why
[10:25:02] jbond: hmm I got PCC failing for me on a cloud environment over a secret() resource, https://puppet-compiler.wmflabs.org/output/888652/39815/deployment-acme-chief03.deployment-prep.eqiad.wmflabs/prod.deployment-acme-chief03.deployment-prep.eqiad.wmflabs.err
[10:27:51] vgutierrez: did you add it to the private repo?
[10:29:28] vgutierrez: should fix https://gerrit.wikimedia.org/r/c/labs/private/+/891504
[10:30:52] oh.. so that's failing since 2019
[10:31:03] 😅
[10:31:05] cheers jbond
[10:31:07] lol
[10:31:16] lol :)
[10:31:22] np
[10:38:08] Casually fixing a 4-year-old bug lmao
[10:52:19] hmmm
[10:52:39] it seems like the systemd oneshot thingie broke some WMCS puppetmaster timers
[10:53:03] stuff like puppet-git-sync-upstream.timer or remove_old_puppet_reports.timer
[10:58:26] hm I thought we fixed all of those yesterday
[11:04:02] ah, if puppet-git-sync-upstream is broken then the fix would not be applied, of course
[11:08:59] vgutierrez: do you have an example VM where that happens?
[11:09:28] I fixed that on the traffic project puppetmaster
[11:23:10] currently investigating why scap can't run on some hosts, have you run into this before? https://phabricator.wikimedia.org/T330360
[11:24:17] godog: yes, it is happening on multiple clusters
[11:24:37] see also T326668
[11:24:37] T326668: Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668
[11:25:11] ah thank you volans, will link the two and add #sre
[11:57:33] FYI we've just changed the merge policy of the cookbooks repo in gerrit to be 'Rebase-If-Necessary'. That should save you from the need to always rebase your cookbooks patches if someone has touched another cookbook.
The gate-and-submit should prevent patches that will break CI from being merged anyway, and the existing risk of merging two patches that don't work together is already not prevented by the
[11:57:39] current workflow. Let SRE I/F know if you ...
[11:57:42] ... encounter any issue.
[12:00:31] volans: oh yeaaaaah thank you <3
[12:35:11] hashar: some of our puppet runs are failing due to scap trying to sync from deploy1001.eqiad.wmnet. This is due to the .config file in the -cache of those repos to sync containing git_server=deploy1001.eqiad.wmnet. I'm wondering how this file is created (added when creating the repo with scap?) and how I should update it, manually or through a scap command that will update all impacted nodes?
[12:55:10] nfraison: a .config? Do you mean scap/scap.cfg configuration files? if so those are in the deployed git repo and would need to be changed in gerrit/gitlab
[12:55:56] and it looks like scap defaults to `git_server=deploy1001.eqiad.wmnet`
[12:56:07] https://gitlab.wikimedia.org/repos/releng/scap/-/blob/35c7009699fcd974c7691d1942f5d2ba65a866b4/scap/config.py#L70
[12:56:18] but that should be set globally via /etc/scap/scap.cfg
[12:56:40] no I really mean a .config file
[12:56:40] nfraison@stat1004:/srv/deployment/analytics/hdfs-tools/deploy-cache$ grep git_server .config
[12:56:40] git_server: deploy1001.eqiad.wmnet
[12:56:54] Could it not default to deployment.wikimedia.org ?
[12:56:58] fun. I have no idea what it is for
[12:57:03] Or would the key issue on switch break it?
[12:58:49] nfraison: can you paste the full puppet error?
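[Editor's note] The manual fix that comes up later in this log is rewriting the `git_server` key in the cached `.config`. A minimal sketch, demonstrated on a throwaway copy so it is safe to run anywhere; the real file would live at `<repo>-cache/.config` on the target host, per the paste above:

```shell
# demo on a throwaway copy of the cached config; the real file is
# <repo>-cache/.config on the scap target host (see the paste above)
cache=$(mktemp -d)
printf 'git_server: deploy1001.eqiad.wmnet\n' > "$cache/.config"
# point the cache at the current deployment server (hostname assumed)
sed -i 's/deploy1001\.eqiad\.wmnet/deploy1002.eqiad.wmnet/' "$cache/.config"
grep git_server "$cache/.config"
```

Whether editing the file by hand or letting scap regenerate it (e.g. via a refresh/redeploy) is the right long-term fix is exactly what the conversation below is trying to establish.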
[12:59:04] Meanwhile I have looked at this .config in the src code and I think I have my answer: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/deploy.py#L553
[12:59:04] But not fully sure how I should run the --refresh-config
[13:00:19] The puppet error: Feb 22 18:16:35 stat1004 scap[26324]: @cee: {"@version": 1, "type": "scap", "host": "stat1004", "script": "/usr/bin/scap", "user": "analytics-deploy", "levelname": "ERROR", "pathname": "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", "filename": "cli.py", "module": "cli", "stack_info": null, "lineno": 412, "funcName": "_handle_exception", "process": 26324, "message": "deploy-local failed:
[13:00:19] Command 'git fetch --tags --jobs 14 --no-recurse-submodules' failed with exit code 128;\nstdout:\n\nstderr:\nfatal: unable to access 'http://deploy1001.eqiad.wmnet/analytics/hdfs-tools/deploy/.git/': The requested URL returned error: 503\n", "channel": "deploy-local", "@timestamp": "2023-02-22T18:16:35Z"}
[13:02:16] not sure why Puppet is running scap there, that is fun
[13:02:48] I mean, Puppet does run scap when preparing a scap::target but I don't see why it would be rerun once the repo has been cloned once
[13:03:24] I guess you could redeploy it (no idea of the effect) or manually adjust the config
[13:03:52] ack
[13:04:12] given I have no idea what generates that `.config` file
[13:12:22] maybe puppet should run scap with that --refresh-config flag you found
[13:14:09] then again I don't see why Puppet runs `scap deploy-local` given the repo should already be deployed on that host
[13:24:56] Changing the file manually to point to deploy1002 makes the scap command work, but there is indeed a bad pattern as it tries to reapply those targets at each puppet run
[13:24:56] nfraison@stat1008:/srv/deployment/wikimedia/discovery/analytics-cache$ sudo run-puppet-agent
[13:24:56] ...
[13:24:56] Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: created (corrective)
[13:24:56] Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
[13:24:56] Notice: /Stage[main]/Profile::Analytics::Cluster::Elasticsearch/Scap::Target[wikimedia/discovery/analytics]/Package[wikimedia/discovery/analytics]/ensure: created (corrective)
[13:24:56] nfraison@stat1008:/srv/deployment/wikimedia/discovery/analytics-cache$ sudo run-puppet-agent
[13:24:57] ...
[13:24:57] Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: created (corrective)
[13:24:58] Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
[13:24:58] Notice: /Stage[main]/Profile::Analytics::Cluster::Elasticsearch/Scap::Target[wikimedia/discovery/analytics]/Package[wikimedia/discovery/analytics]/ensure: created (corrective)
[13:26:35] * volans would like to suggest using a paste service for pasting logs, like https://phabricator.wikimedia.org/paste/
[13:26:55] That means the existence check in Scap::Target is borked
[13:27:02] (iiuc)
[13:27:14] fun times, maybe give it a try with debug to get more output? `puppet agent -tv --debug`
[13:28:13] or try to redeploy it maybe?
From the deployment server: `scap deploy --limit stat1004.eqiad.wmnet`
[13:31:03] nfraison: hashar: just an fyi in case you missed it, i think this issue is the one i logged in T330394
[13:31:03] T330394: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394
[13:31:58] ack will amend it
[13:32:27] yeah looks like that is the same
[13:40:40] OH MY GOD
[13:40:47] that git safe.directory issue is a nightmare
[13:41:10] gotta rollback git I guess, it got magically upgraded
[13:41:36] I have a patch for the deployment server https://gerrit.wikimedia.org/r/c/operations/puppet/+/868002
[13:41:58] and the canonical task is https://phabricator.wikimedia.org/T325128
[13:51:39] so yeah tentatively (and I replied on the T330394 task):
[13:51:40] T330394: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394
[13:51:58] the git repo has files owned by whatever deployment user is used for that repo (ie files are not owned by root)
[13:52:28] when Puppet runs, it starts whatever magic we have (a type/provider implemented in ruby) which executes some `git` commands
[13:52:36] which end up being run as `root` because that is how Puppet runs
[13:53:04] git (as root) then fails due to the `.git` directory or other files being owned by another user
[13:53:16] tldr: Puppet should not execute git commands as root
[14:04:36] Interesting! We ran into this same issue
[14:18:01] same here in Traffic!
[15:09:42] what is some general documentation where I can see the etcd keys in use by our mediawiki installation (primary dc, read only) (the ones that are not db related)?
[15:10:26] I see https://wikitech.wikimedia.org/wiki/Conftool but those are generic
[15:10:55] ah, found it: https://wikitech.wikimedia.org/wiki/MediaWiki_and_EtcdConfig
[15:41:23] vgutierrez: thanks for adding tests to the nosniff patch (https://gerrit.wikimedia.org/r/c/operations/puppet/+/890512) - do you think it's ready to roll out?
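[Editor's note] The failure described above is git's `safe.directory` protection (introduced around git 2.35.2 and backported to older series): git refuses to operate on a repository owned by a different user unless the path is explicitly allowlisted. A sketch of the per-directory workaround, run against a throwaway HOME so it doesn't touch real configuration; the deploy path is taken from the paste earlier in the log:

```shell
# allowlist a deploy directory so git run as another user (e.g. root via
# Puppet) will operate on it; throwaway HOME keeps this demo self-contained
export HOME=$(mktemp -d)
git config --global --add safe.directory /srv/deployment/analytics/hdfs-tools/deploy
git config --global --get-all safe.directory
```

In practice a fleet-wide fix would more likely set this in the system config (`git config --system`) or, per the tldr above, avoid running git as root at all; that is an assumption about the eventual fix, not what the linked patch necessarily does.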
[15:52:00] legoktm: yeah, I wanted bblack's opinion first though :)
[15:52:53] ok!
[15:56:08] jynus: that page looks ok, although interestingly it doesn't mention or link to dbctl 🤔
[15:56:45] interesting, although I happened to be looking for non-dbctl keys
[15:56:57] (primary and read only)
[15:57:10] oh
[15:57:15] it's mostly written by joe, pre-dbctl hah
[17:13:56] thanks bblack!
[17:14:16] np!
[17:15:05] is this something we need to coordinate the roll out for, or ok to merge and let puppet apply it?
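[Editor's note] A simple rollout sanity check for the nosniff patch discussed above would be to inspect response headers for `X-Content-Type-Options`. Offline sketch on a captured header block (in practice one would fetch live headers, e.g. with `curl -sI <url>`; the sample below is invented for illustration):

```shell
# sample headers as they might be captured with `curl -sI` (hypothetical)
headers='HTTP/2 200
content-type: text/html; charset=utf-8
x-content-type-options: nosniff'
# case-insensitive match, since HTTP header names are case-insensitive
printf '%s\n' "$headers" | grep -i '^x-content-type-options:'
```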