[09:19:30] mutante: got it, thank you! but yeah as taavi was saying no additional fw access should be needed for prometheus hosts, if that's not the case we should look into why
[10:25:02] jbond: hmm I got PCC failing for me on a cloud environment over a secret() resource, https://puppet-compiler.wmflabs.org/output/888652/39815/deployment-acme-chief03.deployment-prep.eqiad.wmflabs/prod.deployment-acme-chief03.deployment-prep.eqiad.wmflabs.err
[10:27:51] vgutierrez: did you add it to the private repo?
[10:29:28] vgutierrez: should fix https://gerrit.wikimedia.org/r/c/labs/private/+/891504
[10:30:52] oh.. so that's failing since 2019
[10:31:03] 😅
[10:31:05] cheers jbond
[10:31:07] lol
[10:31:16] lol :)
[10:31:22] np
[10:38:08] Casually fixing a 4-year-old bug lmao
[10:52:19] hmmm
[10:52:39] it seems like the systemd oneshot thingie broke some WMCS puppetmaster timers
[10:53:03] stuff like puppet-git-sync-upstream.timer or remove_old_puppet_reports.timer
[10:58:26] hm I thought we fixed all of those yesterday
[11:04:02] ah, if puppet-git-sync-upstream is broken then the fix would not be applied, of course
[11:08:59] vgutierrez: do you have an example VM where that happens?
[11:09:28] I fixed that on the traffic project puppetmaster
[11:23:10] currently investigating why scap can't run on some hosts, have you run into this before? https://phabricator.wikimedia.org/T330360
[11:24:17] godog: yes, it is happening on multiple clusters
[11:24:37] see also T326668
[11:24:37] T326668: Scap fails on debian bullseye targets - https://phabricator.wikimedia.org/T326668
[11:25:11] ah thank you volans, will link the two and add #sre
[11:57:33] FYI we've just changed the merge policy of the cookbooks repo in gerrit to be 'Rebase-If-Necessary'. That should save you from the need to always rebase your cookbooks patches if someone has touched another cookbook.
The gate-and-submit should prevent patches that will break CI from being merged anyway, and the existing risk of merging two patches that don't work together is already not prevented by the
[11:57:39] current workflow. Let SRE I/F know if you ...
[11:57:42] ... encounter any issue.
[12:00:31] volans: oh yeaaaaah thank you <3
[12:35:11] hashar: some of our puppet runs are failing due to scap trying to sync from deploy1001.eqiad.wmnet. This is due to the .config file in the -cache of those repos to sync containing git_server=deploy1001.eqiad.wmnet. I'm wondering how this file is created (added when creating the repo with scap?) and how I should update it, manually or through a scap command that will update all impacted nodes?
[12:55:10] nfraison: a .config? Do you mean scap/scap.cfg configuration files? if so those are in the deployed git repo and would need to be changed in gerrit/gitlab
[12:55:56] and it looks like scap defaults to `git_server=deploy1001.eqiad.wmnet`
[12:56:07] https://gitlab.wikimedia.org/repos/releng/scap/-/blob/35c7009699fcd974c7691d1942f5d2ba65a866b4/scap/config.py#L70
[12:56:18] but that should be set globally via /etc/scap/scap.cfg
[12:56:40] no I really mean a .config file
[12:56:40] nfraison@stat1004:/srv/deployment/analytics/hdfs-tools/deploy-cache$ grep git_server .config
[12:56:40] git_server: deploy1001.eqiad.wmnet
[12:56:54] Could it not default to deployment.wikimedia.org ?
[12:56:58] fun. I have no idea what it is for
[12:57:03] Or would the key issue on switch break it?
[12:58:49] nfraison: can you paste the full puppet error?
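[Editor's note] The manual fix that comes up later in this log is rewriting the `git_server` key in the cached `.config`. A minimal sketch, demonstrated on a throwaway copy so it is safe to run anywhere; the real file would live at `<repo>-cache/.config` on the target host, per the paste above:

```shell
# demo on a throwaway copy of the cached config; the real file is
# <repo>-cache/.config on the scap target host (see the paste above)
cache=$(mktemp -d)
printf 'git_server: deploy1001.eqiad.wmnet\n' > "$cache/.config"
# point the cache at the current deployment server (hostname assumed)
sed -i 's/deploy1001\.eqiad\.wmnet/deploy1002.eqiad.wmnet/' "$cache/.config"
grep git_server "$cache/.config"
```

Whether editing the file by hand or letting scap regenerate it (e.g. via a refresh/redeploy) is the right long-term fix is exactly what the conversation below is trying to establish.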
[12:59:04] Meanwhile I have looked at this .config in the src code and I think I have my answer: https://gitlab.wikimedia.org/repos/releng/scap/-/blob/master/scap/deploy.py#L553
[12:59:04] But not fully sure how I should run the --refresh-config
[13:00:19] The puppet error: Feb 22 18:16:35 stat1004 scap[26324]: @cee: {"@version": 1, "type": "scap", "host": "stat1004", "script": "/usr/bin/scap", "user": "analytics-deploy", "levelname": "ERROR", "pathname": "/var/lib/scap/scap/lib/python3.7/site-packages/scap/cli.py", "filename": "cli.py", "module": "cli", "stack_info": null, "lineno": 412, "funcName": "_handle_exception", "process": 26324, "message": "deploy-local failed:
[13:00:19] Command 'git fetch --tags --jobs 14 --no-recurse-submodules' failed with exit code 128;\nstdout:\n\nstderr:\nfatal: unable to access 'http://deploy1001.eqiad.wmnet/analytics/hdfs-tools/deploy/.git/': The requested URL returned error: 503\n", "channel": "deploy-local", "@timestamp": "2023-02-22T18:16:35Z"}
[13:02:16] not sure why Puppet is running scap there, that is fun
[13:02:48] I mean, Puppet does run scap when preparing a scap::target but I don't see why it would be rerun once the repo has been cloned once
[13:03:24] I guess you could redeploy it (no idea of the effect) or manually adjust the config
[13:03:52] ack
[13:04:12] given I have no idea what generates that `.config` file
[13:12:22] maybe puppet should run scap with that --refresh-config flag you found
[13:14:09] then again I don't see why Puppet runs `scap deploy-local` given the repo should already be deployed on that host
[13:24:56] Changing the file manually to point to deploy1002 makes the scap command work, but there is indeed a bad pattern as it tries to reapply those targets at each puppet run
[13:24:56] nfraison@stat1008:/srv/deployment/wikimedia/discovery/analytics-cache$ sudo run-puppet-agent
[13:24:56] ...
[13:24:56] Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: created (corrective)
[13:24:56] Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
[13:24:56] Notice: /Stage[main]/Profile::Analytics::Cluster::Elasticsearch/Scap::Target[wikimedia/discovery/analytics]/Package[wikimedia/discovery/analytics]/ensure: created (corrective)
[13:24:56] nfraison@stat1008:/srv/deployment/wikimedia/discovery/analytics-cache$ sudo run-puppet-agent
[13:24:57] ...
[13:24:57] Notice: /Stage[main]/Profile::Analytics::Hdfs_tools/Scap::Target[analytics/hdfs-tools/deploy]/Package[analytics/hdfs-tools/deploy]/ensure: created (corrective)
[13:24:58] Notice: /Stage[main]/Profile::Analytics::Refinery::Repository/Scap::Target[analytics/refinery]/Package[analytics/refinery]/ensure: created (corrective)
[13:24:58] Notice: /Stage[main]/Profile::Analytics::Cluster::Elasticsearch/Scap::Target[wikimedia/discovery/analytics]/Package[wikimedia/discovery/analytics]/ensure: created (corrective)
[13:26:35] * volans would like to suggest using a paste service for pasting logs, like https://phabricator.wikimedia.org/paste/
[13:26:55] That means the existence check in Scap::Target is borked
[13:27:02] (iiuc)
[13:27:14] fun times, maybe give it a try with debug to get more output? `puppet agent -tv --debug`
[13:28:13] or try to redeploy it maybe?
From the deployment server: `scap deploy --limit stat1004.eqiad.wmnet`
[13:31:03] nfraison: hashar: just an fyi in case you missed it, i think this issue is the one i logged in T330394
[13:31:03] T330394: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394
[13:31:58] ack will amend it
[13:32:27] yeah looks like that is the same
[13:40:40] OH MY GOD
[13:40:47] that git safe.directory issue is a nightmare
[13:41:10] gotta rollback git I guess, it got magically upgraded
[13:41:36] I have a patch for the deployment server https://gerrit.wikimedia.org/r/c/operations/puppet/+/868002
[13:41:58] and the canonical task is https://phabricator.wikimedia.org/T325128
[13:51:39] so yeah tentatively (and I replied on the T330394 task):
[13:51:40] T330394: Scap issues with stat hosts - https://phabricator.wikimedia.org/T330394
[13:51:58] the git repo has files owned by whatever deployment user is used for that repo (ie files are not owned by root)
[13:52:28] when Puppet runs, it starts whatever magic we have (a type/provider implemented in ruby) which executes some `git` commands
[13:52:36] which end up being run as `root` because that is how Puppet runs
[13:53:04] git (as root) then fails due to the `.git` directory or other files being owned by another user
[13:53:16] tldr: Puppet should not execute git commands as root
[14:04:36] Interesting! We ran into this same issue
[14:18:01] same here in Traffic!
[15:09:42] what is some general documentation where I can see the etcd keys in use by our mediawiki installation (primary dc, read only) (the ones that are not db related)?
[15:10:26] I see https://wikitech.wikimedia.org/wiki/Conftool but those are generic
[15:10:55] ah, found it: https://wikitech.wikimedia.org/wiki/MediaWiki_and_EtcdConfig
[15:41:23] vgutierrez: thanks for adding tests to the nosniff patch (https://gerrit.wikimedia.org/r/c/operations/puppet/+/890512) - do you think it's ready to roll out?
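[Editor's note] The failure described above is git's `safe.directory` protection (introduced around git 2.35.2 and backported to older series): git refuses to operate on a repository owned by a different user unless the path is explicitly allowlisted. A sketch of the per-directory workaround, run against a throwaway HOME so it doesn't touch real configuration; the deploy path is taken from the paste earlier in the log:

```shell
# allowlist a deploy directory so git run as another user (e.g. root via
# Puppet) will operate on it; throwaway HOME keeps this demo self-contained
export HOME=$(mktemp -d)
git config --global --add safe.directory /srv/deployment/analytics/hdfs-tools/deploy
git config --global --get-all safe.directory
```

In practice a fleet-wide fix would more likely set this in the system config (`git config --system`) or, per the tldr above, avoid running git as root at all; that is an assumption about the eventual fix, not what the linked patch necessarily does.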
[15:52:00] legoktm: yeah, I wanted bblack's opinion first though :)
[15:52:53] ok!
[15:56:08] jynus: that page looks ok, although interestingly it doesn't mention or link to dbctl 🤔
[15:56:45] interesting, although I happened to be looking for non-dbctl keys
[15:56:57] (primary and read only)
[15:57:10] oh
[15:57:15] it's mostly written by joe, pre-dbctl hah
[17:13:56] thanks bblack!
[17:14:16] np!
[17:15:05] is this something we need to coordinate the roll out for, or ok to merge and let puppet apply it?
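[Editor's note] A simple rollout sanity check for the nosniff patch discussed above would be to inspect response headers for `X-Content-Type-Options`. Offline sketch on a captured header block (in practice one would fetch live headers, e.g. with `curl -sI <url>`; the sample below is invented for illustration):

```shell
# sample headers as they might be captured with `curl -sI` (hypothetical)
headers='HTTP/2 200
content-type: text/html; charset=utf-8
x-content-type-options: nosniff'
# case-insensitive match, since HTTP header names are case-insensitive
printf '%s\n' "$headers" | grep -i '^x-content-type-options:'
```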