[00:07:56] GitLab (Project Migration), Community-Tech, translatewiki.net, WS Export, Patch-For-Review: Migrate ws-export repo from GitHub to GitLab - https://phabricator.wikimedia.org/T395398#11104952 (Samwilson)
[03:23:50] FIRING: [2x] InstanceDown: Project deployment-prep instance deployment-db14 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:23:50] FIRING: WidespreadInstanceDown: Widespread instances down in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:24:57] FIRING: WidespreadInstanceDown: Widespread instances down in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:25:04] FIRING: [44x] InstanceDown: Project deployment-prep instance deployment-acme-chief05 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[03:25:31] RESOLVED: WidespreadInstanceDown: Widespread instances down in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DWidespreadInstanceDown
[03:28:49] RESOLVED: [44x] InstanceDown: Project deployment-prep instance deployment-acme-chief05 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[05:37:20] (PS1) Hslater: Zuul: [mediawiki/extensions/UserProfile] Add BlueSpiceDiscovery as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180708
[06:38:26] (Abandoned) Hashar: quibble-coverage: Fix `--path-to-mw` arg for skins [integration/config] - https://gerrit.wikimedia.org/r/1180606 (https://phabricator.wikimedia.org/T395470) (owner: Hashar)
[06:42:50] (PS5) Hashar: quibble-coverage: use os.path.join() in PHPUnit suite edit [integration/config] - https://gerrit.wikimedia.org/r/1180603 (https://phabricator.wikimedia.org/T402398)
[07:14:32] (Merged) jenkins-bot: jjb: add python 3.13 to labs/tools/heritage job [integration/config] - https://gerrit.wikimedia.org/r/1180716 (https://phabricator.wikimedia.org/T396273) (owner: Hashar)
[07:21:44] (CR) Hashar: [C:+2] jjb: update jobs for /srv/composer removal [integration/config] - https://gerrit.wikimedia.org/r/1180717 (owner: Hashar)
[07:25:35] (CR) Zuul test: "Merge Failed." [integration/config] - https://gerrit.wikimedia.org/r/1180719 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[07:25:47] (CR) CI reject: [V:-1] jjb: update quibble-coverage jobs [integration/config] - https://gerrit.wikimedia.org/r/1180719 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[07:40:16] (PS4) Hashar: quibble-coverage: make phpunit-suite-edit skin agnostic [integration/config] - https://gerrit.wikimedia.org/r/1180607 (https://phabricator.wikimedia.org/T402398)
[07:40:16] (PS1) Hashar: Zuul: switch patch coverage for skin to extension job [integration/config] - https://gerrit.wikimedia.org/r/1180720 (https://phabricator.wikimedia.org/T402398)
[07:40:21] (PS1) Hashar: jjb: update quibble-coverage jobs [integration/config] - https://gerrit.wikimedia.org/r/1180723 (https://phabricator.wikimedia.org/T402398)
[08:00:44] (CR) Hashar: [C:+2] quibble-coverage: make phpunit-suite-edit skin agnostic [integration/config] - https://gerrit.wikimedia.org/r/1180607 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[08:02:13] (Merged) jenkins-bot: quibble-coverage: use os.path.join() in PHPUnit suite edit [integration/config] - https://gerrit.wikimedia.org/r/1180603 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[08:02:28] (Merged) jenkins-bot: quibble-coverage: mwext-phpunit-coverage-patch usable by skins [integration/config] - https://gerrit.wikimedia.org/r/1180591 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[08:20:41] Project-Admins, Tracking-Neverending: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#11105532 (AbbanWMDE) Hey, Can I get added to acl*Project-Admins, please? Currently my team can't create sprints because the people with the correct privileg...
[08:48:33] (PS3) Hashar: jjb: update quibble-coverage jobs [integration/config] - https://gerrit.wikimedia.org/r/1180719 (https://phabricator.wikimedia.org/T402398)
[08:48:34] (PS1) Hashar: quibble-coverage: rm "extensions" path from phpunit-suite-edit [integration/config] - https://gerrit.wikimedia.org/r/1180817
[08:48:49] (CR) Hashar: [C:+2] quibble-coverage: rm "extensions" path from phpunit-suite-edit [integration/config] - https://gerrit.wikimedia.org/r/1180817 (owner: Hashar)
[08:50:08] (Merged) jenkins-bot: quibble-coverage: rm "extensions" path from phpunit-suite-edit [integration/config] - https://gerrit.wikimedia.org/r/1180817 (owner: Hashar)
[08:51:45] Project-Admins, Tracking-Neverending: Requests for addition to the #acl*Project-Admins group (in comments) - https://phabricator.wikimedia.org/T706#11105591 (hashar) >>! In T706#11105532, @AbbanWMDE wrote: > Hey, > > Can I get added to acl*Project-Admins, please? Currently my team can't create sprints b...
[08:58:14] (CR) Hashar: [C:+2] jjb: update quibble-coverage jobs [integration/config] - https://gerrit.wikimedia.org/r/1180719 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[08:59:52] (Merged) jenkins-bot: jjb: update quibble-coverage jobs [integration/config] - https://gerrit.wikimedia.org/r/1180719 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[09:04:46] Beta-Cluster-Infrastructure, RESTBase, Beta-Cluster-reproducible: HyperSwitch/errors/not found (404) on beta cluster: There was an issue displaying this preview - https://phabricator.wikimedia.org/T402206#11105660 (Jgiannelos) In production the request path is: * Edge (ATS/Varnish) [1] * For speci...
[09:08:01] (CR) Hashar: "I have tried on a dummy change https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/1180811 by running the job: https://integration.w" [integration/config] - https://gerrit.wikimedia.org/r/1180720 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[09:20:32] (CR) Hashar: [C:-1] Zuul: switch patch coverage for skin to extension job [integration/config] - https://gerrit.wikimedia.org/r/1180720 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[10:00:48] GitLab (Project Migration), Wikimedia-GitHub, Epic: Migrate active Wikimedia repositories in GitHub to GitLab - https://phabricator.wikimedia.org/T305039#11105835 (Samwilson)
[10:28:19] GitLab (Account Approval), Release-Engineering-Team: Requesting GitLab account activation for Hiperterminal - https://phabricator.wikimedia.org/T402506 (Hiperterminal) NEW
[11:00:10] GitLab (Project Migration), Community-Tech, SVG Translate Tool: Migrate SVG Translate from GitHub to GitLab - https://phabricator.wikimedia.org/T402505#11105902 (A_smart_kitten)
[11:08:45] Scap: Exception while building "next" image - https://phabricator.wikimedia.org/T402508 (jnuche) NEW
[11:09:15] Scap: Exception while building "next" image - https://phabricator.wikimedia.org/T402508#11105950 (jnuche)
[11:30:03] maintenance-disconnect-full-disks build 730052 integration-agent-docker-1050 (/: 26%, /srv: 99%, /var/lib/docker: 35%): OFFLINE due to disk space
[11:35:03] maintenance-disconnect-full-disks build 730053 integration-agent-docker-1050 (/: 26%, /srv: 11%, /var/lib/docker: 33%): RECOVERY disk space OK
[12:12:45] (PS1) Hslater: Zuul: [mediawiki/extensions/ContentProvisioning] Add OOJSPlus as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180844
[12:23:16] I swear, that is the last time I touch those mediawiki coverage scripts
[12:23:45] (CR) Hashar: [C:+2] Zuul: [mediawiki/extensions/UserProfile] Add BlueSpiceDiscovery as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180708 (owner: Hslater)
[12:23:52] (CR) Hashar: [C:+2] Zuul: [mediawiki/extensions/ContentProvisioning] Add OOJSPlus as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180844 (owner: Hslater)
[12:25:15] (Merged) jenkins-bot: Zuul: [mediawiki/extensions/UserProfile] Add BlueSpiceDiscovery as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180708 (owner: Hslater)
[12:25:20] (Merged) jenkins-bot: Zuul: [mediawiki/extensions/ContentProvisioning] Add OOJSPlus as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180844 (owner: Hslater)
[12:35:47] Gerrit, ProveIt-Gadget: Users of ProvieIt gadget get a 403 Forbidden fetching i18n files from Gerrit/Gitiles - https://phabricator.wikimedia.org/T394916#11106127 (Sophivorus) @Iniquity Hi! They don't. Setting up a Gerrit repo for Proveit, connecting it to TranslateWiki, and loading the translations f...
[12:58:19] GitLab (CI & Job Runners), Release-Engineering-Team, collaboration-services, function-evaluator, and 4 others: Wikifunctions function orchestrator and evaluator test suites failing on GitLab CI with OOM errors - https://phabricator.wikimedia.org/T399348#11106231 (akosiaris) >>! In T399348#1101099...
[13:21:36] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376#11106373 (jnuche) Open→Resolved Things have been stable all day. It looks safe to close the task
[13:23:15] Scap: Exception while building "next" image - https://phabricator.wikimedia.org/T402508#11106395 (jnuche)
[13:25:28] Beta-Cluster-Infrastructure, RESTBase, Beta-Cluster-reproducible, Patch-For-Review: HyperSwitch/errors/not found (404) on beta cluster: There was an issue displaying this preview - https://phabricator.wikimedia.org/T402206#11106400 (Tgr) >>! In T402206#11105660, @Jgiannelos wrote: > Would it make...
[14:19:23] (PS1) Hslater: Zuul: [mediawiki/extensions/BlueSpiceAvatars] Add UserProfile as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180877
[14:31:15] Beta-Cluster-Infrastructure, RESTBase, Beta-Cluster-reproducible, Patch-For-Review: HyperSwitch/errors/not found (404) on beta cluster: There was an issue displaying this preview - https://phabricator.wikimedia.org/T402206#11106759 (Krinkle) >>! In T402206#11105660, @Jgiannelos wrote: > In produc...
[14:31:59] Scap: Exception while building "next" image - https://phabricator.wikimedia.org/T402508#11106762 (Scott_French) The underlying failure is a bit earlier in the log: ` 08:04:21 [webserver-webserver-bookworm] sha256:af6db323f4b3396cf6d63cc918d146f5b63808d6da24630e8722e2a9b6650f1f 08:04:21 [webserver-webserver-...
[14:35:52] (CR) Jforrester: "Only whilst they're still being used / not yet migrated to. Codex is (was?) meant to be moving to Node22 soon, but Node24 only more slowly" [integration/config] - https://gerrit.wikimedia.org/r/1180552 (owner: Hashar)
[14:52:04] Scap: Exception while building "next" image - https://phabricator.wikimedia.org/T402508#11106871 (Scott_French) From the registry side: ` Aug 20 08:04:26 registry2004 docker-registry[631]: time="2025-08-20T08:04:26.999608527Z" level=error msg="error putting payload into blobstore: swift: Object Corrupted" g...
[14:58:53] (approved) dancy: make-container-image: remove php version metadata label [repos/releng/release] - https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/209 (https://phabricator.wikimedia.org/T401721) (owner: swfrench)
[15:07:19] (CR) Hashar: "Good! My goal was merely to simplify the job definitions." [integration/config] - https://gerrit.wikimedia.org/r/1180552 (owner: Hashar)
[15:07:30] (CR) Hashar: [C:+2] Zuul: [mediawiki/extensions/BlueSpiceAvatars] Add UserProfile as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180877 (owner: Hslater)
[15:08:29] (CR) Jforrester: [C:+1] "Yeah, that works." [integration/config] - https://gerrit.wikimedia.org/r/1180552 (owner: Hashar)
[15:09:55] (Merged) jenkins-bot: Zuul: [mediawiki/extensions/BlueSpiceAvatars] Add UserProfile as dependency [integration/config] - https://gerrit.wikimedia.org/r/1180877 (owner: Hslater)
[15:34:46] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.45.0-wmf.15 deployment blockers - https://phabricator.wikimedia.org/T396376#11107196 (hashar) @ArielGlenn and I did a log a triage and filed a few more issues, but nothing that looks to be troublesome. Congratulations @...
[15:47:01] (update) dancy: deploy_promote.py: Attribute wikiversions.json updates to the initiating user [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/990 (https://phabricator.wikimedia.org/T401779)
[15:47:10] (update) dancy: deploy_promote.py: Attribute wikiversions.json updates to the initiating user [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/990 (https://phabricator.wikimedia.org/T401779)
[15:47:22] (update) dancy: deploy_promote.py: Attribute wikiversions.json updates to the initiating user [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/990 (https://phabricator.wikimedia.org/T401779)
[15:50:01] (update) dancy: deploy_promote.py: Attribute wikiversions.json updates to the initiating user [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/990 (https://phabricator.wikimedia.org/T401779)
[15:53:15] (merge) dancy: deploy_promote.py: Attribute wikiversions.json updates to the initiating user [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/990 (https://phabricator.wikimedia.org/T401779)
[15:53:53] (open) dancy: Release 4.208.0 [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/992
[15:56:10] (merge) dancy: Release 4.208.0 [repos/releng/scap] - https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/992
[15:57:00] (CR) Hashar: [C:-1] "The `mwskin-phpunit-coverage-patch` and `mwext-phpunit-coverage-patch` scripts are actually different!" [integration/config] - https://gerrit.wikimedia.org/r/1180720 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[15:57:27] (CR) Hashar: [C:-1] "Marking comment as unresolved ;)" [integration/config] - https://gerrit.wikimedia.org/r/1180720 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[15:59:13] Krinkle: I had fun with `phpunit-patch-coverage` and found out that when `--test-dir` is passed as an absolute path that results in discarding any of the files found to have been altered in HEAD cause they are given as relative paths :]
[15:59:14] https://gerrit.wikimedia.org/r/c/integration/config/+/1180720/comments/fd8d4703_4a249424
[15:59:16] no urgency :]
[15:59:37] I'd hoped I could eventually drop the mwskin- variants of the jobs :]
[16:04:47] hashar: right, that'd be cool.
[16:05:26] so I guess PHPUnit compares the paths as relative paths, not e.g. expanded/realpath
[16:05:31] so absolute doesn't work.
[16:05:45] is that the conclusion?
[16:08:14] that is worse than that! :]
[16:08:28] the issue is in phpunit-patch-coverage
[16:08:46] it looks at the list of files altered in HEAD
[16:08:52] which gives an array of relative paths
[16:09:28] the items are then filtered so that each item must start with --test-dir
[16:09:49] for the mwskin variant, --test-dir is not passed and it thus uses the default: `tests/phpunit`
[16:10:27] the files found in HEAD starting with `tests/phpunit` are thus accepted
[16:11:31] for the mwext variant, --test-dir is passed as an absolute path
[16:11:32] https://gerrit.wikimedia.org/g/integration/config/+/refs/heads/master/dockerfiles/quibble-bullseye-php81-coverage/mwext-phpunit-coverage-patch.sh
[16:12:34] the files found in HEAD (ex: `tests/phpunit/MySuperTest.php`) are thus discarded because they don't start with `/workspace/src/skins/Vector/tests/phpunit`
[16:13:21] fun times
[16:18:42] (merge) swfrench: make-container-image: remove php version metadata label [repos/releng/release] - https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/209 (https://phabricator.wikimedia.org/T401721)
[16:21:28] FIRING: InstanceDown: Project deployment-prep instance deployment-cache-text08 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:21:51] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557 (wmcs-alerts) NEW
[16:23:48] What has been going on with deployment-cache-text08.deployment-prep.eqiad1.wikimedia.cloud? Looking now
[16:24:35] not responsive to ssh. I'm going to guess we have a new set of attacking bots
[16:26:19] is the RAM cache pool correctly sized? https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&from=now-2d&to=now&timezone=browser&var-project=deployment-prep&var-instance=deployment-cache-text08&viewPanel=panel-40
[16:26:36] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107588 (bd808) Open→In progress p:Triage→High a:bd808 This check flapped a few times yesterday. I will dig in deeper. My fear is that we have a new s...
[16:27:44] that's worth looking into taavi
[16:28:04] it is stalled on IO
[16:28:08] :(
[16:28:13] https://grafana.wmcloud.org/d/0g9N-7pVz/cloud-vps-project-board?orgId=1&from=now-1h&to=now&timezone=browser&var-project=deployment-prep&var-instance=deployment-cache-text08
[16:29:08] !log deployment-cache-text08 hard reboot via horizon (T402557)
[16:29:10] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:29:11] T402557: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557
[16:31:28] RESOLVED: InstanceDown: Project deployment-prep instance deployment-cache-text08 is down - https://prometheus-alerts.wmcloud.org/?q=alertname%3DInstanceDown
[16:31:35] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402559 (wmcs-alerts) NEW
[16:34:36] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402559#11107648 (bd808) →Duplicate dup:T402557
[16:34:37] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107646 (bd808)
[16:35:30] varnish things are angry as hell on deployment-cache-text08 after the reboot. Looks like we need some puppet something changed.
[16:37:24] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107659 (bd808) `lang=shell-session,counterexample,lines=10 bd808@deployment-cache-text08.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv Info: Using environment 'prod...
[16:42:39] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107670 (bd808) Puppet is generating bad VCL in /etc/varnish/text-frontend.inc.vcl. I hacked it to compile and the next puppet run broke it again with this diff: ` --- /e...
[16:42:58] (PS1) Hashar: quibble-coverage: pass --test-dir as a relative path [integration/config] - https://gerrit.wikimedia.org/r/1180908 (https://phabricator.wikimedia.org/T288396)
[16:44:17] (CR) Hashar: [C:-1] "That was caused by I07327632ddcb869b4b0d440c89c42883f3313362 which passed an absolute directory to `--test-dir`. The fix is to use a rela" [integration/config] - https://gerrit.wikimedia.org/r/1180720 (https://phabricator.wikimedia.org/T402398) (owner: Hashar)
[16:44:54] I had enough with the phpunit coverage madness
[16:57:05] !log Removed cherry-picks of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180577, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180220, https://gerrit.wikimedia.org/r/c/operations/puppet/+/1180166 (T402557)
[16:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:57:08] T402557: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557
[16:57:19] Scap, serviceops: Provide MediaWiki app image PHP version in helm values - https://phabricator.wikimedia.org/T401721#11107703 (Scott_French) In progress→Resolved [[ https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/209 | release!209 ]] was picked up with the deployment for T4...
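Note on the `--test-dir` issue discussed between 15:59 and 16:44: the prefix filter described there can be sketched in a few lines. This is a minimal illustration in Python, not the actual `phpunit-patch-coverage` code (which is a PHP tool); the function name `filter_changed_files` and the repository root used for normalization are assumptions made for the example. The paths are the ones quoted in the discussion.

```python
import posixpath


def filter_changed_files(changed_files, test_dir="tests/phpunit"):
    """Hypothetical stand-in for the filter described in the log:
    keep only changed files whose path starts with test_dir."""
    return [path for path in changed_files if path.startswith(test_dir)]


# Files altered in HEAD are reported relative to the repository root.
changed = ["tests/phpunit/MySuperTest.php", "includes/Foo.php"]

# mwskin variant: --test-dir is not passed, so the relative default applies.
relative_hits = filter_changed_files(changed)

# mwext variant: --test-dir is passed as an absolute path, so no relative
# path can start with it and every changed test file is discarded.
absolute_dir = "/workspace/src/skins/Vector/tests/phpunit"
absolute_hits = filter_changed_files(changed, absolute_dir)

# One way to get a relative --test-dir (assumed repository root shown here).
repo_root = "/workspace/src/skins/Vector"
normalized_dir = posixpath.relpath(absolute_dir, repo_root)
normalized_hits = filter_changed_files(changed, normalized_dir)

print(relative_hits)
print(absolute_hits)
print(normalized_dir, normalized_hits)
```

With a relative `--test-dir`, both variants select the same files, which matches the fix described in the 16:42:58 patch ("pass --test-dir as a relative path").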
[16:59:40] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107729 (bd808) `lang=shell-session,lines=10,counterexample bd808@deployment-cache-text08.deployment-prep.eqiad1:~$ sudo -i puppet agent -tv Info: Using environment 'prod...
[17:05:25] !log `sudo /usr/local/bin/puppetserver-deploy-code` on deployment-puppetserver-1 (T402557)
[17:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:05:28] T402557: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557
[17:06:02] * bd808 grumbles about things seeming to arbitrarily not happen.
[17:13:10] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107838 (bd808) >>! In T402557#11107702, @Stashbot wrote: > {nav icon=file, name=Mentioned in SAL (#wikimedia-releng), href=https://sal.toolforge.org/log/ZF6QzZgBvg159pQr...
[17:17:28] FIRING: PuppetAgentFailure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[17:17:36] Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://phabricator.wikimedia.org/T402570 (wmcs-alerts) NEW
[17:25:21] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107915 (Krinkle) I missed this error because whenever we execute `run-puppet-agent` on deployment-cache-text in Beta, after a Puppet patch that changes VCL files, it run...
[17:26:04] bd808: oh huh, the reboot would have cleared the old config it was running so that took it down a second time.
[17:26:15] Funny how that all interacts.
[17:27:28] RESOLVED: PuppetAgentFailure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetAgentFailure
[17:27:37] Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://phabricator.wikimedia.org/T402570#11107924 (wmcs-alerts)
[17:27:58] Beta-Cluster-Infrastructure: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11107925 (BCornwall) Would it be possible to increase the RAM in the meantime?
[17:28:49] Krinkle: yeah I could see how that might be the chain there. taavi noticed the ram exhaustion stuff as I was starting to poke. I wonder if we need to resize that instance?
[17:29:34] The same does not happen on cache-upload
[17:29:49] I'm not sure where the memory is going but yeah maybe we can just not care and see if more is enough.
[17:30:17] I'm guessing requestctl or some other new VCL code path is consuming more memory than before while figuring out stuff during a reload.
[17:30:57] once it gets through, the memory goes down..
[17:31:23] I was half-way looking at resizing cache-text in horizon but unsure how well tested that is or what the consequences might be
[17:31:38] I guess it's not too rare a thing, so yeah, SGTM?
[17:32:03] Krinkle: I'm asking a.ndrewbogott in another channel if resize works these days. I think it does though.
[17:32:09] I'll be re-applying the cherry picks now as I'm writing the next m-dot patches.
[17:32:21] That will almost certainly fire the alert again for a short time, since it takes like 5-10 minutes to apply
[17:32:28] taavi says resize works, so let me try that
[17:32:39] k, I'll wait for that, and write the ATS code meanwhile
[17:34:01] !log Resize deployment-cache-text08 from g4.cores2.ram4.disk20 -> g4.cores4.ram8.disk20 (T402557)
[17:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[17:34:03] T402557: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557
[17:37:51] Krinkle: deployment-cache-text08 has 8G of ram now. hopefully that helps
[17:39:43] alright, let's see
[17:42:53] bd808: nice, that worked on the first try and pretty quick too
[17:43:22] yay!
[17:43:27] the only thing broken right now is https://test.wikipedia.beta.wmcloud.org/ not routing mobile user agents, which is intentional / the thing I'm fixing.
[17:43:39] other wikis are fine
[18:16:35] Gerrit, ProveIt-Gadget: Users of ProvieIt gadget get a 403 Forbidden fetching i18n files from Gerrit/Gitiles - https://phabricator.wikimedia.org/T394916#11108142 (Iniquity) It seems to me that abandoning the practice of translating global gadgets via translatwiki.net and using the translations at Com...
[18:26:47] Scap (SpiderPig 🕸️): Add an 'audit-trail' note to SpiderPig train-deployment Gerrit patches (i.e., which human made this train deployment?) - https://phabricator.wikimedia.org/T401779#11108206 (dancy) Open→Resolved a:dancy Deployed via scap 4.208.0.
[18:27:28] FIRING: PuppetSyncFailure: Failed to update Puppet repository /srv/git/operations/puppet on instance deployment-puppetserver-1 in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[18:27:34] Beta-Cluster-Infrastructure: Failed to update Puppet repository /srv/git/operations/puppet on instance deployment-puppetserver-1 in project deployment-prep - https://phabricator.wikimedia.org/T402578 (wmcs-alerts) NEW
[18:28:11] Beta-Cluster-Infrastructure, Patch-For-Review: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11108216 (Krinkle) In progress→Resolved Thanks @bd808! VCL reload seem to work again now.
[18:28:50] Scap (SpiderPig 🕸️): Add an 'audit-trail' note to SpiderPig train-deployment Gerrit patches (i.e., which human made this train deployment?) - https://phabricator.wikimedia.org/T401779#11108221 (dancy) @Aklapper I made a change that affects train operations and I see that you're on train duty next, so mak...
[18:29:28] Release-Engineering-Team, Scap (SpiderPig 🕸️), Essential-Work: Add an 'audit-trail' note to SpiderPig train-deployment Gerrit patches (i.e., which human made this train deployment?) - https://phabricator.wikimedia.org/T401779#11108222 (dancy) p:Triage→Medium
[19:30:57] (open) dancy: make-container-image/build-images.py: Mark --mediawiki-versions required [repos/releng/release] - https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/210
[19:31:00] (update) dancy: make-container-image/build-images.py: Mark --mediawiki-versions required [repos/releng/release] - https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/210
[19:33:17] (merge) dancy: make-container-image/build-images.py: Mark --mediawiki-versions required [repos/releng/release] - https://gitlab.wikimedia.org/repos/releng/release/-/merge_requests/210
[19:39:39] Release-Engineering-Team (Radar), MediaWiki-Engineering, serviceops, Patch-For-Review: Deprecate mwdebugXXXX hosts - https://phabricator.wikimedia.org/T397498#11108464 (Scott_French)
[19:57:28] RESOLVED: PuppetSyncFailure: Failed to update Puppet repository /srv/git/operations/puppet on instance deployment-puppetserver-1 in project deployment-prep - https://prometheus-alerts.wmcloud.org/?q=alertname%3DPuppetSyncFailure
[19:57:33] Beta-Cluster-Infrastructure: Failed to update Puppet repository /srv/git/operations/puppet on instance deployment-puppetserver-1 in project deployment-prep - https://phabricator.wikimedia.org/T402578#11108520 (wmcs-alerts)
[20:08:48] y'all! Beta's puppetserver is down to only one non-hack cherry-pick and it is a thing being actively worked on. Nice!
[20:18:53] Wow! Nice work everyone!
[20:18:57] Hacks--
[21:01:50] Beta-Cluster-Infrastructure: Failed to update Puppet repository /srv/git/operations/puppet on instance deployment-puppetserver-1 in project deployment-prep - https://phabricator.wikimedia.org/T402578#11108759 (bd808) Open→Invalid Double checked that this is resolved. I would guess that https://ge...
[21:02:20] Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-text08 in project deployment-prep - https://phabricator.wikimedia.org/T402570#11108764 (bd808) →Duplicate dup:T402557
[21:02:22] Beta-Cluster-Infrastructure, Patch-For-Review: Project deployment-prep instance deployment-cache-text08 is down - https://phabricator.wikimedia.org/T402557#11108762 (bd808)
[21:04:04] Beta-Cluster-Infrastructure: Widespread instances down in project deployment-prep - https://phabricator.wikimedia.org/T402482#11108769 (bd808) Open→Invalid I think this was ceph blips
[21:12:22] Beta-Cluster-Infrastructure, Puppet: /usr/local/bin/puppetserver-deploy-code emits scary looking error messages during a `git rebase` operation - https://phabricator.wikimedia.org/T397877#11108801 (bd808) In progress→Resolved
[21:57:45] (open) lmora: releases: Bump Codex to 2.3.1 [repos/ci-tools/libup-config] - https://gitlab.wikimedia.org/repos/ci-tools/libup-config/-/merge_requests/89 (https://phabricator.wikimedia.org/T402270)
[22:37:00] (approved) volker-e: releases: Bump Codex to 2.3.1 [repos/ci-tools/libup-config] - https://gitlab.wikimedia.org/repos/ci-tools/libup-config/-/merge_requests/89 (https://phabricator.wikimedia.org/T402270) (owner: lmora)
[22:37:12] (merge) volker-e: releases: Bump Codex to 2.3.1 [repos/ci-tools/libup-config] - https://gitlab.wikimedia.org/repos/ci-tools/libup-config/-/merge_requests/89 (https://phabricator.wikimedia.org/T402270) (owner: lmora)
[23:40:36] Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure), Performance Issue: CI jobs hanging for ~20 mins - https://phabricator.wikimedia.org/T402606 (Reedy) NEW
[23:48:32] Continuous-Integration-Infrastructure, ci-test-error (WMF-deployed Build Failure), Performance Issue: CI jobs hanging for ~20 mins - https://phabricator.wikimedia.org/T402606#11109260 (bd808) I complained about similar timestamps in another CI bug report and heard it was likely related to the paralle...
[23:53:32] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.45.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T396377#11109267 (SecurityPatchBot)