[00:20:29] Release-Engineering-Team, SRE, SRE-OnFire, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (Dzahn) list of repos that exist on deployment servers but do not appear in the kubernetes.yaml. (just using the string that is the first level of th...
[01:04:52] Beta-Cluster-Infrastructure, Abstract Wikipedia team, Patch-For-Review: Create a Beta Cluster version of Wikifunctions.org - https://phabricator.wikimedia.org/T284162 (ori) The orchestrator throws errors on the Beta Cluster because it's unable to get local issuer certificate: ` {"name":"function-orc...
[01:09:40] GitLab, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:09:58] GitLab, SRE, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:12:07] GitLab, Data-Persistence-Backup, serviceops, Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn)
[01:12:18] GitLab (Infrastructure), SRE, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn)
[01:12:52] GitLab (Infrastructure), Data-Persistence-Backup, serviceops, Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn)
[01:13:30] GitLab (Infrastructure), serviceops, Patch-For-Review: gitlab-restore: version detection fail / restore fail - https://phabricator.wikimedia.org/T308089 (Dzahn)
[02:06:28] Beta-Cluster-Infrastructure, service-runner: Service cannot make HTTPS requests due to missing ca-certificates in Docker image - https://phabricator.wikimedia.org/T309261 (ori)
[02:16:30] Beta-Cluster-Infrastructure, Abstract Wikipedia team, Patch-For-Review: Create a Beta Cluster version of Wikifunctions.org - https://phabricator.wikimedia.org/T284162 (ori) Filed T309261 for the missing issuer certificates. Temporarily worked around this by setting `NODE_TLS_REJECT_UNAUTHORIZED=0` in...
[02:25:25] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc1002.eqiad.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:21:27] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:21:37] Deployments, Release-Engineering-Team (Doing), Parsoid, SRE, bacula: Accidental removal of some files under /srv/deployment on deploy1002 - https://phabricator.wikimedia.org/T307349 (jcrespo) Incident lightweight report: https://wikitech.wikimedia.org/wiki/Incidents/2022-05-2_deployment
[06:31:14] Release-Engineering-Team, SRE, SRE-OnFire, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (jcrespo) @Dzahn That doesn't seem right- mediawiki-staging is the current main method of deploying mediawiki, and httpbb-tests seems in active usage...
[07:00:10] Release-Engineering-Team, SRE, SRE-OnFire, Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (jcrespo) This is a list of resources configured on puppet, but I am not sure if the list is exhaustive: ` File[/srv/deployment/scap] from /etc/puppe...
[07:28:09] GitLab (Infrastructure), Data-Persistence-Backup, serviceops, Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (jcrespo) Yesterday's gitlab full backup was of only 42KB FYI. I would consider that a backup failure. ` id: 445024, ts: 2022-05-26 05:...
[07:36:17] Release-Engineering-Team, Gerrit-Privilege-Requests: Request for Gerrit Managers permissions for karapayneWMDE - https://phabricator.wikimedia.org/T302262 (Aklapper) Tagging #release-engineering-team for potential advice as this has been sitting here for three months, as it's unclear to me who to proceed...
[08:41:27] Project-Admins: Create project tag for <#DSE-K8S> - https://phabricator.wikimedia.org/T309095 (BTullis)
[13:20:03] Project beta-update-databases-eqiad build #58848: FAILURE in 2.6 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58848/
[13:47:56] weeee https://gerrit.wikimedia.org/r/800000/
[14:02:31] What a surprise that LibUp hit it. :-)
[14:20:09] Project beta-update-databases-eqiad build #58849: STILL FAILING in 2.3 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58849/
[14:27:51] Beta-Cluster-Infrastructure, service-runner: Provide a means of shipping logs from Docker-run services in Beta to logstash - https://phabricator.wikimedia.org/T309319 (ori)
[14:59:11] Beta-Cluster-Infrastructure, service-runner: Service cannot make HTTPS requests due to missing ca-certificates in Docker image - https://phabricator.wikimedia.org/T309261 (Jdforrester-WMF) Looking at https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/services/function-orchestrator/+/refs/heads/mast...
[15:20:02] Project beta-update-databases-eqiad build #58850: STILL FAILING in 2.2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58850/
[15:27:10] Project beta-update-databases-eqiad build #58851: STILL FAILING in 2 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58851/
[15:32:53] taavi: Cool change number!
[16:20:03] Project beta-update-databases-eqiad build #58852: STILL FAILING in 3.1 sec: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58852/
[16:51:04] !log puppetmaster-1001.devtools: resetting ops/puppet checkout to production branch
[16:51:05] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:51:50] !log manually triggered beta-update-databases-eqiad post-merge of 2c7b5825
[16:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[16:59:39] Yippee, build fixed!
[16:59:39] Project beta-update-databases-eqiad build #58853: FIXED in 10 min: https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/58853/
[17:51:16] Continuous-Integration-Config: Set beta-code-update-eqiad to fail on timeout - https://phabricator.wikimedia.org/T309339 (TheresNoTime)
[17:53:18] (PS1) Samtar: beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339)
[18:08:07] (CR) Ahmon Dancy: "Great idea. I have a question about the default timeout behavior." [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:11:04] (CR) Samtar: beta.yaml: set beta update jobs to fail (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:13:05] (CR) Ahmon Dancy: [C: +1] beta.yaml: set beta update jobs to fail (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:17:39] (CR) Ahmon Dancy: [C: +1] beta.yaml: set beta update jobs to fail (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:19:51] TheresNoTime: I updated the beta-* jobs based on 800199, 1
[18:21:46] (CR) Ahmon Dancy: [V: +1 C: +1] beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:22:20] (PS2) Samtar: beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339)
[18:22:25] (CR) Samtar: beta.yaml: set beta update jobs to fail (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:23:05] dancy: ^ good points
[18:23:59] (CR) Ahmon Dancy: [C: +2] "Thank you!" [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:24:20] (CR) CI reject: [V: -1] beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:24:27] oop
[18:26:06] `All Jenkins jobs must have a timeout`
[18:26:55] specifically for beta-scap-sync-world.
[18:27:17] I guess because it overrides the wrappers key
[18:27:49] I propose leaving its original timeout section, but add fail: true to it too
[18:27:56] (PS3) Samtar: beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339)
[18:28:52] dancy: hm, I went back to the original timeout section, will it not pick up the default fail: true?
[18:29:23] That's a good question. That's my impression based on how the test failed.
[18:30:00] to save another patch set, I'll let that merge and we can revisit?
[18:30:06] yes.
[18:30:11] (CR) Ahmon Dancy: [C: +2] beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:32:47] (Merged) jenkins-bot: beta.yaml: set beta update jobs to fail [integration/config] - https://gerrit.wikimedia.org/r/800199 (https://phabricator.wikimedia.org/T309339) (owner: Samtar)
[18:33:17] cool, let's see what breaks
[18:33:19] !log Updated Jenkins beta-* job configs
[18:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[18:33:46] I appreciate the attempts to work around the problems. It's very annoying
[18:40:13] Release-Engineering-Team (Priority Backlog 📥), Release, Train Deployments: 1.39.0-wmf.14 deployment blockers - https://phabricator.wikimedia.org/T308067 (Ladsgroup) ##### Risky Patch! 🚂🔥 * **Change**: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/778661 * **Summary**: ** Changes how maintenanc...
[19:14:07] (CR) BryanDavis: "dduvall: I think this is ready to go when you have time to merge and deploy." [blubber] - https://gerrit.wikimedia.org/r/789950 (https://phabricator.wikimedia.org/T296046) (owner: BryanDavis)
[19:21:31] (PS4) Jforrester: Stop branching LocalisationUpdate for the MediaWiki tarball [tools/release] - https://gerrit.wikimedia.org/r/790356 (https://phabricator.wikimedia.org/T300498)
[19:21:34] (CR) Jforrester: [C: +2] Stop branching LocalisationUpdate for the MediaWiki tarball [tools/release] - https://gerrit.wikimedia.org/r/790356 (https://phabricator.wikimedia.org/T300498) (owner: Jforrester)
[19:26:19] (Merged) jenkins-bot: Stop branching LocalisationUpdate for the MediaWiki tarball [tools/release] - https://gerrit.wikimedia.org/r/790356 (https://phabricator.wikimedia.org/T300498) (owner: Jforrester)
[20:01:14] bd808: which remote logging protocol(s) does the beta cluster logstash instance accept?
[20:01:43] ori: I can proudly say I have no idea :)
[20:02:15] weeee
[20:03:14] I think that kafka via rsyslog is the preferred mess these days
[20:05:49] ack
[21:06:27] thcipriani: hi Tyler, so .. you are group approver for the group "restricted" (deployment light). Would you approve https://phabricator.wikimedia.org/T309045 ? (Alex had originally suggested deployment group but I was bold and "downgraded" to restricted because all they need is mwmaint)
[21:06:46] keyword: growth experiments
[21:07:07] it's for "currently blocking T307454: May 23 – Export and upload welcome survey data"
[21:07:08] T307454: May 23 – Export and upload welcome survey data - https://phabricator.wikimedia.org/T307454
[21:17:31] GitLab (Infrastructure), SRE, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn) Open→Resolved a:Dzahn Notice: /Stage[main]/Ferm/Service[ferm]/ensure: ensure changed 'stopped' to 'running' (corrective) Info: /Stage[main]/Ferm/Service[ferm]: U...
[21:27:29] (PS3) Ahmon Dancy: WIP: Deploy mw image to clusters defined in config [tools/scap] - https://gerrit.wikimedia.org/r/789659 (https://phabricator.wikimedia.org/T299648)
[21:30:41] mutante: Tyler is off today and tomorrow.
[21:31:00] Continuous-Integration-Config: Set beta-code-update-eqiad to fail on timeout - https://phabricator.wikimedia.org/T309339 (TheresNoTime) Open→Resolved
[21:31:37] dancy: ah, ok. thank you
[22:07:19] ori: you might already know, but, you can cherry-pick puppet patches for beta directly without CR. if you need any access or pointers, feel free to ping :)
[22:07:36] (it then gets merged later once you're happy with how it works)
[23:14:02] GitLab (Infrastructure), SRE, serviceops: gitlab1004 - puppet cert revoked? - https://phabricator.wikimedia.org/T309259 (Dzahn) Now using this machine for https://gerrit.wikimedia.org/r/c/operations/puppet/+/800308 and setting it active in netbox.
[23:17:58] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn) set gitlab1004, gitlab-runner1002/1003/1004, gitlab-runner2002/2003/2004 from staged to Active status in netbox. because meanwhile they have act...
[23:22:56] GitLab (Infrastructure), Data-Persistence-Backup, serviceops, Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) useful link to see which repos use the most space, provided by Brennen: https://gitlab.wikimedia.org/admin/projects?sort=storag...
[23:26:40] GitLab (Infrastructure), Data-Persistence-Backup, serviceops, Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) >>! In T274463#7959305, @jcrespo wrote: > Yesterday's gitlab full backup was of only 42KB FYI. I would consider that a backup fa...