[07:06:22] GitLab (Infrastructure), serviceops-collab: Troubleshoot partman config for two additional disks on GitLab hosts - https://phabricator.wikimedia.org/T333674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jelto@cumin2002 for host gitlab1004.wikimedia.org with OS bullseye
[07:45:33] GitLab (Infrastructure), serviceops-collab: Troubleshoot partman config for two additional disks on GitLab hosts - https://phabricator.wikimedia.org/T333674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jelto@cumin2002 for host gitlab1004.wikimedia.org with OS bullseye completed: -...
[08:18:11] Deployments, Release-Engineering-Team, Patch-For-Review, Performance-Team (Radar): MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 (Clement_Goubert) ` cgoubert@deploy1002:~$ PHP='php -d auto_prepend_file=/srv/mediawiki/wmf-config/P...
[08:20:53] GitLab (Infrastructure), serviceops-collab: Let's Encrypt certificate expiration notice for domain gitlab.devtools.wmcloud.org - https://phabricator.wikimedia.org/T335161 (Jelto) Thanks for the research @Dzahn ! I also looked at the test instance and discovered three issues: * some stale nginx process...
[08:21:05] GitLab (Infrastructure), serviceops-collab: Let's Encrypt certificate expiration notice for domain gitlab.devtools.wmcloud.org - https://phabricator.wikimedia.org/T335161 (Jelto) p:Triage→Medium
[08:27:22] Release-Engineering-Team (Priority Backlog 📥), Patch-For-Review, Release, Train Deployments: 1.41.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T330211 (jnuche) Open→Resolved
[08:34:06] Deployments, Release-Engineering-Team, Performance-Team (Radar): MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 (Clement_Goubert) Everything looks good from my side, tell me when you have a chance to check and I can remove the backup...
[08:34:27] Deployments, Release-Engineering-Team, serviceops, Performance-Team (Radar): MediaWiki deploy servers should not be mediawiki installation targets - https://phabricator.wikimedia.org/T329857 (Clement_Goubert)
[09:00:05] GitLab (Infrastructure), serviceops-collab: Troubleshoot partman config for two additional disks on GitLab hosts - https://phabricator.wikimedia.org/T333674 (Jelto) Open→Resolved `gitlab1004` was reimaged successfully and has a bigger `/srv/gitlab-backup` partition too. I run the following comma...
[09:00:08] GitLab (Infrastructure), serviceops-collab: Define future design of GitLab backups - https://phabricator.wikimedia.org/T330172 (Jelto)
[13:33:52] Release-Engineering-Team (Seen), MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[13:34:05] Release-Engineering-Team (Seen), MW-on-K8s, SRE, Traffic, and 3 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120 (Clement_Goubert)
[14:13:10] Beta-Cluster-Infrastructure, CirrusSearch, Discovery-Search, Wikidata, wdwb-tech: Cirrus search is broken on beta (April 2023, second occurence) - https://phabricator.wikimedia.org/T335181 (taavi) Open→Resolved a:taavi
[14:33:18] Beta-Cluster-Infrastructure, Infrastructure-Foundations, Puppet: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (joanna_borun)
[14:42:13] Project-Admins, Infrastructure-Foundations, PM, Puppet: Clarify Puppet tag - https://phabricator.wikimedia.org/T295221 (joanna_borun) a:joanna_borun
[14:51:10] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for USER[S] - https://phabricator.wikimedia.org/T335297 (JameelKaisar)
[15:02:40] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for USER[S] - https://phabricator.wikimedia.org/T335297 (jbond) users is a contractor on SRE infrastructure foundations ill vouch, but also cc @joanna_borun
[15:09:36] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (Aklapper)
[15:10:20] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (Aklapper) > * Reason for access - what projects are you planning to host or contribute to: Custom Library created for Network Performance Measurement (T332024) Hmm,...
[15:10:44] GitLab (Infrastructure), serviceops-collab: Let's Encrypt certificate expiration notice for domain gitlab.devtools.wmcloud.org - https://phabricator.wikimedia.org/T335161 (Dzahn) a:Dzahn→Jelto
[15:14:50] (CR) Jforrester: "Is this going to be a new Wikimedia production extension? You put it in the Wikimedia production section but there's no linked task so I d" [integration/config] - https://gerrit.wikimedia.org/r/908941 (owner: Lens0021)
[15:15:02] GitLab (Project Migration), Release-Engineering-Team (Priority Backlog 📥), API Platform, Anti-Harassment, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (EBernhardson)
[15:23:04] (CR) Lens0021: Add new Extension:PageViewInfoGA (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/908941 (owner: Lens0021)
[15:23:39] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (JameelKaisar) Not Exactly. I just require access to create a repo. I have to host some code on it.
[15:32:05] (PS1) Lens0021: Move down PageViewInfoGA from production section [integration/config] - https://gerrit.wikimedia.org/r/911330
[15:32:38] (CR) Lens0021: Add new Extension:PageViewInfoGA (1 comment) [integration/config] - https://gerrit.wikimedia.org/r/908941 (owner: Lens0021)
[15:34:33] Release-Engineering-Team, API Platform, AQS2.0, Platform Engineering, and 4 others: Define a procedure/pattern to populate test environments - https://phabricator.wikimedia.org/T334851 (akosiaris) I am moving this one to #serviceops-radar as we are interested to see how this pans out, but I am no...
[15:47:41] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (thcipriani) Open→Resolved a:thcipriani >>! In T335297#8801937, @Aklapper wrote: >> * Reason for access - what projects are you planning to host or contrib...
[15:50:21] Phabricator, serviceops-collab, PM, Patch-For-Review: Merge the Phabricator Priority values "Low" and "Lowest" - https://phabricator.wikimedia.org/T228759 (BCornwall) @Aklapper There was a ton of bikeshedding on wikitech-l which resulted in nothing getting done. :( I know you didn't solicit my o...
[15:53:56] (PS2) Jforrester: Zuul: [mediawiki/extensions/PageViewInfoGA] Move from production section [integration/config] - https://gerrit.wikimedia.org/r/911330 (owner: Lens0021)
[15:54:02] (CR) Jforrester: [C: +2] "Thank you!" [integration/config] - https://gerrit.wikimedia.org/r/911330 (owner: Lens0021)
[15:55:10] (Merged) jenkins-bot: Zuul: [mediawiki/extensions/PageViewInfoGA] Move from production section [integration/config] - https://gerrit.wikimedia.org/r/911330 (owner: Lens0021)
[15:56:54] Beta-Cluster-Infrastructure, Puppet: Unduplicate beta cluster hiera keys set both in Horizon and in ops/puppet - https://phabricator.wikimedia.org/T277680 (Aklapper) @joanna_borun: Herald added back the project tag due to H389 / T285143. If you'd like that changed, please file a separate ticket - thanks!
[17:07:01] thcipriani: hi, do you know about MW extensions being moved from gerrit to gitlab? see most recent backlog in -operations
[17:08:48] Dreamy_Jazz: well, this is an example of the ProofRead page extension:
[17:08:50] https://gerrit.wikimedia.org/r/q/project:mediawiki%252Fextensions%252FProofreadPage+status:open
[17:09:05] at some point in the past Gerrit URLs changed format
[17:09:10] could it be that too?
[17:09:21] like /r/ vs /r/q/ /r/p/
[17:09:27] mutante: I think you should ask the actual question
[17:10:07] RhinosF1: the actual question is.. what is the response to 17:02 < Dreamy_Jazz> Anyone seeing test failures with messages like "fatal: unable to access 'https://gerrit.wikimedia.org/r/mediawiki/extensions/GeoData/': The requested URL returned error: 502"
[17:10:26] to me that is a 404 and not a 502
[17:10:32] I don’t think the issue has anything to do with a gitlab migration because to my knowledge we’re not switch mediawiki repos yet
[17:10:39] but also I dont know why it would be used by zuul if it's 404
[17:10:45] It is also a 404 for me, but zuul clone said 502
[17:10:48] Dreamy_Jazz: have you tried a recheck
[17:11:01] I'm waiting for gate-and-submit to finish failing before I can try again
[17:11:04] I’m not sure the clone url has ever shown anything user facing
[17:11:30] It’s quite possibly just a transient issue
[17:11:40] RhinosF1: ack, but to my knowledge there was also effort to start moving mw extensions.. but then probably unrelated
[17:11:42] If recheck doesn’t fix it Dreamy_Jazz, try again
[17:11:50] Thanks
[17:11:57] mutante: I can’t find the ticket but that would require a very big announcement
[17:12:04] Not something to just happen
[17:12:14] I just dont get why the URL makes sense, if it's either 5xx or 4xx :)
[17:12:30] mutante: the 4xx I think is gerrit weird layout and expected
[17:12:42] The 5xx is a probably gerrit went weird for a minute
[17:12:44] might be zuul config and that gerrit URL format change was my next thought
[17:12:52] yes, RhinosF1, agreed
[17:15:01] mutante: the url format is fine for anon http
[17:15:18] RhinosF1: mutante: the canonical Gerrit repo urls are https://gerrit.wikimedia.org/r/
[17:15:19] I don’t think gerrit has ever shown something in the UI for the https clone page
[17:15:44] hashar: I think gerrit went weird for a second and produced a 5xx
[17:15:50] not sure why there will be a trailing slash, Apache probably consider them to fail
[17:15:51] Then mutante forgot how gerrit worked
[17:15:59] RhinosF1: aha, thanks
[17:16:23] the Gerrit errors are in logstash if further investigation is needed :]
[17:16:28] hashar: ack yea, but some time in the past we had /r/p/ or something, afair. no worries, thanks
[17:16:34] Recheck is probably best
[17:16:43] More looking can be done if that fails
[17:16:46] +1
[17:16:47] yeah we had /r/p/ a while ago but all those usage should have been migrated
[17:16:47] I have to leave train
[17:17:02] I think it was a change made in preparation for the 2.16 > 3.2 upgrade
[17:17:59] ok, thanks, go back to your vacation, hashar.
[17:18:06] :]
[17:27:23] Phabricator, serviceops-collab, PM, Patch-For-Review: Merge the Phabricator Priority values "Low" and "Lowest" - https://phabricator.wikimedia.org/T228759 (thcipriani) >>! In T228759#8789350, @Aklapper wrote: >>>! In T228759#8718415, @gerritbot wrote: >> aklapper opened https://gitlab.wikimedia.o...
[17:34:25] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (jbond) > We have users marked as external by default at the moment. Maybe we should choose a different term for these automatic tickets. are all users marked default...
[17:36:47] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (brennen) > are all users marked default (including users with a wikimedia.org email address) wikimedia.org and wikimedia.de are marked internal by default, assuming...
[17:37:25] Phabricator, serviceops-collab, PM, Patch-For-Review: Merge the Phabricator Priority values "Low" and "Lowest" - https://phabricator.wikimedia.org/T228759 (brennen) Configuration changes of this type should require very little (if any) user-facing downtime. I'll do the necessary with the deployme...
[17:51:36] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (JameelKaisar) Shouldn't it be **"@(wikimedia.org|wikimedia.de)$"** instead of **"@(wikimedia.org|wikimedia.de)\.com$"**?
[18:01:50] GitLab (Auth & Access), Release-Engineering-Team: Requesting full access to GitLab for jameel - https://phabricator.wikimedia.org/T335297 (brennen) > Shouldn't it be "@(wikimedia.org|wikimedia.de)$" instead of "@(wikimedia.org|wikimedia.de)\.com$"? Good catch. :)
[18:08:07] Phabricator, Release-Engineering-Team, serviceops-collab: Replace existing aphlict1001 with puppet-managed bullseye host - https://phabricator.wikimedia.org/T333452 (brennen)
[19:05:40] maintenance-disconnect-full-disks build 485373 integration-agent-docker-1038 (/: 28%, /srv: 99%, /var/lib/docker: 37%): OFFLINE due to disk space
[19:09:22] GitLab (Project Migration), Phabricator, Release-Engineering-Team (Priority Backlog 📥), serviceops-collab, and 3 others: Migrate active repositories in Phabricator Differential to GitLab - https://phabricator.wikimedia.org/T191182 (Aklapper) @thcipriani: Could you answer T191182#8722238 please? T...
[19:10:36] maintenance-disconnect-full-disks build 485374 integration-agent-docker-1038 (/: 28%, /srv: 79%, /var/lib/docker: 34%): RECOVERY disk space OK
[19:45:37] Phabricator, Release-Engineering-Team, serviceops-collab, Patch-For-Review: Replace existing aphlict1001 with puppet-managed bullseye host - https://phabricator.wikimedia.org/T333452 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=03511766-314f-4a97-9387-a1bbedb36faa) set by eog...
[19:53:41] Phabricator, Release-Engineering-Team, serviceops-collab, Patch-For-Review: Replace existing aphlict1001 with puppet-managed bullseye host - https://phabricator.wikimedia.org/T333452 (eoghan) @brennen and I worked on this today. The new aphlict1002 host is currently active (after [[ https://gerri...
[19:55:02] Phabricator, Release-Engineering-Team, serviceops-collab, Patch-For-Review, User-brennen: Replace existing aphlict1001 with puppet-managed bullseye host - https://phabricator.wikimedia.org/T333452 (brennen)
[20:01:15] Hi releng friends, we've put a new aphlict host in place. It seems to be working ok, and I'm pretty confident there'll be no problems, but I've included revert instructions in the task above just out of caution. If there's anything weird, please let me know (:
[20:01:31] Thanks to brennen for the help!
[20:02:12] great work!:) one thing off the buster list
[20:15:47] maintenance-disconnect-full-disks build 485387 integration-agent-docker-1031 (/: 28%, /srv: 98%, /var/lib/docker: 48%): OFFLINE due to disk space
[20:20:34] maintenance-disconnect-full-disks build 485388 integration-agent-docker-1031 (/: 28%, /srv: 74%, /var/lib/docker: 48%): RECOVERY disk space OK
[20:30:32] Is there a GitLab-world idea re. the pipeline tagging commits, tagging images with that commit, and auto-proposing a deployment-charts commit bumping the helm chart pointer to the new tag? It doesn't do it currently, and I can't find a task in Phab, but before filing a task I wanted to check if this was actually intentional.
[20:32:27] Continuous-Integration-Config, Growth-Team, PageTriage: PageTriage QUnit continuous integration is broken - https://phabricator.wikimedia.org/T335315 (Novem_Linguae)
[20:32:48] seems likely a thing that just hasn't been implemented yet. That was all built out for pipelinelib previously right? I guess unless in practice nobody actually used that workflow...
[20:33:00] Yeah.
[20:33:29] Continuous-Integration-Config, Growth-Team, PageTriage: PageTriage QUnit continuous integration is broken - https://phabricator.wikimedia.org/T335315 (Novem_Linguae)
[20:33:33] bd808: Well, I think lots of people used it but "lots" of "incredibly small number of times services get new releases" is ~0. :-(
[20:35:51] I'm interested in fully automated deployment of blubber->container image->prod k8s for things like developer-portal. It annoys me that I have that automatic workflow for the Cloud VPS deployment (via podman-auto-update) but that it takes 30 minutes of wall clock time and manual steps in prod.
[20:36:10] Yeah, CD for these things would be nice.
[20:37:39] Continuous-Integration-Config, Growth-Team, PageTriage: PageTriage QUnit continuous integration is broken - https://phabricator.wikimedia.org/T335315 (jsn.sherman) Looking at the mobilefrontend extension, I can see there is some extra setup they are doing for qunit: https://gerrit.wikimedia.org/r/plu...
[20:40:07] IIRC, this previously worked for the pipeline
[20:40:24] Does https://gitlab.wikimedia.org/repos/releng/blubber/-/blob/main/.gitlab-ci.yml#L45 do the needful?
[20:40:47] https://gitlab.wikimedia.org/repos/releng/kokkuri#publish-an-image-variant
[20:40:51] Not in our experience, we just had to drop it.
[20:41:13] But certainly there's no auto-commit like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/908882 proposed.
[20:42:45] yeah, we publish an image, but we no longer automatically update the helm chart values (is my memory) I talked about this with jeena at one point, but it was tricky for Reasons™ that are no longer top of my mind
[20:43:04] .kokkuri:deploy-image is maybe the (undocumented) thing: https://gitlab.wikimedia.org/repos/releng/kokkuri/-/blob/main/includes/images.yaml#L132
[20:43:24] we may have ended up turning it off on the old pipeline as well due to noise: https://gerrit.wikimedia.org/r/q/topic:pipeline-promote
[20:44:38] James_F: for clarity, are you asking specifically about the chart autobump? Or + chart autobump?
[20:45:18] thcipriani: The former I think, but the latter maybe unless kindrobot thinks otherwise.
[20:45:56] GitLab, Release Pipeline (Blubber): Provide a mechanism on GitLab to more easily deploy newly-created images to Wikimedia production - https://phabricator.wikimedia.org/T335316 (Jdforrester-WMF)
[20:46:00] Filed ^
[20:46:43] that kokkuri deploy image thing that bd808 found runs helm in the "staging" namespace on the production k8s and then runs "helm test" against the deployment
[20:46:49] .kokkuri:deploy-image looks like it only knows how to do things in a "CI_STAGING" environment per https://gitlab.wikimedia.org/repos/releng/kokkuri/-/blob/main/lib/image.py#L339 (and it doesn't do this via modifying charts)
[20:46:56] Ooh.
[20:47:05] yeah, that :)
[20:47:13] Shouldn't it go through the chart though?
[20:47:25] (This is getting beyond my knowledge of this stuff.)
[20:47:50] it probably uses the chart and modifies the image value on the command line
[20:47:50] It overloads the chart's main_app.version with a tmp file
[20:47:56] that was meant as a little used gate-and-submit smoke test in the old pipeline world.
[20:48:00] Right.
[20:48:42] * bd808 waves to jeena
[20:48:44] that is, most charts used "service-checker" to ensure that the new images were doing things like responding with a 200 when someone hits /
[20:48:56] hi bd808 :)
[20:49:36] but the automagic chart-bumping is something... jeena can say smarter things than I can about :)
[20:49:42] Ack.
[20:50:19] I'm sure we could find a way to implement that, but would need gerrit credentials on gitlab or something since the charts repo is still in gerrit
[20:50:24] or move it to gitlab?
[20:52:38] I guess we used whatever pipelinelibbots creds were in the old jenkins jobs. Could, I suppose, use those same credentials in some kind of post-merge job ¯\_(ツ)_/¯
[20:52:54] Moving deployment-charts to GitLab makes sense.
[20:53:05] But SRE ServiceOps might be wary?
[20:53:37] but I did think most people found that updating action annoying eventually 😅 Although with gitlab ci we could make it run on a button push
[20:53:54] That'd be a neat improvement.
[20:53:58] (Sorry for making more work.)
[20:54:09] * thcipriani likes buttons
[20:55:42] I think it'd just be configuration in the gitlab-ci.yml
[21:15:57] I've also seen it done where pushing a git tag (e.g. v1.0.1) triggers a docker build with the same tag, and then a push is done to a helm values.yaml which triggers a helm upgrade in CI.
[21:16:41] Helm folks recommend using semantic version tags in values.yaml instead of latest so that you can reason about a rollback.
[21:25:23] GitLab (CI & Job Runners), Release-Engineering-Team: Consider adding "official" node images to list of allowed images for gitlab runners - https://phabricator.wikimedia.org/T335320 (Mhurd)
[21:28:24] Beta-Cluster-Infrastructure: beta-scap-sync-world failure — 'No space left on device' - https://phabricator.wikimedia.org/T327853 (TheresNoTime) a:TheresNoTime
[21:28:38] Phabricator, I18n: Typo in Phabricator (Audit) in English - https://phabricator.wikimedia.org/T312053 (Pppery)
[21:28:49] Phabricator, I18n: Typo in Phabricator (Arcanist) in English - https://phabricator.wikimedia.org/T312052 (Pppery)
[21:28:53] Phabricator, I18n: 2 simple typos in English for Phabricator (Differential) - https://phabricator.wikimedia.org/T312051 (Pppery)
[21:36:03] Continuous-Integration-Config, Growth-Team, PageTriage: PageTriage QUnit continuous integration is broken - https://phabricator.wikimedia.org/T335315 (jsn.sherman) Looks like this was a false alarm on my part?
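The tag-driven flow mentioned at 21:15:57 and 21:16:41 (a git tag such as v1.0.1 becomes the image tag, which is then written into a helm values.yaml) could be sketched roughly as below. This is only an illustration under stated assumptions: the function name `bump_image_version` is hypothetical, the values layout (a top-level `main_app:` block with a `version:` key) is inferred from the `main_app.version` remark at 20:47:50, and plain text rewriting stands in for whatever tooling a real job would use. It is not how kokkuri or the old pipeline did it.

```python
import re

# Accept only semantic-version style tags (optionally prefixed with "v"),
# so a floating tag such as "latest" is rejected and rollbacks stay traceable.
SEMVER_TAG = re.compile(r"v?\d+\.\d+\.\d+")


def bump_image_version(values_text: str, git_tag: str) -> str:
    """Return values_text with main_app.version set to git_tag.

    Assumption: values.yaml has a top-level "main_app:" block containing an
    indented "version:" line. The first such line in the block is rewritten.
    """
    if not SEMVER_TAG.fullmatch(git_tag):
        raise ValueError(f"not a semantic version tag: {git_tag!r}")

    lines = values_text.splitlines()
    in_main_app = False
    replaced = False
    for i, line in enumerate(lines):
        if re.match(r"^main_app:\s*$", line):
            in_main_app = True
            continue
        if in_main_app:
            # A non-empty, unindented line means the main_app block has ended.
            if line and not line.startswith((" ", "\t")):
                in_main_app = False
                continue
            m = re.match(r"^(\s+version:\s*).*$", line)
            if m and not replaced:
                lines[i] = m.group(1) + git_tag
                replaced = True
    if not replaced:
        raise KeyError("main_app.version not found")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "main_app:\n  image: repos/example\n  version: v1.0.0\n"
    print(bump_image_version(sample, "v1.0.1"), end="")
```

In a CI job the returned text would be committed as a proposed change to the charts repository, which is where the credentials question raised at 20:50:19 comes in.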
[21:37:40] !log (deployment-prep) Add volume to `deployment-mwmaint02` (`mwmaint`) and moved `/srv` - T327853
[21:37:44] Logged the message at https://wikitech.wikimedia.org/wiki/Release_Engineering/SAL
[21:37:44] T327853: beta-scap-sync-world failure — 'No space left on device' - https://phabricator.wikimedia.org/T327853
[21:39:34] Beta-Cluster-Infrastructure: beta-scap-sync-world failure — 'No space left on device' - https://phabricator.wikimedia.org/T327853 (TheresNoTime) Open→Resolved ` samtar@deployment-mwmaint02:~$ df -h Filesystem Size Used Avail Use% Mounted on udev 2.0G 0 2.0G 0% /dev tmpfs...
[21:41:18] Continuous-Integration-Config, Growth-Team, PageTriage: PageTriage QUnit continuous integration is broken - https://phabricator.wikimedia.org/T335315 (jsn.sherman) Open→Invalid I raised this issue while apparently being unable to read the word 'qunit' on my screen; apologies for the false alarm!
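The regex point raised at 17:51:36 on T335297 can be checked with a small sketch. Python's `re` module is used here purely for illustration; the real setting is evaluated by GitLab, whose regex dialect may differ in details. The first two patterns are quoted from the log, while `STRICT` (with escaped dots) and the helper `is_internal` are assumptions added for comparison and do not appear in the discussion.

```python
import re

# Pattern as originally quoted in the task: the stray "\.com" means a plain
# wikimedia.org address never matches.
WITH_DOT_COM = re.compile(r"@(wikimedia.org|wikimedia.de)\.com$")

# Pattern proposed in the task at 17:51:36.
PROPOSED = re.compile(r"@(wikimedia.org|wikimedia.de)$")

# Assumption, not from the log: escaping the dots so "." only matches a
# literal dot rather than any character.
STRICT = re.compile(r"@(wikimedia\.org|wikimedia\.de)$")


def is_internal(email: str, pattern: re.Pattern) -> bool:
    """Return True if the address ends with a domain the pattern accepts."""
    return pattern.search(email) is not None


if __name__ == "__main__":
    for address in ("someone@wikimedia.org",
                    "someone@wikimedia.org.com",
                    "someone@wikimediaXorg"):
        print(address,
              is_internal(address, WITH_DOT_COM),
              is_internal(address, PROPOSED),
              is_internal(address, STRICT))
```

The third sample address shows why the escaped form is slightly tighter: an unescaped `.` also matches the `X`.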