[09:01:43] GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (Antoine_Quhen) Resolved→Open Hi, it's still happening now: * https://gitlab.wikimedia.org/repos/data-engineering/conda-base-env/-/jobs/21111 `E: You don't have enough free space in /var/cach...
[12:21:21] GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (hashar) > ⚠ rerunning the job is not working currently. The issue is partly related to the jobs. If their builds take a lot of gigabytes, especially in the cache, they have a high chance of filling the r...
[12:21:23] Hi gitlab people - we have reopened https://phabricator.wikimedia.org/T310593 as we're still experiencing problems with CI runners - Could you please re-investigate? Many thanks in advance
[12:22:33] GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (hashar) I have brought the topic to the Wednesday Gitlab sync meeting but this week it got used to upgrade Phabricator and we haven't had a chance to talk about it. The workaround is to manually prune t...
[12:23:04] !log runner-1026: `docker volume prune -f`
[12:23:05] hashar: Not expecting to hear !log here
[12:23:08] what
[12:23:14] stashbot: help
[12:23:15] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help.
[12:23:16] \o/
[12:23:51] too lazy to set up stashbot there, I will !log in #wikimedia-releng
[12:24:41] joal: there might be some jobs that use a lot of disk space, I have seen some airflow/data build crafting a 5GB container
[12:24:58] joal: and the saved caches might be too big, or at least they seem to accumulate ad vitam
[12:25:02] on a per repo/job basis
[12:25:06] but split per instances
[12:25:18] We know that - the dependencies are big here
[12:25:34] so eventually a repo/job cache would end up stored on each of the runners rather than shared
[12:25:48] I wonder if there would be ways to reuse the built images
[12:25:56] maybe gitlab has a way to NOT cache some materials (I have no clue how)
[12:26:18] and if the artifacts are retrieved from a "local" source such as archiva.wikimedia.org, that is probably good enough
[12:26:37] saves some disk used by the cache at the expense of IO/network to transfer them from the source
[12:26:59] I don't think we use archiva artifacts here - only python deps
[12:27:04] and I have ZERO idea how docker images are built when inside a gitlab runner process :-\
[12:27:14] I need to talk with ottomata about this - he's the one having set that up
[12:28:30] he had an experiment with conda, which looked like yet another package manager, but I haven't dug into that rabbit hole :]
[12:29:15] yeah - my assumption is that the conda thing builds a new docker image every time, leading to no caching
[12:30:51] thanks for the manual pruning hashar
[12:31:13]
[12:31:28] like `docker volume ls` shows the list of volumes
[12:31:34] but there is no way to see their size
[12:31:50] to do that you need to inspect the disk file system in verbose mode: `docker system df -v`
[12:56:24] GitLab: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (hashar) Note to self, one can see the volume sizes with: `docker system df -v`.
On runner-1024 there are a few 1 GB, 1.2 GB and 1.5 GB volumes piled up: ` runner-4kunvzhc-project-359-concurrent-0-cach...
[15:12:02] taavi has set up Prometheus for the `gitlab-runners` WMCS project
[15:12:03] https://grafana-cloud.wikimedia.org/d/8Npp-46Zz/project-overview?orgId=1&var-project=gitlab-runners
[15:12:09] that is the overview of instances
[15:12:27] per instance details can be retrieved from another dashboard here for runner-1025 https://grafana-cloud.wikimedia.org/d/000000590/instance-details?orgId=1&var-project=gitlab-runners&var-job=node&var-node=runner-1025
[15:12:37] other hosts can be accessed by filling in their hostname in the Hosts: field
[15:13:00] Grafana should show a nice list, but that requires an upgrade of Prometheus which is known and will eventually happen
[15:13:14] tldr: gitlab-runners instances now have metrics collected
[15:14:32] excellent
[15:15:57] I saw a couple of red error boxes (saying something about a template) when I visited https://grafana-cloud.wikimedia.org/d/000000590/instance-details?orgId=1&var-project=gitlab-runners&var-job=node&var-node=runner-1025
[15:16:04] they disappeared after a short while
[15:22:22] yeah
[15:22:43] so Grafana tries to reach some unknown endpoint in prometheus.wmflabs.org which thus does not set the CORS headers
[15:22:49] and the browser rejects the request
[15:23:01] then the Prometheus API endpoint does not exist anyway
[15:23:16] so #wikimedia-cloud-admin told me they are working on upgrading Prometheus eventually which will solve that
[15:23:29] so not perfect, but at least we have some kind of overview of the instances now!
[15:23:52] I have bookmarked https://grafana-cloud.wikimedia.org/d/8Npp-46Zz/project-overview?orgId=1&var-project=integration
[15:23:56] and https://grafana-cloud.wikimedia.org/d/8Npp-46Zz/project-overview?orgId=1&var-project=gitlab-runners
[15:24:36] I don't think they show the /var/lib/docker disk usage though
[15:24:51] then one can click the various runners to get their detailed metrics
[15:25:38] for the caches filling up the instances, they should probably be offloaded to Swift / S3
[15:26:18] anyway, I have added a few bits to the etherpad :)
[15:27:39] We might want to employ the docker-gc tool that I built.
[15:49:04] ^
[15:49:08] GitLab, Gitlab-Application-Security-Pipeline, Release-Engineering-Team: Gitlab pipeline not working with "docker-registry.wikimedia.org/releng/" images - https://phabricator.wikimedia.org/T310718 (sbassett) Hey #release-engineering-team - guessing the `releng` path might need to be explicitly allow-l...
[15:52:39] although i also think putting a lot of time into a wmcs-specific solution is probably wasted effort if we're going to move these to digitalocean.
[15:57:55] GitLab, Gitlab-Application-Security-Pipeline, Release-Engineering-Team: Gitlab pipeline not working with "docker-registry.wikimedia.org/releng/" images - https://phabricator.wikimedia.org/T310718 (brennen) Bit slammed at the moment but this probably dupes {T310535}.
[16:00:20] GitLab, Gitlab-Application-Security-Pipeline, Release-Engineering-Team: Gitlab pipeline not working with "docker-registry.wikimedia.org/releng/" images - https://phabricator.wikimedia.org/T310718 (sbassett)
[16:00:55] GitLab (CI & Job Runners), Release-Engineering-Team, Patch-For-Review, User-brennen: GitLab runners: allowed_images patterns need to be loosened to include subdirectories - https://phabricator.wikimedia.org/T310535 (sbassett)
[16:01:15] GitLab, Gitlab-Application-Security-Pipeline, Release-Engineering-Team: Gitlab pipeline not working with "docker-registry.wikimedia.org/releng/" images - https://phabricator.wikimedia.org/T310718 (sbassett) >>! In T310718#8009379, @brennen wrote: > Bit slammed at the moment but this probably dupes {T...
[16:22:51] GitLab, Data-Engineering: Experiencing pipeline failure due to disk-space issues - https://phabricator.wikimedia.org/T310593 (JAllemandou)
[16:38:41] The docker GC issues will follow us wherever the runners go.
[16:39:20] fair.
[18:18:38] GitLab (CI & Job Runners), Release-Engineering-Team, Patch-For-Review, User-brennen: GitLab runners: allowed_images patterns need to be loosened to include subdirectories - https://phabricator.wikimedia.org/T310535 (brennen) @sbassett: With that patch merged, this should take effect once runners...
[18:19:20] GitLab, Gitlab-Application-Security-Pipeline, Release-Engineering-Team: Gitlab pipeline not working with "docker-registry.wikimedia.org/releng/" images - https://phabricator.wikimedia.org/T310718 (brennen) From other task: > @sbassett: With that patch merged, this should take effect once runners are...
[18:19:44] GitLab (CI & Job Runners), Release-Engineering-Team, Patch-For-Review, User-brennen: GitLab runners: allowed_images patterns need to be loosened to include subdirectories - https://phabricator.wikimedia.org/T310535 (sbassett) >>!
In T310535#8009874, @brennen wrote: > @sbassett: With that patch me...
[18:51:30] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (Dzahn)
[18:53:26] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gitlab-runner1001.eqiad.wmnet` - gitlab-runner1001.eqiad.w...
[19:01:02] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin2002 for hosts: `gitlab-runner1001.eqiad.wmnet` - gitlab-runner1001.eqiad.w...
[19:39:45] GitLab (Infrastructure), serviceops, Patch-For-Review: bring new gitlab hardware servers into production - https://phabricator.wikimedia.org/T307142 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aokoth@cumin1001 for hosts: `gitlab-runner2001.codfw.wmnet` - gitlab-runner2001.codfw....
[20:53:53] GitLab (CI & Job Runners), Release-Engineering-Team (GitLab-a-thon 🦊), User-brennen: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271 (Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/806250
[21:50:14] GitLab (CI & Job Runners), Patch-For-Review, Release-Engineering-Team (GitLab-a-thon 🦊), User-brennen: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271 (Dzahn) buildkitd is now running on all (6) gitlab-runners. It's 6 because the VMs 1001 and 2001 have been...
[22:00:02] GitLab (CI & Job Runners), Patch-For-Review, Release-Engineering-Team (GitLab-a-thon 🦊), User-brennen: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271 (dancy) Thank you very much @Dzahn !
[22:08:48] GitLab (CI & Job Runners), Patch-For-Review, Release-Engineering-Team (GitLab-a-thon 🦊), User-brennen: Deploy buildkitd to trusted GitLab runners - https://phabricator.wikimedia.org/T308271 (Dzahn) p:High→Medium It's deployed but we have some follow-ups. I guess lowering the prio a bit is...