[04:41:10] releases1002 is full on its /srv/docker: https://phabricator.wikimedia.org/T288024
[05:30:44] marostegui: did anyone clean it?
[05:31:10] joe: I haven't
[05:31:15] ack
[05:31:23] joe: I don't really know what can or cannot be deleted there
[05:33:38] marostegui: yeah don't worry, I'll act on it
[05:33:56] grazie
[05:51:47] FYI: I opened a task to propose dropping the "Long running screen/tmux" alert. If you have opinions in either direction please follow up on https://phabricator.wikimedia.org/T288028
[05:52:50] I can't think of a single time this wasn't alerting on Icinga, and can't think of a single time in recent months where it found a legitimate issue worth alerting about
[05:59:40] moritzm: I think I first proposed to drop it the week after it was introduced :}
[06:03:35] ha :-)
[08:17:32] so, I'm trying out jbond's work https://gerrit.wikimedia.org/r/c/operations/puppet/+/692286 on my self-hosted puppetmaster, but I get this error:
[08:17:35] Error while evaluating a Resource Statement, Evaluation Error: Unknown function: 'puppetdb_query'
[08:18:07] I see others on the internet had the same problem, e.g.: https://puppet-users.narkive.com/29efkKcx/puppet-users-puppetdb-query-missing-from-agent and https://tickets.puppetlabs.com/browse/PDB-3655
[08:18:45] does anyone have an idea where this function may be coming from, and how to get it on my wmfcloud puppet environment?
[08:19:05] do you have puppetdb set up on your self-hosted puppetmaster?
[08:21:12] https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster/PuppetDB
[08:21:47] majavah: thanks
[08:39:37] obligatory pontoon plug: please reach out if you'd like to try an easier way to run a puppetmaster + puppetdb in general
[09:40:59] godog, were you possibly doing some prometheus maintenance around 08:08?
[09:41:40] jynus: yes that's correct, I upgraded prometheus on prometheus1003
[09:41:55] the "bad certificate" error came back at that time: "Aug 04 08:08:57 backup1004 minio[2621]: http: TLS handshake error from 10.64.0.123:41564: remote error: tls: bad certificate"
[09:42:03] for minio
[09:42:16] was something changed on the certs or something?
[09:42:39] not afaik
[09:43:07] let me research - the issue was fixed before (it was caused by me forgetting to add the fqdn on the request)
[09:43:14] but it came back at that time
[09:43:18] on all hosts
[09:44:08] interesting, it is possible prometheus changed behaviour if/when sending certs
[09:44:21] let me know what you find, I can help too if needed
[09:44:28] but if it was a larger issue, there are other ssl checks
[09:44:41] task for the upgrade is https://phabricator.wikimedia.org/T222113 FWIW
[09:44:42] so that is weird
[09:44:53] thanks, that is exactly what I needed
[09:45:08] I will debug like last time and see what I can find
[09:45:31] could be another issue I created that only shows up during maintenance or something
[09:52:55] I am confused - curl works and the configuration hasn't changed
[09:54:32] but I think it has to be something on prometheus, because it didn't fail on each prometheus host at the same time
[09:57:34] yeah, pretty sure validation on the prometheus side must have changed
[09:57:47] :-(
[09:58:03] even if I run curl as the prometheus user, the scraping works
[09:58:24] but I'm confused as to why availability is 0%, is prometheus1004 failing too?
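
For context on the puppetdb_query error above: that function is only available when the puppetmaster has PuppetDB configured, since it runs a query against PuppetDB at catalog compile time. As a rough illustration of what happens behind the scenes, here is a minimal sketch that queries PuppetDB's HTTP API directly; the endpoint and query are hypothetical and assume an unauthenticated PuppetDB listening on localhost, as in a default standalone setup.

```python
# Minimal sketch, not the WMF setup: ask PuppetDB for all node certnames,
# roughly what puppet's puppetdb_query('nodes[certname] {}') returns.
import json
import urllib.request

# Hypothetical endpoint; a standalone puppetmaster typically runs PuppetDB
# on localhost:8080 without TLS or authentication.
PUPPETDB = "http://localhost:8080/pdb/query/v4"

payload = json.dumps({"query": "nodes[certname] {}"}).encode()
req = urllib.request.Request(
    PUPPETDB,
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    for node in json.load(resp):
        print(node["certname"])
```
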
[09:58:43] ah no nevermind, I know why
[09:58:44] it is not, at least I only get errors from 1003 in the logs
[09:59:07] yeah that makes sense, I'll take a look
[10:00:12] maybe it is a simple thing like a race condition with the puppet run and the puppet cert wasn't loaded?
[10:00:36] my guess on what is different is that the other https endpoints use certgen
[10:00:54] or another cert
[10:01:52] let me report my findings on that ticket - this is not critical for me at the moment (I am not running backups right now)
[10:02:51] SGTM
[10:03:20] so please don't leave everything for this, but it will be important later on
[10:03:49] jynus: FYI the error on the prometheus side is indeed with the cert
[10:03:51] Get "https://backup1004.eqiad.wmnet:9000/minio/v2/metrics/cluster": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
[10:04:05] I'll update the task as well and then need to run to lunch
[10:04:07] ah
[10:04:28] I don't mind if I have to change how I do certs on the client
[10:04:28] elukey: the ROCm repository key is expired, which blocks repo updates, could you figure out what the new key is and update it?
[10:05:53] jynus: yeah, looks like SAN is a requirement now
[10:07:38] cool, thanks, that is super-helpful
[10:07:56] moritzm: sure, do you need it now (namely, is it blocking things) or is after lunch ok?
[10:08:09] we can talk later, I don't mind adapting on my side, I just had to know the new requirements 0:-)
[10:08:30] jynus: sounds good, thank you! ttyl
[10:09:06] should be https://repo.radeon.com/rocm/apt/debian/dists/xenial/ in theory
[10:09:07] elukey: thanks! any time today is fine
[10:09:43] this is needed for the gitlab import, but it can wait a few hours for sure
[11:26:49] as a heads up: I'll be running docker pull tests in eqiad again, this time from 73 parallel servers max (excluding the ones causing sessionstore alerts this time ;-))
[12:47:51] moritzm, jelto - updated the ROCm gpg key, and tried "reprepro --noskipold --component thirdparty/gitlab checkupdate buster-wikimedia" on apt1001, all works afaics
[12:48:19] (no errors related to ROCm etc. displayed)
[12:48:32] I hope the upgrade is unblocked, lemme know otherwise!
[13:22:05] jayme: heads up, I'm disabling puppet on all appservers
[13:22:11] applying a change to apache
[13:22:39] joe: okay. I'm just running my stuff in eqiad, no puppet changes planned
[13:22:48] ack
[13:22:53] very curious about the results
[13:22:55] btw
[13:23:02] use a recent mw image if possible
[13:24:25] I actually wanted to use the same one as for all the other tests
[14:34:40] jynus: re: minio certs, yeah I think the simplest (maybe not easiest) fix is to use certs with the fqdn in SAN, that's my understanding of what would make it work
[14:35:52] in other words, have the fqdn in SAN in addition to (or replacing) CN
[14:49:13] godog: are we running minio?
[14:49:21] I wasn't aware :)
[14:49:24] moritzm: I've just added my key to pwstore; can you add me to .users please? [if you'd rather I emailed about this, happy to...]
[14:51:28] joe: indeed, used for object storage backups only afaik
[14:51:47] oh ok :)
[14:54:33] Emperor: sure thing, I've added your key and re-encrypted the secrets, let me know if you run into any issues
[14:55:30] Emperor`: be very very glad we no longer require cross-signing of gpg keys via public key servers. that was hell.
[14:55:46] stupid power blip.
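
A note on the x509 error quoted at 10:03:51: it means the minio certificate identifies the host only in the legacy Common Name field, while Go 1.15+ clients (such as a freshly upgraded Prometheus) require the name to appear in the Subject Alternative Name extension. Below is a minimal sketch for checking what a certificate exposes, assuming the third-party cryptography package and the hypothetical host/port shown; the certificate is fetched without validation.

```python
# Minimal sketch: fetch a server certificate and list its DNS SANs (if any).
# An empty result means the cert is CN-only, which Go 1.15+ clients reject.
# Assumes a recent version of the 'cryptography' package.
import ssl

from cryptography import x509
from cryptography.x509.oid import ExtensionOID


def dns_sans(host, port=9000):
    pem = ssl.get_server_certificate((host, port))  # fetch only, no validation
    cert = x509.load_pem_x509_certificate(pem.encode())
    try:
        san = cert.extensions.get_extension_for_oid(
            ExtensionOID.SUBJECT_ALTERNATIVE_NAME
        ).value
    except x509.ExtensionNotFound:
        return []
    return san.get_values_for_type(x509.DNSName)


# Hypothetical invocation:
# print(dns_sans("backup1004.eqiad.wmnet"))
```
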
[14:56:15] kormat: I just signed my WM ID with my Debian key :)
[15:02:50] moritzm: trying to fetch one of the trusted keys isn't working (I got yours fine): "gpg: key 37E9B5C6F5F6A067: new key but contains no user ID - skipped"
[15:03:02] when I do gpg --recv-keys B049B180212E42A7AFB4D43337E9B5C6F5F6A067
[15:04:54] these days the keys are all fetched from the keys/ directory within the repo, but most older keys are still on the SKS key servers
[15:05:15] pwstore from wmf-sre-laptop is patched to fetch them from the keys directory
[15:05:41] 37E9B5C6F5F6A067 is the one from mutante
[15:05:48] (Daniel Zahn)
[15:09:33] moritzm: right; I did pws update-keyring (seems OK) but pws ed management (or anything else) says "Warning: gpg returned non-zero exit status 2 when decrypting rt."
[15:12:17] ...but I can decrypt secrets by hand from that repo
[15:13:33] (my pws is from wmf-sre-laptop)
[15:16:29] hmmh, not sure what that is, it works for me. but if you can "gpg --decrypt rt", the addition of your key and the re-encryption worked fine
[15:16:41] yeah, that works.
[15:17:09] maybe some non-standard options in gpg.conf are messing with what pwstore expects to parse or so, not sure
[15:17:20] gpg, so easy to use
[15:18:46] Huh, I straced it and it worked.
[15:18:58] it performs under pressure
[15:19:25] ...and now it works without strace. This is going to be a gpg race, isn't it?
[15:20:12] Anyhow, now it seems to be behaving itself
[15:24:24] it might be the ad hoc gpg agent that gets started since gpg 2.0? if you ran gpg --decrypt manually in between, the agent will now be around. maybe try killing it and then re-run "pws ed rt" to see if you can repro?
[15:24:34] joe: numbers look nice https://people.wikimedia.org/~jayme/pulltiming_dragonfly_73_nodes.html
[15:25:41] moritzm: that's it, yes - if I kill the gpg agent, I then get the failure mode again. And then if I decrypt something by hand, that restarts the agent and then pws works again
[15:27:29] Emperor: I'm pleased to see the pwstore setup is as usable as ever :) welcome aboard
[15:27:50] proper linux as a desktop experience
[15:31:05] jayme: most importantly, you didn't bring down production
[15:31:18] ok, that explains it. if there's an option to force the startup of the agent as part of the gpg invocation in pws, we can patch this in wmf-sre-laptop (or submit it to https://github.com/weaselp/pwstore/, which our patched version is based on)
[15:33:31] marostegui: soft ping regarding https://phabricator.wikimedia.org/T263127 - looking at planning for August, do you expect this to happen in August?
[15:34:11] Krinkle: I might, but only for s6
[15:37:19] (oops, on rereading, that joke came off a lot meaner than I meant it to -- sorry about that)
[15:38:18] rzl: and here we were, hopeful to have finally corrupted your soul
[15:41:13] jayme: that's a nice use of plotly! Did you add the plot invocation there yourself or was that generated with a UI of sorts? Seems like that might save me some trouble if the latter (and it's nicely standalone it seems, no network reqs)
[15:43:22] Krinkle: it's just a bit of potentially horrible python scripting https://phabricator.wikimedia.org/P15954
[15:44:33] joe: yeah - and that :-) registry traffic does still peak around 10MB/s, while overall traffic in the P2P network peaks at ~1.5GB/s
[15:45:10] jayme: oh, so there's a plotly python lib that generates a standalone html file for you? that's pretty cool.
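
To Krinkle's question: yes, plotly's Python library can emit a fully self-contained HTML page. A minimal sketch along those lines follows, with made-up node names and timings; this is not the actual script from P15954.

```python
# Minimal sketch with hypothetical data, not jayme's script from P15954.
import plotly.graph_objects as go

nodes = ["kubernetes1001", "kubernetes1002", "kubernetes1003"]  # hypothetical
pull_seconds = [12.4, 13.1, 11.8]                               # hypothetical

fig = go.Figure(go.Bar(x=nodes, y=pull_seconds))
fig.update_layout(title="docker pull time per node (s)")

# include_plotlyjs=True embeds the full plotly.js bundle, which makes the page
# standalone (no network requests) but several MB in size; include_plotlyjs="cdn"
# keeps the file small at the cost of loading plotly.js from cdn.plot.ly.
fig.write_html("pulltiming.html", include_plotlyjs=True)
```
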
[15:45:23] I was looking around in their online UI to see if there was an "Export as html file" option, which there doesn't appear to be.
[15:45:48] Krinkle: yeah. The files are pretty big, though.
[15:46:19] the whole page is smaller than a typical favicon.ico file though
[15:46:30] but yeah, from a clean html perspective, it's large.
[15:46:57] what? :D the page is 3.4M
[15:46:57] (which, of course, says more about favicons than plotly, but alas..)
[15:47:06] I hope no favicon is that big
[15:47:19] jayme: oh, wow, okay I misread that. Yes, that is big.
[15:47:57] it's 0.9M on the wire. I recall that being smaller in the past..
[15:48:00] * Krinkle checks other plots
[15:48:20] I was thinking something like 100K or so
[15:49:04] darn, no, their "regular" distribution is indeed that size now
[15:49:06] https://cdn.plot.ly/plotly-2.2.0.min.js
[16:08:40] jayme: I'm currently having a problem with 'docker push' of a large image to the registry. It uploads about 5GB of the 6GB, then says "Retrying in 10 seconds", then it restarts. It has done this several times so far.
[16:08:43] No error message
[16:09:37] not sure jayme's still around
[16:10:33] actually I'm just about to leave. But feel free to write something down somewhere and I can take a look/reproduce tomorrow
[16:11:05] Will do. Thanks.
[16:11:06] if that's okay... if it's urgent, I can potentially look in a couple of hours
[16:11:46] o/
[16:12:47] dancy: can you retry now?
[16:13:18] It's running now
[16:13:42] and it stopped and is in the retry wait again... and now pushing again..
[16:14:35] I can't find a good reason in the logs of the registries
[16:15:06] ok, found one
[16:17:07] on the sending side (dockerd): msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"
[16:17:29] yeah, I think the problem is at the nginx layer
[16:17:46] I was leaning toward blaming NGINX
[16:18:22] hah
[16:18:23] 2021/08/04 16:13:19 [crit] 24078#24078: *4325 pwrite() "/var/lib/nginx/body/0000001575" failed (28: No space left on device), client: 10.64.48.17, server: , request: "PATCH
[16:18:25] /v2/restricted/mediawiki-multiversion/blobs/uploads/857c8f0d-93f8-4af8-a530-147a057ae3ff?_state=2ZLT3RW5zpsfQhQJu-KVnsohlzknQowPCgsXpRMH_UR7Ik5hbWUiOiJyZXN0cmljdGVkL21lZGlhd2lraS1tdWx0aXZlcnNpb24iLCJVVUlEIjoiODU3YzhmMGQtOTNmOC00YWY4LWE1MzAtMTQ3YTA1N2FlM2ZmIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDIxLTA4LTA0VDE2OjA1OjAyLjY0ODY0MDM4MVoifQ%3D%3D HTTP/1.1", host: "docker-registry.discovery.wmnet"
[16:19:06] that'll do it
[16:20:36] try now :)
[16:22:29] restarting...
[16:22:34] we had just 5 GB of space left on disk
[16:23:30] because we started caching blobs in nginx (only binary, sha256-based URLs, so they're unique and we don't have to worry about evictions, btw)
[16:24:17] so I just went and took the Attila approach: find . -type f -delete
[16:24:30] Seek and destroy
[16:25:58] I'm going to take a break before the two hours of meetings in which I'm presenting stuff 😱
[16:26:21] ok!
[16:31:48] failed again. :-(
[16:35:01] dancy: sigh, I was mistaken. /var/lib/nginx is a separate directory
[16:35:53] err, partition
[16:35:56] and it's 1 GB
[16:36:03] so I guess one layer got over that size
[16:36:06] 👀
[16:36:17] yeah, no idea why it was done this way
[16:37:03] joe: knowing ~partman~, I'd be willing to bet this could be a recipe not doing what someone expected
[16:37:25] kormat: possibly; for now I'll just unmount it
[16:37:33] dancy: gimme 5 minutes more
[16:37:55] thx joe. I'm in a meeting now and the retries are happening automatically, so I'm not stuck at the moment.
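
What is happening here: nginx buffers large request bodies to its client-body temp directory (the /var/lib/nginx/body path in the log), so a single layer bigger than the free space on that mount keeps failing with a 500 no matter how often docker retries. A hypothetical pre-flight check along those lines, assuming the path shown:

```python
# Minimal sketch: warn when a docker layer will not fit in nginx's
# client-body temp space (here assumed to live on the /var/lib/nginx mount).
import shutil

NGINX_BODY_TMP = "/var/lib/nginx"  # hypothetical path (Debian default)


def fits_in_temp_space(layer_bytes, temp_dir=NGINX_BODY_TMP):
    free = shutil.disk_usage(temp_dir).free
    return layer_bytes < free


# e.g. a 2 GiB layer against a 1 GiB mount would return False here
print(fits_in_temp_space(2 * 1024**3))
```
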
[16:39:21] joe: /var/lib/nginx is _tmpfs_
[16:39:34] oh man.
[16:40:02] kormat: yeah, in the meantime I've noticed
[16:40:10] that was likely done for performance
[16:40:24] so I am thinking of enlarging it a bit for now
[16:43:53] ok, bumped to 2 G now
[16:44:29] inb4 dancy's 6G image has 1 layer which is 2.0001 GB
[16:46:18] kormat: no, the layer adapts to the size of the buffers in nginx
[16:46:33] :D
[16:53:40] success!
[16:55:56] (scare quotes implied)
[16:56:14] dancy: I'm relieved, it just took me 3 attempts to get it right
[17:39:07] one last question, joe, sorry: is there any delay between the current mw deploy and the image rebuild? I guess that will take time?
[17:40:05] jynus: look at SAL, the delay between the scap deploy and the deploy to k8s by "deploy-mwdebug"
[17:40:18] thanks, joe
[17:40:48] that's better than the theoretical answer
[17:40:58] sometimes it will happen before scap even
[17:41:13] yeah, I was going to ask you to just point me to the code, but that is better :-D
[17:46:24] 15 minutes or so, less than a full scap, but I guess that is unfair to scap in terms of # of nodes
[17:46:51] surprisingly fast IMHO for such complexity still
[17:46:58] good work
[17:51:02] it should take less, actually, and it will improve over time
[17:51:46] today we hit an issue with the docker registry, so that could be a factor
[18:10:26] The incremental image build process is live now, so commits to operations/mediawiki-config and mediawiki/core (and submodules) should result in a new image within 2 minutes (for small changes, i.e. things that don't require an l10n rebuild)
[18:44:08] dancy: so I think we'll start deploying to k8s before scap happens :P
[18:45:50] dancy: is there any overview of the process for those of us who can't join wmf-only meetings?
[18:46:16] joe: Can we share the slide deck?
[18:46:41] dancy: sure, once we've cleaned out all the compromising comments :D
[18:46:47] haha
[18:46:48] ok
[18:46:56] we just can't put it on commons, sadly
[18:47:01] because I used non-free images
[18:47:24] :/
[18:49:16] majavah: https://people.wikimedia.org/~oblivian/An%20introduction%20to%20mw%20on%20kubernetes.pdf
[18:49:44] majavah: time constraints made me grab the first cake slice image I could find
[18:57:02] joe: wikitech doesn't have a fair-use violation police. Feel free to upload it there if you would like
[18:59:20] maybe tomorrow, now it's dinner time, I've had enough work for today