[04:41:10] <marostegui>	 releases1002 is full on its /srv/docker: https://phabricator.wikimedia.org/T288024
[05:30:44] <joe>	 marostegui: did anyone clean it?
[05:31:10] <marostegui>	 joe: I haven't
[05:31:15] <joe>	 ack
[05:31:23] <marostegui>	 joe: I don't really know what can or cannot be deleted there
[05:33:38] <joe>	 marostegui: yeah don't worry, I'll act on it
[05:33:56] <marostegui>	 grazie
[05:51:47] <moritzm>	 FYI; I opened a task to propose dropping the "Long running screen/tmux" alert. if you have opinions in either direction please follow up on https://phabricator.wikimedia.org/T288028
[05:52:50] <moritzm>	 I can't think of a single time this wasn't alerting on Icinga and can't think of a single time in recent months where this found a legitimate issue worth alerting about
[05:59:40] <joe>	 moritzm: I think i first proposed to drop it the week after it was introduced :}
[06:03:35] <moritzm>	 ha :-)
[08:17:32] <ema>	 so, I'm trying out jbond's work https://gerrit.wikimedia.org/r/c/operations/puppet/+/692286 on my self-hosted puppetmaster, but I get this error:
[08:17:35] <ema>	  Error while evaluating a Resource Statement, Evaluation Error: Unknown function: 'puppetdb_query'
[08:18:07] <ema>	 I see others on the internet had the same problem, eg: https://puppet-users.narkive.com/29efkKcx/puppet-users-puppetdb-query-missing-from-agent and https://tickets.puppetlabs.com/browse/PDB-3655
[08:18:45] <ema>	 anyone has an idea about where this function may be coming from, and how to get it on my wmfcloud puppet environment? 
[08:19:05] <majavah>	 do you have puppetdb setup on your self-hosted puppetmaster?
[08:21:12] <majavah>	 https://wikitech.wikimedia.org/wiki/Help:Standalone_puppetmaster/PuppetDB
[08:21:47] <ema>	 majavah: thanks
[08:39:37] <godog>	 obligatory pontoon plug, please reach out if you'd like to try an easier way for puppetmaster + puppetdb in general
[09:40:59] <jynus>	 godog, were you possibly doing some prometheus maintenance around 08:08 ?
[09:41:40] <godog>	 jynus: yes that's correct, I upgraded prometheus on prometheus1003
[09:41:55] <jynus>	 the "bad certificate" error came back at that time "Aug 04 08:08:57 backup1004 minio[2621]: http: TLS handshake error from 10.64.0.123:41564: remote error: tls: bad certificate"
[09:42:03] <jynus>	 for minio
[09:42:16] <jynus>	 was something changed on certs or something?
[09:42:39] <godog>	 not afaik
[09:43:07] <jynus>	 let me research- the issue was fixed because I forgot to add the fqdn on the request
[09:43:14] <jynus>	 but it came back at that time
[09:43:18] <jynus>	 on all hosts
[09:44:08] <godog>	 interesting, it is possible prometheus changed behaviour if/when sending certs
[09:44:21] <godog>	 let me know what you find, I can help too in case
[09:44:28] <jynus>	 but if it was a larger issue, there are other ssl checks
[09:44:41] <godog>	 task for the upgrade is https://phabricator.wikimedia.org/T222113 FWIW
[09:44:42] <jynus>	 so that is weird
[09:44:53] <jynus>	 thanks, that is exactly what I needed
[09:45:08] <jynus>	 I will debug like last time and see what I can find
[09:45:31] <jynus>	 could be another issue I created but only shows on maintenance or something
[09:52:55] <jynus>	 I am confused- curl works and the configuration hasn't changed
[09:54:32] <jynus>	 but I think it has to be something on prometheus because it didn't failed on each prometheus host at the same time
[09:57:34] <godog>	 yeah pretty sure validation on the prometheus side must have changed
[09:57:47] <jynus>	 :-(
[09:58:03] <jynus>	 even if I run curl as the prometheus user, the scrapping works
[09:58:24] <godog>	 but I'm confused as to why availability is 0%, is prometheus1004 failing too ?
[09:58:43] <godog>	 ah no nevermind, I know why
[09:58:44] <jynus>	 it is not, at least I only get errors from 3 on log
[09:59:07] <godog>	 yeah that makes sense, I'll take a look
[10:00:12] <jynus>	 maybe it is a simple thing like a race condition on puppet run and the puppet cert wasn't loaded?
[10:00:36] <jynus>	 my guess on what is different is that the other https endpoints use certgen
[10:00:54] <jynus>	 or other cert
[10:01:52] <jynus>	 let me report my findings on that ticket- this is not critical for me at the time (I am not running backups right now)
[10:02:51] <godog>	 SGTM
[10:03:20] <jynus>	 so plese don't leave everyhing for this, but this will be important later on
[10:03:49] <godog>	 jynus: FYI the error on the prometheus side is indeed with the cert
[10:03:51] <godog>	 Get "https://backup1004.eqiad.wmnet:9000/minio/v2/metrics/cluster": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0
[10:04:05] <godog>	 I'll update the task as well and then need to run to lunch
[10:04:07] <jynus>	 ah
[10:04:28] <jynus>	 I don't mind if I have to change how I do certs on client
[10:04:28] <moritzm>	 elukey: the ROCM repository key is expired, which blocks repo updates, could you figure out what the new key is and update it?
[10:05:53] <godog>	 jynus: yeah looks like SAN is a requirement now
[10:07:38] <jynus>	 cool, thanks that is super-helpful
[10:07:56] <elukey>	 moritzm: sure, do you need it now (namely blocking things) or after lunch is ok?
[10:08:09] <jynus>	 we can talk later, I don't mind adapting on my side, I just had to know the new requirements 0:-)
[10:08:30] <godog>	 jynus: sounds good, thank you! ttyl
[10:09:06] <elukey>	 should be https://repo.radeon.com/rocm/apt/debian/dists/xenial/ in theory
[10:09:07] <moritzm>	 elukey: thanks! any time today is fine
[10:09:43] <moritzm>	 this is needed for the gitlab import, but can wait a few hours for sure
[11:26:49] <jayme>	 as heads up: I'll be running docker pull tests in eqiad again. This time from 73 parallel servers max (excluding the ones causing sessionstore alerts this time ;-))
[12:47:51] <elukey>	 moritzm,jelto - updated the ROCm gpg key, and  tried "reprepro --noskipold --component thirdparty/gitlab checkupdate buster-wikimedia" on apt1001, all works afaics
[12:48:19] <elukey>	 (no errors related to ROCm etc.. displayed)
[12:48:32] <elukey>	 I hope the upgrade is unblocked, lemme know otherwise!
[13:22:05] <joe>	 jayme: heads up, I'm disabling puppet on all appservers
[13:22:11] <joe>	 applying a change to apache
[13:22:39] <jayme>	 joe: okay. I'm just running my stuff in eqiad, no puppet changes planned
[13:22:48] <joe>	 ack
[13:22:53] <joe>	 very curious about the results
[13:22:55] <joe>	 btw
[13:23:02] <joe>	 use a recent mw image if possible
[13:24:25] <jayme>	 I actually wanted to use the same as for all other tests
[14:34:40] <godog>	 jynus: re: minio certs yeah I think the simplest (maybe not easiest) is to use certs with fqdn in SAN, that's my understanding of what would make it work
[14:35:52] <godog>	 in other words have the fqdn in SAN in addition to (or replace) CN
[14:49:13] <joe>	 godog: are we running minio?
[14:49:21] <joe>	 I wasn't aware :)
[14:49:24] <Emperor>	 moritzm: I've just added my key to pwstore ; can you add me to .users please? [if you'd rather I emailed about this, happy to... ]
[14:51:28] <godog>	 joe: indeed, used for object storage backups only afaik
[14:51:47] <joe>	 oh ok :)
[14:54:33] <moritzm>	 Emperor: sure thing, I've added your key and re-encrypted the secrets, let me know if you should run into any issues
[14:55:30] <kormat>	 Emperor`: be very very glad we no longer require cross-signing of gpg keys via public key servers. that was hell.
[14:55:46] <Emperor>	 stupid power blip. 
[14:56:15] <Emperor>	 kormat: I just signed my WM ID with my Debian key :)
[15:02:50] <Emperor>	 moritzm: trying to fetch one of the trusted keys isn't working (I got yours fine): "gpg: key 37E9B5C6F5F6A067: new key but contains no user ID - skipped"
[15:03:02] <Emperor>	 when I do gpg --recv-keys B049B180212E42A7AFB4D43337E9B5C6F5F6A067
[15:04:54] <moritzm>	 these days the keys are all fetched from the keys/ directory within the repo, but most older keys are still on the SKS key servers
[15:05:15] <moritzm>	 pwstore from wmf-sre-laptop is patched to fetch them from the keys directory
[15:05:41] <moritzm>	 the 37E9B5C6F5F6A067 is the one by mutante
[15:05:48] <moritzm>	 (Daniel Zahn)
[15:09:33] <Emperor>	 moritzm: right; I did pws update-keyring (seems OK) but pws ed management (or anything else) says "Warning: gpg returned non-zero exit status 2 when decrypting rt."
[15:12:17] <Emperor>	 ...but I can decrypt secrets by hand from that repo
[15:13:33] <Emperor>	 (my pws is from wmf-sre-laptop)
[15:16:29] <moritzm>	 hmmh, not sure what that is, works for me. but if you can "gpg --decrypt rt" the addition of your key and the re-encryption worked fine
[15:16:41] <Emperor>	 yeah, that works.
[15:17:09] <moritzm>	 maybe any non-standard options in gpg.conf messing with what pwstore expects to parse or so, not sure
[15:17:20] <kormat>	 gpg, so easy to use
[15:18:46] <Emperor>	 Huh, I straced it and it worked.
[15:18:58] <vgutierrez>	 it performs under pressure
[15:19:25] <Emperor>	 ...and now it works without strace. This is going to be a gpg race, isn't it?
[15:20:12] <Emperor>	 Anyhow, now seems to be behaving itself
[15:24:24] <moritzm>	 it might be ad hoc gpg agent that gets started since gpg 2.0? if you ran gpg --decrypt manually in between, the agent will now be around. maybe try killing it and then re-run "pws ed rt" to see if you can repro?
[15:24:34] <jayme>	 joe: numbers look nice https://people.wikimedia.org/~jayme/pulltiming_dragonfly_73_nodes.html
[15:25:41] <Emperor>	 moritzm: that's it, yes - if I kill the gpg agent, I then get the failure mode again. And then if I decrypt something by hand, that restarts the agent and then pws works again
[15:27:29] <rzl>	 Emperor: I'm pleased to see the pwstore setup is as usable as ever :) welcome aboard
[15:27:50] <vgutierrez>	 proper linux as desktop experience
[15:31:05] <joe>	 jayme: most importantly, you didn't bring down production
[15:31:18] <moritzm>	 ok, that explains it. if there's an option to force the startup of the agent as part of the gpg invocation in pws, we can patch this in wme-sre-laptop (or submit to https://github.com/weaselp/pwstore/ where our patched version is based on)
[15:33:31] <Krinkle>	 marostegui: soft ping regarding https://phabricator.wikimedia.org/T263127 - looking at planning for August, do you expect this to happen in August?
[15:34:11] <marostegui>	 Krinkle: I might but only for s6
[15:37:19] <rzl>	 (oops, on rereading, that joke came off reading a lot meaner than I meant it to -- sorry about that)
[15:38:18] <joe>	 rzl: and we were here, hopeful to have finally corrupted your soul
[15:41:13] <Krinkle>	 jayme: that's a nice use of plotly! Did you add the plot invocation there yourself or was that generated with a UI of sorts? Seems like that might save me some trouble if the latter (and it's nicely standalone it seems, no network reqs)
[15:43:22] <jayme>	 Krinkle: it's just a bit of potentially horrible python scripting https://phabricator.wikimedia.org/P15954
[15:44:33] <jayme>	 joe: yeah - and that :-) registry traffic does peak around 10MB/s still while overall traffic in the P2P network peaks at ~1.5GB/s
[15:45:10] <Krinkle>	 jayme: oh, so there's a plotly python lib that generates a standalone html file for you? that's pretty cool.
[15:45:23] <Krinkle>	 I was looking around in their online UI if there was like an "Export as html file" option, which there doesnt' appear to be.
[15:45:48] <jayme>	 Krinkle: yeah. The files are pretty big, though.
[15:46:19] <Krinkle>	 the whole page is smaller than a typical favicon.ico file though
[15:46:30] <Krinkle>	 but yeah from a clean html perspective, it's large.
[15:46:57] <jayme>	 what? :D the page is 3.4M 
[15:46:57] <Krinkle>	 (which, of course, says more about favicons than plotly, but alas..)
[15:47:06] <jayme>	 I hope no favicon is that big
[15:47:19] <Krinkle>	 jayme: oh, wow, okay I misread that. Yes, that is big.
[15:47:57] <Krinkle>	 it's 0.9M on the wire. I recall that being smaller in the past..
[15:48:00] * Krinkle checks other plots
[15:48:20] <Krinkle>	 I was thinking something like 100K or so
[15:49:04] <Krinkle>	 darn, no, their "regular" distribution is that size indeed now
[15:49:06] <Krinkle>	 https://cdn.plot.ly/plotly-2.2.0.min.js
[16:08:40] <dancy>	 jayme: I'm currently having a problem with 'docker push' of a large image to the registry.  It uploads about 5GB of the 6GB, then says "Retrying in 10 seconds", then it restarts.  It has done this several times so far.
[16:08:43] <dancy>	 No error message
[16:09:37] <joe>	 not sure jayme's still around
[16:10:33] <jayme>	 actually I'm just about to leave. But feel free to write something down somwhere and I can take a look/reproduce tomorrow 
[16:11:05] <dancy>	 Will do. Thanks.
[16:11:06] <jayme>	 if that's okay...if it's urgend, I can potentially look in a couple of hours
[16:11:46] <jayme>	 o/
[16:12:47] <joe>	 dancy: can you retry now?
[16:13:18] <dancy>	 It's running now
[16:13:42] <dancy>	 and it stopped and is in the retry wait again... and now pushing again..
[16:14:35] <joe>	 I can't find a good reason in the logs of the registries
[16:15:06] <joe>	 ok found one
[16:17:07] <dancy>	 on the sending side (dockerd): msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"
[16:17:29] <joe>	 yeah I think the problem is at the nginx layer
[16:17:46] <dancy>	 I was leaning toward blaming NGINX 
[16:18:22] <joe>	 hah
[16:18:23] <joe>	 2021/08/04 16:13:19 [crit] 24078#24078: *4325 pwrite() "/var/lib/nginx/body/0000001575" failed (28: No space left on device), client: 10.64.48.17, server: , request: "PATCH
[16:18:25] <joe>	 /v2/restricted/mediawiki-multiversion/blobs/uploads/857c8f0d-93f8-4af8-a530-147a057ae3ff?_state=2ZLT3RW5zpsfQhQJu-KVnsohlzknQowPCgsXpRMH_UR7Ik5hbWUiOiJyZXN0cmljdGVkL21lZGlhd2lraS1tdWx0aXZlcnNpb24iLCJVVUlEIjoiODU3YzhmMGQtOTNmOC00YWY4LWE1MzAtMTQ3YTA1N2FlM2ZmIiwiT2Zmc2V0IjowLCJTdGFydGVkQXQiOiIyMDIxLTA4LTA0VDE2OjA1OjAyLjY0ODY0MDM4MVoifQ%3D%3D HTTP/1.1", host: "docker-registry.discovery.wmnet"
[16:19:06] <dancy>	 that'll do it
[16:20:36] <joe>	 try now :)
[16:22:29] <dancy>	 restarting...
[16:22:34] <joe>	 we had just 5 G of space left on disk
[16:23:30] <joe>	 because we started caching blobs in nginx (only binary, sha256-based urls, so it's unique urls and we don't have to worry about evictions, btw)
[16:24:17] <joe>	 so I just went and took the Attila approach: find . -type f -delete
[16:24:30] <dancy>	 Seek and destroy
[16:25:58] <joe>	 I'm going to take a break before the two hours of meeting in which I'm presenting stuff 😱
[16:26:21] <dancy>	 ok!
[16:31:48] <dancy>	 failed again. :-(
[16:35:01] <joe>	 dancy: sigh, I was mistaken. /var/lib/nginx is a separate directory
[16:35:53] <joe>	 err partition
[16:35:56] <joe>	 and it's 1 GB
[16:36:03] <joe>	 so I guess one layer got over that size
[16:36:06] <kormat>	 👀
[16:36:17] <joe>	 yeah no idea why it was done this way
[16:37:03] <kormat>	 joe: knowing ~partman~, i'd be willing to bet this could be a recipe not doing what someone expected
[16:37:25] <joe>	 kormat: possibly; for now I'll just unmount it
[16:37:33] <joe>	 dancy: gimme 5 minutes more
[16:37:55] <dancy>	 thx joe.  I'm in a meeting now and the retries are happening automatically so I'm not stuck at the moment.
[16:39:21] <kormat>	 joe: /var/lib/nginx is _tmpfs_
[16:39:34] <dancy>	 oh man.
[16:40:02] <joe>	 kormat: yeah, in the meantime I've noticed
[16:40:10] <joe>	 that was likely done for performance
[16:40:24] <joe>	 so I am thinking of enlarging it a bit for now
[16:43:53] <joe>	 ok bumped to 2 G now
[16:44:29] <kormat>	 inb4 dancy's 6G image has 1 layer which is 2.0001 GB
[16:46:18] <joe>	 kormat: no the layer adapts to the size of buffers in nginx
[16:46:33] <kormat>	 :D
[16:53:40] <dancy>	 success!
[16:55:56] <joe>	 (scare quotes implied)
[16:56:14] <joe>	 dancy: I'm relieved, it just took me 3 attempts to get it right
[17:39:07] <jynus>	 one last question, joe, sorry, is there any delay between current mw deploy and image rebuild? I guess that will take time?
[17:40:05] <joe>	 jynus: look at sal, the delay between the scap deploy and the deploy to k8s by "deploy-mwdebug"
[17:40:18] <jynus>	 thanks, joe 
[17:40:48] <joe>	 that's better than the theoretical answer
[17:40:58] <joe>	 sometimes it will happen before scap even
[17:41:13] <jynus>	 yeah, I was to ask you to just point me to code, but that is better :-D
[17:41:19] <jynus>	 *going
[17:46:24] <jynus>	 15 minutes or so, less than a full scap, but I guess it is unfair for scap in terms of # of nodes
[17:46:51] <jynus>	 surprisingly fast IMHO for such complexity still
[17:46:58] <jynus>	 good work
[17:51:02] <joe>	 it should take less, actually, it will improve over time
[17:51:46] <joe>	 today we hit an issue with the docker registry, so that could be a factor
[18:10:26] <dancy>	 The incremental image build process is live now so commits to operations/mediawiki-config and mediawiki/core (and submodules) should result in a new image within 2 minutes (For small changes,.. things that don't require l10n rebuild)
[18:44:08] <joe>	 dancy: so I think we'll start deploying to k8s before scap happens :P
[18:45:50] <majavah>	 dancy: is there any overview of the process for us that can't join wmf-only meetings?
[18:46:16] <dancy>	 joe: Can we share the slide deck ?
[18:46:41] <joe>	 dancy: sure, once we cleaned out all the compromising comments :D
[18:46:47] <dancy>	 haha
[18:46:48] <dancy>	 ok
[18:46:56] <joe>	 we just can't put it on commons sadly
[18:47:01] <joe>	 because I used non-free images
[18:47:24] <majavah>	 :/
[18:49:16] <joe>	 majavah: https://people.wikimedia.org/~oblivian/An%20introduction%20to%20mw%20on%20kubernetes.pdf
[18:49:44] <joe>	 majavah: time constraints made me search the first cake slice image I could find
[18:57:02] <bd808>	 joe: wikitech doesn't have an fair-use violation police. feel free to upload there if you would like
[18:59:20] <joe>	 maybe tomorrow, now it's dinner time, I've had enough work for today