[09:58:25] Emperor: o/
[09:59:36] I am debugging some issues between the docker registry and swift, all the context is in https://phabricator.wikimedia.org/T390251 but TL;DR is that sometimes the registry seems to return a binary blob (related to a layer) that is not expected, causing dockerd to fail while doing the sha256 checks
[09:59:52] it autoresolves, and it seems to happen at random times
[10:00:09] usually it happens after a certain mw image is pushed (as part of a deployment)
[10:00:24] basically, push-then-pull doesn't work immediately
[10:00:57] if I want to review the swift logs, where should I start in your opinion? I am checking the frontend proxies but that's probably not enough
[10:01:05] (no rush, when you have a moment)
[10:07:19] elukey: is this thanos-swift or ms-swift?
[10:09:36] (if ms-swift it's worth checking whether the uploaded content in both clusters is identical)
[10:09:52] Emperor: ms-swift (I assume commons/mediawiki, right?)
[10:10:06] we push only to codfw, the registry in eqiad is depooled
[10:10:07] yes, ms-swift is what backs commons
[10:10:33] ah, right OK.
[10:12:11] so the frontends log to /var/log/swift/proxy-access.log (and sometimes server.log, especially on errors); if you find the log line(s) you're interested in, they will have transaction IDs, which can be useful as they will appear in e.g. the backend logs too
[10:13:06] okok, so the frontends are the right place to check
[10:13:07] With the container and object name it's possible to look up in the rings where the underlying object is stored, so one could then check that the 3 replicas are identical.
[10:13:19] I am basically doing journalctl -u swift-proxy.service
[10:13:41] ahhh interesting
[10:13:41] The underlying replication is async, but I would expect the primary copy to be the one that serves subsequent requests, so that shouldn't matter.
[10:14:22] When I'm investigating upload tickets, I usually end up with some horror like
[10:14:24] cumin -x --force --no-progress --no-color -o txt O:swift::proxy "zgrep -F 'wikipedia-commons-local-public.8e/8/8e/Falstaff-Szene_A1885.jpg' /var/log/swift/proxy-access.log.2.gz" >~/junk/T389539_second
[10:14:56] thanks :D
[10:15:09] beware that object names are double-url-encoded in the proxy log
[10:15:39] so if you have interesting characters in your object name then something like
[10:15:44] python3 -c "import urllib.parse ; print(urllib.parse.quote(urllib.parse.quote('Falstaff-Szene_A1885.jpg')))"
[10:15:46] IYF
[10:17:35] (do let me know if you need any more cursed knowledge, err, I mean, swift debugging tips)
[10:19:05] I will :D
[10:27:53] I found a lot of interesting things, like HTTP 499 (IIUC clients giving up before swift finishes returning data) - is the access log format documented somewhere, so I can identify the fields/values?
[10:32:07] yes, give me a tick
[10:33:15] even this afternoon is fine
[10:33:16] elukey: https://docs.openstack.org/swift/queens/logs.html has the log format, but note that you need to add 5 to the field number, e.g. cut -d ' ' -f 6 for the first field
[10:34:08] because our log lines start with Mon Day Time hostname proxy-server:
[10:34:24] right right, perfect
[10:34:28] thanks again!
[10:34:46] NP :)
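Pulling the tips above into one place, a minimal sketch of the trace-one-object workflow; the object name and transaction ID below are made-up placeholders, and the exact backend log locations may differ per host:

    # encode the object name first (names are double-url-encoded in proxy-access.log)
    ENC=$(python3 -c "import urllib.parse; print(urllib.parse.quote(urllib.parse.quote('some-layer-blob')))")
    # find the request(s) on a frontend and note the transaction ID in each line
    zgrep -F "$ENC" /var/log/swift/proxy-access.log.2.gz
    # then follow the same transaction ID into server.log and the backend logs
    grep -r 'tx0123456789abcdef01234-0067ee2a61' /var/log/swift/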
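For the "are the 3 replicas identical" check: swift-get-nodes, run somewhere with the rings under /etc/swift, resolves an account/container/object to its primary nodes and prints ready-made curl and ssh commands per replica. The account, container, object, and backend URL below are hypothetical stand-ins for what it actually prints:

    # resolve the object to its primary nodes (names are placeholders)
    swift-get-nodes /etc/swift/object.ring.gz AUTH_example example-container 'some/layer/blob'
    # it prints a HEAD curl per node; drop -I/-XHEAD to fetch the body and compare digests
    curl -s "http://ms-be1234.codfw.wmnet:6000/sdb1/12345/AUTH_example/example-container/some/layer/blob" | sha256sum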
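And for tallying those 499s alongside the other status codes: assuming the stock format from the queens docs, status_int is upstream field 7, so it lands in field 12 after the five-field syslog prefix described above:

    # count proxy responses by status code; 499 = client gave up before swift finished
    awk '{ print $12 }' /var/log/swift/proxy-access.log | sort | uniq -c | sort -rn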
[12:13:42] 4 billion rows read per second in s3 this morning, very normal behavior https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s3&var-role=All&from=now-12h&to=now&viewPanel=8
[12:16:20] writes were elevated too. Which means I need to go dumpster diving in the binlogs https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=All&var-shard=s3&var-role=All&from=now-12h&to=now&viewPanel=7
[12:21:40] urandom: FYI restbase2025 is alerting about disk space
[12:23:02] Emperor: ok; thanks
[14:43:37] hey, i'm seeing read-only errors on wmcs things on m5, and the timing of https://phabricator.wikimedia.org/T391237 seems very suspicious
[14:43:52] it seems like the proxy (dbproxy1029) still sees db1228 as down?
[14:45:16] marostegui: i think the haproxy reload step from https://wikitech.wikimedia.org/wiki/HAProxy#Failover was missed?
[15:21:25] taavi: correct, doing it now
[15:21:48] Thanks for the heads up
[15:23:16] taavi: done
[15:23:23] We don't have many master failures, so it is easy to forget :(
[18:09:17] do you need me to start causing some more?
[18:28:01] xD
[19:38:52] taavi: we are hiring!
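On the binlog dumpster-diving at 12:16: one way in is a hedged sketch using mysqlbinlog to decode row events over the window where the spike shows in Grafana; the binlog path, file name, and times are placeholders:

    # run on the s3 primary; file name and window are made up
    mysqlbinlog --base64-output=decode-rows --verbose \
        --start-datetime="2025-04-08 06:00:00" --stop-datetime="2025-04-08 09:00:00" \
        /srv/sqldata/db1234-bin.002345 | less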
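And on the missed step at 14:45: once the proxy config points at the new master, the fix is a reload, which can then be verified over the admin socket; the socket path here is an assumption, check the local haproxy.cfg:

    # on the dbproxy host: pick up the new backend config without dropping connections
    systemctl reload haproxy.service
    # confirm the backend is no longer marked down (socket path varies by config)
    echo "show servers state" | socat stdio /run/haproxy/haproxy.sock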