[09:58:25] Emperor: o/
[09:59:36] I am debugging some issues between the docker registry and swift, all the context is in https://phabricator.wikimedia.org/T390251 but TL;DR is that sometimes the registry seems to return a binary blob (related to a layer) that is not expected, causing dockerd to fail while doing the sha256 checks
[09:59:52] it autoresolves, and it seems to happen at random times
[10:00:09] usually it happens after a certain mw image is pushed (as part of a deployment)
[10:00:24] basically, push-then-pull doesn't work immediately
[10:00:57] if I want to review the swift logs, where should I start in your opinion? I am checking the frontend proxies but that's probably not enough
[10:01:05] (no rush, when you have a moment)
[10:07:19] elukey: is this thanos-swift or ms-swift?
[10:09:36] (if ms-swift it's worth checking whether the uploaded content in both clusters is identical)
[10:09:52] Emperor: ms-swift (I assume commons/mediawiki, right?)
[10:10:06] we push only to codfw, the registry in eqiad is depooled
[10:10:07] yes, ms-swift is what backs commons
[10:10:33] ah, right OK.
[10:12:11] so the frontends log to /var/log/swift/proxy-access.log (and sometimes server.log, especially on errors); if you find the log line(s) you're interested in, they will have transaction IDs, which can be useful as they will appear in e.g. the backend logs too
[10:13:06] okok, so the frontends are the right place to check
[10:13:07] With the container and object name it's possible to look up in the rings where the underlying object is stored, so one could then check that the 3 replicas are identical.
[10:13:19] I am basically doing journalctl -u swift-proxy.service
[10:13:41] ahhh interesting
[10:13:41] The underlying replication is async, but I would expect the primary copy to be the one that serves subsequent requests, so that shouldn't matter.
[10:14:22] When I'm investigating upload tickets, I usually end up with some horror like
[10:14:24] cumin -x --force --no-progress --no-color -o txt O:swift::proxy "zgrep -F 'wikipedia-commons-local-public.8e/8/8e/Falstaff-Szene_A1885.jpg' /var/log/swift/proxy-access.log.2.gz" >~/junk/T389539_second
[10:14:56] thanks :D
[10:15:09] beware that object names are double-url-encoded in the proxy log
[10:15:39] so if you have interesting characters in your object name then something like
[10:15:44] python3 -c "import urllib.parse ; print(urllib.parse.quote(urllib.parse.quote('Falstaff-Szene_A1885.jpg')))"
[10:15:46] IYF
[10:17:35] (do let me know if you need any more cursed knowledge, err, I mean, swift debugging tips)
[10:19:05] I will :D
[10:27:53] I found a lot of interesting things, like HTTP 499 (IIUC clients giving up before swift finishes returning data) - is the access log format documented somewhere, so I can identify the fields/values?
[10:32:07] yes, give me a tick
[10:33:15] even this afternoon is fine
[10:33:16] elukey: https://docs.openstack.org/swift/queens/logs.html has the log format, but note that you need to add 5 to the field number, e.g. cut -d ' ' -f 6 for the first field
[10:34:08] because our log lines start with Mon Day Time hostname proxy-server:
[10:34:24] right right, perfect
[10:34:28] thanks again!
[10:34:46] NP :)
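Pulling the tips above into one place, a minimal sketch of the trace-one-object workflow; the object name and transaction ID below are made-up placeholders, and the exact backend log locations may differ per host:

    # encode the object name first (names are double-url-encoded in proxy-access.log)
    ENC=$(python3 -c "import urllib.parse; print(urllib.parse.quote(urllib.parse.quote('some-layer-blob')))")
    # find the request(s) on a frontend and note the transaction ID in each line
    zgrep -F "$ENC" /var/log/swift/proxy-access.log.2.gz
    # then follow the same transaction ID into server.log and the backend logs
    grep -r 'tx0123456789abcdef01234-0067ee2a61' /var/log/swift/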
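For the "are the 3 replicas identical" check: swift-get-nodes, run somewhere with the rings under /etc/swift, resolves an account/container/object to its primary nodes and prints ready-made curl and ssh commands per replica. The account, container, object, and backend URL below are hypothetical stand-ins for what it actually prints:

    # resolve the object to its primary nodes (names are placeholders)
    swift-get-nodes /etc/swift/object.ring.gz AUTH_example example-container 'some/layer/blob'
    # it prints a HEAD curl per node; drop -I/-XHEAD to fetch the body and compare digests
    curl -s "http://ms-be1234.codfw.wmnet:6000/sdb1/12345/AUTH_example/example-container/some/layer/blob" | sha256sum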
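And for tallying those 499s alongside the other status codes: assuming the stock format from the queens docs, status_int is upstream field 7, so it lands in field 12 after the five-field syslog prefix described above:

    # count proxy responses by status code; 499 = client gave up before swift finished
    awk '{ print $12 }' /var/log/swift/proxy-access.log | sort | uniq -c | sort -rn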
[12:13:42] 4 billion rows read per second in s3 this morning, very normal behavior https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=core&var-shard=s3&var-role=All&from=now-12h&to=now&viewPanel=8
[12:16:20] writes were elevated too. Which means I need to go dumpster diving in the binlogs https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=All&var-shard=s3&var-role=All&from=now-12h&to=now&viewPanel=7
[12:21:40] urandom: FYI restbase2025 is alerting about disk space
[12:23:02] Emperor: ok; thanks
[14:43:37] hey, i'm seeing read-only errors on wmcs things on m5, and the timing of https://phabricator.wikimedia.org/T391237 seems very suspicious
[14:43:52] it seems like the proxy (dbproxy1029) still sees db1228 as down?
[14:45:16] marostegui: i think the haproxy reload step from https://wikitech.wikimedia.org/wiki/HAProxy#Failover was missed?
[15:21:25] taavi: correct, doing it now
[15:21:48] Thanks for the heads up
[15:23:16] taavi: done
[15:23:23] We don't have many master failures, so it is easy to forget :(
[18:09:17] do you need me to start causing some more?
[18:28:01] xD
[19:38:52] taavi: we are hiring!
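On the binlog dumpster-diving at 12:16: one way in is a hedged sketch using mysqlbinlog to decode row events over the window where the spike shows in Grafana; the binlog path, file name, and times are placeholders:

    # run on the s3 primary; file name and window are made up
    mysqlbinlog --base64-output=decode-rows --verbose \
        --start-datetime="2025-04-08 06:00:00" --stop-datetime="2025-04-08 09:00:00" \
        /srv/sqldata/db1234-bin.002345 | less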
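And on the missed step at 14:45: once the proxy config points at the new master, the fix is a reload, which can then be verified over the admin socket; the socket path here is an assumption, check the local haproxy.cfg:

    # on the dbproxy host: pick up the new backend config without dropping connections
    systemctl reload haproxy.service
    # confirm the backend is no longer marked down (socket path varies by config)
    echo "show servers state" | socat stdio /run/haproxy/haproxy.sock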