[05:29:09] I am going to disable writes on es4
[05:29:11] To failover its master
[07:02:09] all done
[07:02:14] and writes are back on es4
[07:02:17] I am going to do the same for es4
[07:02:19] es5
[08:04:18] es5 fully done
[09:24:12] x1 backups went -6.6%, I believe this is what Amir warned us about
[09:28:24] yeah most likely
[09:28:43] I am going to push this https://gerrit.wikimedia.org/r/c/operations/puppet/+/922476 which should be a noop, and then fix gtid_domain_id on db_inventory which should be pretty straightforward
[09:47:40] heh I broke replication on db_inventory
[09:47:46] messed up a transaction
[09:47:50] it is so easy to break things
[09:47:53] I will get it fixed
[10:10:05] better here than in a different section!
[10:13:35] the sequential, one thread write speed of our backup hosts seems to be ~1.1GiB/s (which confirms the 10 Gbit/s network is worth investing in, even for writes)
[10:15:07] oh, wait, I haven't yet saturated the file cache
[10:15:11] let's wait until that happens
[10:19:49] yeah, exactly that's why I am going with sections where I can catch errors right away :)
[10:20:31] it is around the same, as the graph showed real io activity, not the os write speed, including cache
[10:21:51] bets on what will happen if I start writing on a second thread?
[10:23:07] my guess is a 50 or 25% increase in io activity
[10:33:28] I was more or less right, a second sequential thread increased write throughput by 50%-60%
[10:38:42] I think I am going to have to reclone codfw db_inventory host
[10:38:54] Fixing the data is insanely crazy with all the writes orchestrator does
[10:41:01] Going to stop orchestrator to copy its database (38M)
[10:45:38] 👌
[10:46:07] let me know if you want to recover from backups, the logic backup is from tonight
[10:46:31] and we can test that the script works
[10:48:27] oh sorry
[10:48:29] I finished already :(
[10:48:37] nah, that's a good thing
[10:48:48] as it was just 38M I thought it would be a matter of like 3 minutes to do the whole thing
[10:49:03] 👍
[10:49:20] I think I need some gtid rest
[10:49:28] It is so intense to work with those numbers and positions
[11:28:20] In case this finishes when I am not around, ignore any issue arising from full disk at backup2010 today or tomorrow: https://grafana.wikimedia.org/goto/KIAiqVQVz?orgId=1
[11:52:19] Emperor: there's https://gerrit.wikimedia.org/r/c/operations/alerts/+/812883 for your attention if you wouldn't mind taking a look, it'll replace modules/profile/manifests/swift/alerts.pp and I'll merge/deploy if it looks good, I'm also looking at it in the context of T288196
[11:52:21] T288196: Retire Prometheus 'global' instance - https://phabricator.wikimedia.org/T288196
[11:59:06] 👀
[12:05:29] cheers
[18:05:52] urandom ottomata how do I go about listing the content of a swift (thanos) bucket? I wanted to inspect flink checkpoints. I followed the doc in https://wikitech.wikimedia.org/wiki/Swift/How_To#List_containers_and_contents (set up ST_AUTH & c)
[18:06:13] swift list failed with
[18:06:14] Auth GET failed: https://thanos-swift.discovery.wmnet/auth/v1.0 401 Unauthorized [first 60 chars of response] b'<html><h1>Unauthorized</h1><p>This server could not verify t'
[18:06:14] Failed Transaction ID: txff83cbba269341619baca-00646cff87
[18:17:38] gmodena: you exported `ST_AUTH`, `ST_USER`, and `ST_KEY`?
[18:18:24] gmodena: btw, beware that doing so will expose that in your bash history if you do it interactively
[18:19:06] urandom I exported them from bashrc
[18:20:02] gmodena: and you can echo them? (sorry, have to ask :))
[18:20:42] (and I appreciate that storing secrets in there is not ideal either, I remove as soon as I finish checking. Promise)
[18:20:57] urandom asking is good :). I can echo them (I did not try ST_KEY)
[18:21:10] what are AUTH & USER?
[18:21:26] ST_USER=mediawiki-event-enrichment:mediawiki-event-enrichment
[18:21:35] ST_AUTH=https://thanos-swift.discovery.wmnet/auth/v1.0
[18:21:51] that should be: mw-event-enrichment:prod (USER)
[18:22:08] ah!
[18:23:02] urandom that was it. Many thanks!
[18:23:10] no worries :)
[19:33:10] urandom: o/ just double checking
[19:33:10] https://phabricator.wikimedia.org/T330693#8841909
[19:33:24] we should be using "mediawiki-event-enrichment", or "mw-event-enrichment:prod" ?
[19:45:07] ottomata: the account name is mediawiki-event-enrichment, the username (there can be an arbitrary number of users for an account) is mw-event-enrichment:prod
[19:46:38] okay, and we should use the user name to authenticate? (as indicated by your convo ^)
[19:46:38] ?
[19:46:47] ya
[19:46:56] okay thanks
[19:47:30] I didn't mention the username in the ticket (mostly because I didn't mention the key, for obvious reasons)
[19:48:54] oh that's in private too right okay, we should probably set that value from private then
[19:49:06] right
[19:50:33] I should have been clearer about the account/user nomenclature, this isn't the first time it's created confusion
[19:55:21] urandom: I don't actually see the user names in private either. I just see the entry in profile::thanos::swift::accounts_keys which has a key of mw_event_enrichment
[19:55:22] ?
[19:57:30] oh yeah, you're right. That key maps to the user definition in hieradata/common/profile/thanos/swift.yaml
[19:59:08] * urandom sighs
[19:59:39] ok, sorry about that...that makes the username pretty obscure
[19:59:50] no worries
[19:59:51] hm
[19:59:55] that file says
[19:59:59] account_name: 'AUTH_mw-event-enrichment'
[20:00:07] yeah, I just noticed that too
[20:01:53] I think that was an 11th hour change, out of concern for the potential length of those names. Shortening 'mediawiki' to 'mw' as a guard against future length issues
[20:02:05] we had to do the same for k8s namespace names
[20:02:13] 11th hour as well :)
[20:03:49] I'm actually happy to know that I didn't in fact use the full length version for the account, and a short form for the username, as that phab comment would have suggested
[20:05:17] same, in hindsight that seems better
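
Note on the `ST_AUTH` / `ST_USER` / `ST_KEY` exchange above: a minimal Python sketch of a pre-flight check one could run before `swift list`. It only reads the three variables named in the log and builds the headers that a v1.0 (tempauth-style) auth GET carries. The function names (`check_swift_env`, `v1_auth_request`, `redacted`) are hypothetical helpers for illustration, not part of any Swift client, and nothing is sent over the network.

```python
import os

REQUIRED = ("ST_AUTH", "ST_USER", "ST_KEY")


def check_swift_env(env):
    """Return a list of human-readable problems with the Swift client variables."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"{name} is not set")
    user = env.get("ST_USER", "")
    # Assumption: v1.0 (tempauth-style) users take the form "<account>:<user>".
    if user and user.count(":") != 1:
        problems.append("ST_USER should look like '<account>:<user>'")
    return problems


def v1_auth_request(env):
    """Build the URL and headers for a v1.0 auth GET, without sending it."""
    problems = check_swift_env(env)
    if problems:
        raise ValueError("; ".join(problems))
    headers = {"X-Auth-User": env["ST_USER"], "X-Auth-Key": env["ST_KEY"]}
    return env["ST_AUTH"], headers


def redacted(headers):
    """Copy of the headers that is safe to print: the key is masked."""
    return {k: ("***" if k == "X-Auth-Key" else v) for k, v in headers.items()}


if __name__ == "__main__":
    try:
        url, headers = v1_auth_request(os.environ)
    except ValueError as err:
        print(f"not ready: {err}")
    else:
        print(url, redacted(headers))
```

A shape check like this would not have caught the problem in the log: `mediawiki-event-enrichment:mediawiki-event-enrichment` is well formed, just not the configured user, so the server still answers 401. It does catch unset variables and avoids echoing the key while debugging.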