[06:45:39] <marostegui>	 I am going to switch es5 codfw master
[06:53:35] <marostegui>	 jynus: let me know if/when I can reboot es2024
[08:56:17] <Amir1>	 so switch, so much stuff
[08:56:37] <marostegui>	 Yeah, I have had to switch a few parsercache hosts today too XD
[09:23:03] <jynus>	 marostegui: backups start tonight at 0 UTC, so anything can be done until then
[09:23:37] <marostegui>	 cool, doing it now then
[09:36:41] <Amir1>	 marostegui: to avoid stepping on your toes. For today I have only maint on s7
[09:36:51] <marostegui>	 Amir1: go for it
[09:37:13] <Amir1>	 awesome
[09:37:22] <marostegui>	 Amir1: I am with s4 and s6 today only
[09:37:34] <Amir1>	 noted
[09:38:11] <Amir1>	 I might need to run flaggedrevs clean up on masters today but it shouldn't affect anything
[09:38:49] <marostegui>	 ok!
[09:39:32] <jinxer-wm>	 (MysqlReplicationLag) firing: MySQL instance es2024:9104 has too large replication lag (14m 38s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=es2024&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:39:46] <marostegui>	 ^ me
[09:44:32] <jinxer-wm>	 (MysqlReplicationLag) resolved: MySQL instance es2024:9104 has too large replication lag (15m 27s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=es2024&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[10:35:44] <jynus>	 Amir1: when merging 817181 should I merge all grant change at once, one at a time, any preference?
[10:35:48] <jynus>	 *changes
[10:36:02] <Amir1>	 no preferences, what you like
[10:40:58] <jynus>	 ok, then merging all grant changes at once and monitoring should be done in case something breaks because of it
[10:53:46] <jynus>	 I will also check if db2093 breaks replication
[11:01:11] <jynus>	 db2093 broke a few times, so I created the missing empty user and restarted replication
[11:01:26] <jynus>	 *db2093's replication
[11:02:28] <jynus>	 now comparing it to the file to detect more drifts
[11:20:39] <marostegui>	 Amir1: For any particular reason db1132 (10.6) wasn't rebooted by your script? Is it cause it excludes 10.6 hosts for now?
[11:22:27] <Amir1>	 marostegui: nah, I think it just failed to depool it, it can happen, just less frequency (specially until we fix all maint scripts)
[11:22:45] <jynus>	 I have one question about grants differing between documented and live
[11:23:06] <Amir1>	 marostegui: if it's too many of them, let me know so I take a look
[11:25:06] <jynus>	 should ulsfo, eqsin and esams prometheus hosts have access to eqiad and codfw dbs for metrics scrapping? CC godog
[11:26:15] <Amir1>	 I'd say no. they shouldn't
[11:26:29] <marostegui>	 Amir1: sure, no problem
[11:26:33] <jynus>	 yeah, I agree
[11:26:51] <godog>	 jynus: no I don't think so either
[11:27:16] <jynus>	 actually, sorry, I missnamed- it is not for metrics scrapping
[11:27:50] <jynus>	 that is handled by the firewall and generated on localhost
[11:28:28] <jynus>	 this is read access to the zarcillo db for generating scraping files
[11:28:33] <jynus>	 *config files
[11:28:41] <jynus>	 but I think probably the answer is no too
[11:28:46] <godog>	 I'm guessing for generating the list of dbs to get metrics from, if we don't to that already we should then limit the script that does that to eqiad and codfw
[11:29:19] <jynus>	 I am not sure if it is limited or it just returns 0 hosts
[11:29:22] <jynus>	 I will have to check
[11:29:22] <godog>	 (I'm checking)
[11:30:02] <godog>	 yeah I think we're fine to just remove access, we are not running the script in pops anyways
[11:30:07] <jynus>	 nice
[11:30:11] <jynus>	 thank you, godog!
[11:30:29] <godog>	 sure np, was easy enough
[11:30:39] <jynus>	 so I will remove it, or comment it for "when we have dbs on pops" (which may be never)
[11:30:58] <godog>	 *nod* just yanking it is fine
[11:31:28] <jynus>	 godog: one last thing, let me make sure the script is running well on prometheus hosts with the grant clean up
[11:33:13] <godog>	 SGTM jynus 
[11:34:31] <jynus>	 no errors on prometheus[12]00[56], thank you godog, that's all!
[11:35:24] <jynus>	 I will update grants/db_inventory.sql.erb to remove the non-main edges references
[11:35:33] <jynus>	 (this was the state live already)
[11:37:05] <Emperor>	 Ugh, I found another failure mode of pristine-tar that the current test suite missed (good thing I tried some more tests); some versions of it _incorrectly_ unquote paths before storing them in the delta's manifest and then _incorrectly_ unquote those paths again when producing a tarball (which the delta is produced against) and make extra empty dirs in said tarball based on that double-incorrect-unquoted path that it couldn't find 
[11:37:05] <Emperor>	 the source directory
[11:44:17] <jynus>	 will merge this https://gerrit.wikimedia.org/r/c/operations/puppet/+/853950 as it is an easier change (noop for live production)
[11:45:34] <jynus>	 Emperor: if you need examples of weird file names on swift, let me know, I know a few 0:-D
[15:37:06] <jynus>	 Amir1: what I didn't fully understood is if to fix ['db2217'] ['db1150', 'db2117'] they are have to be added to hieradata or removed from zarcillo?
[15:39:56] <marostegui>	 I will fix it, no worries 
[15:40:48] <jynus>	 ok, I thought as they were some backup sources
[15:40:54] <jynus>	 you were asking for help
[15:41:14] <jynus>	 or at least one was
[16:02:47] <marostegui>	 one was?
[16:02:57] <marostegui>	 not sure I'm following 
[16:06:28] <jynus>	 db1150 I think it is a backup source
[16:06:46] <jynus>	 so I was trying to understand what was the request to do it
[16:09:07] <Amir1>	 jynus: db1150 is a backup source
[16:09:09] <Amir1>	 but
[16:09:12] <Amir1>	 ladsgroup@cumin1001:~$ sudo db-mysql db1115 -e "use zarcillo; select * from servers where hostname = 'db1150';"
[16:09:12] <Amir1>	 ladsgroup@cumin1001:~$ 
[16:10:02] <jynus>	 ok, that helps
[16:12:04] <jynus>	 the server was deleted from zarcillo, as I have it on an older copy. I will readd it
[16:15:44] <jynus>	 so I added it now
[16:16:12] <jynus>	 I have pending addition foreign keys there that could help minimize some errors
[17:55:07] <jynus>	 Emperor: to be clear- I would like to see what you commented on ongoing research at T322424, your comments during the meeting were very useful and hopefully it can be written there too
[17:55:07] <stashbot>	 T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424
[17:57:54] <Emperor>	 Hm, I think that ticket has the headline items at least
[17:58:59] <Emperor>	 (it might be useful to have a look through the code to see if it matches our working theory, but I'm not sure when/if I'm likely to have time/brain to do so)
[18:02:42] <Emperor>	 (or did you mean you wanted the other actionables from the incident review added to that ticket?) [in any case, I need to go and think about something more cheerful than the state of Swift for a bit now]
[18:03:40] <jynus>	 no, I mean you explained in detail what happened during the meeting. for now, just having that explanation on the header or on a comment would be useful
[18:08:42] <jynus>	 either there or at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-04_Swift_issues and linked in the ticket
[18:17:33] <jynus>	 there is probably some good middle term between the current incident status and, eg. https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_x2_databases_replication_breakage 0:-)