[06:45:39] I am going to switch es5 codfw master
[06:53:35] jynus: let me know if/when I can reboot es2024
[08:56:17] so switch, so much stuff
[08:56:37] Yeah, I have had to switch a few parsercache hosts today too XD
[09:23:03] marostegui: backups start tonight at 0 UTC, so anything can be done until then
[09:23:37] cool, doing it now then
[09:36:41] marostegui: to avoid stepping on your toes: for today I only have maint on s7
[09:36:51] Amir1: go for it
[09:37:13] awesome
[09:37:22] Amir1: I am with s4 and s6 today only
[09:37:34] noted
[09:38:11] I might need to run the flaggedrevs cleanup on masters today but it shouldn't affect anything
[09:38:49] ok!
[09:39:32] (MysqlReplicationLag) firing: MySQL instance es2024:9104 has too large replication lag (14m 38s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=es2024&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[09:39:46] ^ me
[09:44:32] (MysqlReplicationLag) resolved: MySQL instance es2024:9104 has too large replication lag (15m 27s) - https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting#Depooling_a_replica - https://grafana.wikimedia.org/d/000000273/mysql?orgId=1&refresh=1m&var-job=All&var-server=es2024&var-port=9104 - https://alerts.wikimedia.org/?q=alertname%3DMysqlReplicationLag
[10:35:44] Amir1: when merging 817181 should I merge all grant change at once, one at a time, any preference?
[10:35:48] *changes
[10:36:02] no preference, whatever you like
[10:40:58] ok, then I will merge all grant changes at once and monitor in case something breaks because of it
[10:53:46] I will also check if db2093 breaks replication
[11:01:11] db2093 broke a few times, so I created the missing empty user and restarted replication
[11:01:26] *db2093's replication
[11:02:28] now comparing it to the file to detect more drifts
[11:20:39] Amir1: Is there any particular reason db1132 (10.6) wasn't rebooted by your script? Is it because it excludes 10.6 hosts for now?
[11:22:27] marostegui: nah, I think it just failed to depool it, it can happen, just less frequently (especially until we fix all the maint scripts)
[11:22:45] I have one question about grants differing between documented and live
[11:23:06] marostegui: if there are too many of them, let me know so I can take a look
[11:25:06] should the ulsfo, eqsin and esams prometheus hosts have access to eqiad and codfw dbs for metrics scraping? CC godog
[11:26:15] I'd say no, they shouldn't
[11:26:29] Amir1: sure, no problem
[11:26:33] yeah, I agree
[11:26:51] jynus: no, I don't think so either
[11:27:16] actually, sorry, I misnamed it - it is not for metrics scraping
[11:27:50] that is handled by the firewall and generated on localhost
[11:28:28] this is read access to the zarcillo db for generating scraping files
[11:28:33] *config files
[11:28:41] but I think the answer is probably no too
[11:28:46] I'm guessing it is for generating the list of dbs to get metrics from; if we don't do that already, we should limit the script that does it to eqiad and codfw
[11:29:19] I am not sure if it is limited or it just returns 0 hosts
[11:29:22] I will have to check
[11:29:22] (I'm checking)
[11:30:02] yeah I think we're fine to just remove access, we are not running the script in the pops anyway
[11:30:07] nice
[11:30:11] thank you, godog!
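(Aside: the exchange just above is about the prometheus hosts having read access to the zarcillo inventory database in order to generate scrape-target config files, and about limiting that access to eqiad and codfw. The actual generator script is not shown in this log; purely as a hedged sketch, such a generator could look roughly like the snippet below. The `servers` table and `hostname` column are taken from the db-mysql query quoted later in the log and the 9104 port from the alerts above; the pymysql client, the credentials file, the output path and the labels are assumptions, not the real implementation.)

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a zarcillo-driven Prometheus file_sd generator.
Table/column names and paths are guesses based on this log, not the real schema."""
import json

import pymysql  # assumption: a MySQL client library is available on the prometheus hosts


def fetch_targets(inventory_host="db1115", database="zarcillo"):
    """Return 'host:port' scrape targets for every server in the inventory."""
    conn = pymysql.connect(host=inventory_host, database=database,
                           read_default_file="/etc/mysql/prometheus.cnf")  # hypothetical read-only credentials
    try:
        with conn.cursor() as cur:
            # 'servers'/'hostname' appear in the db-mysql query later in this log;
            # 9104 is the mysqld-exporter port seen in the alerts above.
            cur.execute("SELECT hostname FROM servers")
            return ["%s:9104" % row[0] for row in cur.fetchall()]
    finally:
        conn.close()


def write_file_sd(targets, path="mysql_targets.json"):
    """Write the targets in Prometheus file_sd format (a JSON list of target groups)."""
    with open(path, "w") as fh:
        json.dump([{"targets": targets, "labels": {"job": "mysql"}}], fh, indent=2)


if __name__ == "__main__":
    write_file_sd(fetch_targets())
```

Under this reading, limiting the grant to eqiad and codfw simply means only the core-site prometheus hosts run a generator like this against zarcillo, which matches the decision above to drop the access for the pops.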
[11:30:29] sure np, was easy enough
[11:30:39] so I will remove it, or comment it out for "when we have dbs on pops" (which may be never)
[11:30:58] *nod* just yanking it is fine
[11:31:28] godog: one last thing, let me make sure the script is running well on the prometheus hosts with the grant cleanup
[11:33:13] SGTM jynus
[11:34:31] no errors on prometheus[12]00[56], thank you godog, that's all!
[11:35:24] I will update grants/db_inventory.sql.erb to remove the non-main edge references
[11:35:33] (this was already the live state)
[11:37:05] Ugh, I found another failure mode of pristine-tar that the current test suite missed (good thing I tried some more tests); some versions of it _incorrectly_ unquote paths before storing them in the delta's manifest and then _incorrectly_ unquote those paths again when producing a tarball (which the delta is produced against), and make extra empty dirs in said tarball based on that double-incorrect-unquoted path that it couldn't find in the source directory
[11:44:17] will merge this https://gerrit.wikimedia.org/r/c/operations/puppet/+/853950 as it is an easier change (a noop for live production)
[11:45:34] Emperor: if you need examples of weird file names on swift, let me know, I know a few 0:-D
[15:37:06] Amir1: what I didn't fully understand is whether, to fix ['db2217'] ['db1150', 'db2117'], they have to be added to hieradata or removed from zarcillo?
[15:39:56] I will fix it, no worries
[15:40:48] ok, I thought that, as they were some backup sources
[15:40:54] you were asking for help
[15:41:14] or at least one was
[16:02:47] one was?
[16:02:57] not sure I'm following
[16:06:28] db1150, I think, is a backup source
[16:06:46] so I was trying to understand what the request was
[16:09:07] jynus: db1150 is a backup source
[16:09:09] but
[16:09:12] ladsgroup@cumin1001:~$ sudo db-mysql db1115 -e "use zarcillo; select * from servers where hostname = 'db1150';"
[16:09:12] ladsgroup@cumin1001:~$
[16:10:02] ok, that helps
[16:12:04] the server was deleted from zarcillo, as I have it on an older copy. I will re-add it
[16:15:44] so I added it now
[16:16:12] I have the addition of foreign keys pending there, which could help minimize some of these errors
[17:55:07] Emperor: to be clear - I would like to see what you commented on the ongoing research at T322424; your comments during the meeting were very useful and hopefully they can be written down there too
[17:55:07] T322424: Commons/multimedia errors, caused by repeated swift (cascading?) failures, late 2022 - https://phabricator.wikimedia.org/T322424
[17:57:54] Hm, I think that ticket has the headline items at least
[17:58:59] (it might be useful to have a look through the code to see if it matches our working theory, but I'm not sure when/if I'm likely to have the time/brain to do so)
[18:02:42] (or did you mean you wanted the other actionables from the incident review added to that ticket?) [in any case, I need to go and think about something more cheerful than the state of Swift for a bit now]
[18:03:40] no, I mean you explained in detail what happened during the meeting. for now, just having that explanation in the task header or in a comment would be useful
[18:08:42] either there or at https://wikitech.wikimedia.org/wiki/Incidents/2022-11-04_Swift_issues and linked from the ticket
[18:17:33] there is probably some good middle ground between the current incident status and, e.g., https://wikitech.wikimedia.org/wiki/Incidents/2022-08-16_x2_databases_replication_breakage 0:-)
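(Aside on the pristine-tar message at 11:37 above: pristine-tar itself is Perl, so the snippet below is only a Python illustration of the double-unquoting class of bug it describes, using a made-up file name. Unquoting a name that literally contains a percent sequence once already corrupts the path stored in the delta's manifest, and unquoting it a second time when the tarball is regenerated yields a name that no longer exists in the source directory, which is how the extra empty directories end up in the output.)

```python
from urllib.parse import unquote

# A file whose on-disk name literally contains a percent-encoded sequence
# (made-up example, not taken from the actual failing package).
literal_name = "weird%2520name.txt"

stored_in_manifest = unquote(literal_name)          # 'weird%20name.txt' - first incorrect unquote
looked_up_at_rebuild = unquote(stored_in_manifest)  # 'weird name.txt'   - second incorrect unquote

# The rebuilt path no longer matches anything in the source directory, so a tool
# behaving as described above would emit an empty directory entry for it instead.
assert looked_up_at_rebuild != literal_name
print(stored_in_manifest, "->", looked_up_at_rebuild)
```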