[10:53:44] inflatador: ryankemper: FYI i reimaged cloudelastic1007 to puppet7 without issue. if you see the same issue again can you please leave the box in that state so i can investigate
[11:24:56] lunch
[13:35:00] ebernhardson I thought you might like to know the blue/green deployment upgrade from Elasticsearch 6.5 to 6.8 appears to have gone flawlessly. I did reduce the Nginx virtual server's DNS refresh time from 60 to 20 seconds for the appropriate portion of the process. Nice to have a real production experience with this now. Thanks for the advice the
[13:35:01] other day.
[14:20:10] dcausse: ebernhardson: FYI: I created the bug report for the undelete behaviour, https://phabricator.wikimedia.org/T351411 after looking into the code, it seems more like a core wiki problem, not so much anything DPE is responsible for
[14:21:43] thanks
[14:44:06] low priority CR for wikidata ldf monitoring if anyone has a chance to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/974281/
[15:06:16] o/
[15:06:18] looking
[15:15:10] thanks jbond , reimaging the rest of the new hosts now (cloudelastic1008-10)
[15:22:14] FWiW I think we stumbled on why relforge wasn't sending stuff to logstash...reverted changes we made from https://phabricator.wikimedia.org/T324335 and looks like the messages are back
[15:29:09] inflatador: np
[15:48:25] errand, back later tonight
[16:01:39] \o
[16:03:20] pfischer: i was looking into it yesterday, but couldn't decide if the mw behaviour is wrong or not. Critically it seems like the undelete *does* work, and the revision ends up back in the revision history of the new page
[16:03:49] but it's odd because that's not typically how revisions are deleted (instead they have special protection flags).
[16:07:34] hi, I have sent a patch to update maven-javadoc-plugin from 3.2.0 to 3.3.0 which lets it discover the `javadoc` command. It got moved starting with Java 9+ https://gerrit.wikimedia.org/r/c/wikimedia/discovery/discovery-parent-pom/+/975003
[16:07:50] and I have no idea about the policy to update the dependencies nor what I should test :)
[16:09:20] hashar: makes two of us :)
[16:11:35] my guess is downstream projects bump the parent pom and catch up with whatever breakage might happen
[16:11:44] (or fix in this case *grin*)
[16:12:02] the pom dependency works by version, so they will only break once they update the dependency afaik. should be ok
[16:22:14] pfischer: i asked timo to comment on the ticket about what the expected behaviour is. Apparently this sequence of events is basically the manual version of Special:MergeHistory that merges two pages together
[16:36:51] There is some (unrelated) concern about the rev_id alignment, essentially that fetching the content for a rev_id older than recent is just adding extra work, and that the older revision won't be in parser cache
[16:41:44] tchin made this awesome alerts dashboard! https://dpe-alerts-dashboard.toolforge.org/
[16:42:27] pfischer: related convo (if you don't have a mediawiki-core scrollback): https://wm-bot.wmcloud.org/browser/index.php?start=11%2F16%2F2023&end=11%2F16%2F2023&display=%23mediawiki-core
[16:42:43] inflatador: that is pretty cool
[16:55:37] https://gitlab.wikimedia.org/tchin/dpe-alerts-web-scraper and https://gitlab.wikimedia.org/tchin/dpe-alerts-dashboard are the repos
[16:57:08] anyone know if there's a search interface for the public irc logs? i keep meaning to ask. <- there, i did it
[16:57:35] ebernhardson: thanks, that's helpful. Shall we observe how often that happens after all before implementing steps to handle this?
[16:58:18] dr0ptp4kt: hmm, i know at one point they were injected into an elasticsearch instance in cloud, maybe i can find it
[16:58:32] dr0ptp4kt there is, let me see
[16:59:05] ebernhardson inflatador that'd be spiffy. ebernhardson got a good time, in pacific time, to go through https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/529 ? if i'm not mistaken, https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/975022 may not need to be deployed in order to do that, but i didn't track down all of the tentacles.
[16:59:08] pfischer: i suspect we can mostly ignore them. It looks like if a page is restored to a page_id other than what it was originally that means a merge has happened, and in a merge the old revision isn't the latest so we have no need to update
[16:59:51] maybe with some code that intentionally ignores them instead of sending to fetch failure, not sure yet. For now we can certainly send to fetch failure and monitor
[16:59:59] dr0ptp4kt https://wm-bot.wmflabs.org/libera_logs/%23wikimedia-search/
[17:09:47] thx inflatador. if we don't have the stuff ingested into an es instance, i'm thinking maybe i could use a utility to crawl the subdirs and then just grep (although maybe it would be a good project for me to stand up es and ingest the things and have a simple ui!). i guess maybe the irc clients people use have nice search built in? https://www.irccloud.com/faq says some day irccloud will have it, but i think it's said that for a while
[17:09:58] oops, forgot breakfast. must eat
[17:10:41] my irc client writes to disk, and i use grep :P Perhaps surprisingly i have over a GB of plain text chat logs
[17:10:52] * ebernhardson didn't expect them to be quite so voluminous
[17:14:56] dr0ptp4kt if you wanna work together on that as a 10% time type project, I'd be game
[17:16:31] jbond looks like the key to reimage is to use the '--new' flag, that lets you choose puppet version. Without it, my reimages failed
[17:17:16] inflatador: failed how? also is it a new host?
[17:18:09] jbond the cookbook fails and the host is reachable via console with our root pw. I can leave one up for you if you like
[17:18:19] thx inflatador ... probably another day. ebernhardson: what are you using again, if you don't mind saying?
[17:18:34] dr0ptp4kt: weechat, but it's because i'm odd
[17:19:11] i don't like notifications or switching windows, weechat is a text based client. I leave the top ~8 lines in my tmux session showing the chat and i just look at it when i need a distraction :)
[17:20:20] i think most text based clients are similar though, i used to use irssi which did the same, but i switched to weechat because it has slack integration
[17:23:58] inflatador: yes please leave one up if you can
[17:23:59] cool
[17:24:37] dr0ptp4kt: you can download chat logs from irccloud and search them with grep or in your favorite text editor. Not great for searching recent stuff bc you'd have to re-download—and it's slow—but if you want to find an old conversation it works ok.
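A minimal sketch of the grep-over-exported-logs approach mentioned above, assuming the logs have been pulled down to a local directory; the path and search term here are made up:

    # recursively search a directory of exported plain-text IRC logs, case-insensitively, with line numbers
    grep -rin 'cloudelastic' ~/irclogs/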
[17:28:05] Trey314159: thanks, i see that now with the Settings cog, hiding in plain sight
[17:34:18] looks like irccloud went back to 2023-08-25 for the 'All networks' option.
[17:37:38] lunch, back in ~1h
[17:47:47] dr0ptp4kt: is that when you started using irccloud? or when you joined this channel? It only gives you logs of the channel when you were logged in (or when irccloud was logged in for you). It doesn't seem to make a generic archive of channels available, which is for the best.
[17:47:58] I just got my download and it goes back to 2015 on freenode
[17:49:26] Trey314159: no, i've had it for a long time. although after the whole fiasco with the other irc network, i had stopped for a long time. i had hopped on quite a bit earlier. maybe it was after i got randomly de-auth'd or netsplit or something.
[17:50:30] Hmm. Sorry it didn't have what you are after.
[17:50:35] i think i was back in in later july maybe, i guess i can look :)
[17:50:37] [2023-07-26 13:18:15] → Joined channel #wikimedia-search
[17:50:38] [2023-07-26 13:18:15] * Channel mode is +nt
[17:50:38] [2023-07-26 13:18:39] — dr0ptp4kt wanders back in
[17:50:52] i mean, a couple months back ain't bad
[17:51:10] so maybe it'll keep accumulating history now if i'm lucky
[17:51:58] er, a _few_ months
[18:26:47] dr0ptp4kt: there is a very much under advertised/under used feature of Stashbot related to your IRC log search questions. It stores IRC messages it sees in the Toolforge elastic cluster for "(an as yet unwritten IRC history search system)" that is still unwritten 8+ years after I had the idea.
[18:27:47] dr0ptp4kt: see https://bd808-test.toolforge.org/elastic7.php for the enumeration of the irc-* indexes which apparently go back to July 2015.
[18:28:25] these can be poked at via https://wikitech.wikimedia.org/wiki/Help:Toolforge/Elasticsearch#Read-only_access
[18:30:50] back
[18:45:25] hmm, i realize i don't know what our process will look like for rolling back the SUP consumer to a previous point in time. I can set the appropriate parameters in the helmfile values, but i guess we need to do two deployments in quick succession? First to run it with the parameter, and a second to remove the parameter?
[18:46:05] I guess my worry is the thing restarts for whatever reason, and the consumer will go back to the point-in-time a second time
[18:49:34] jbond cloudelastic1010 is up at DRAC...last time it failed, I power-cycled and could get into the console. DM me if you have any other questions about it
[18:51:13] inflatador: ack thanks i'll take a look tomorrow
[18:54:04] huh, interesting: [elastic_snaps] Could not read repository data because the contents of the repository do not match its expected state. This is likely the result of either concurrently modifying the contents of the repository by a process other than this cluster or an issue with the repository's underlying storage. The repository has been disabled to prevent corrupting its contents. To
[18:54:06] re-enable it and continue using it please remove the repository from the cluster and add it again to make the cluster recover the known state of the repository from its physical contents.
[18:54:52] that doesn't sound good
[18:55:20] which env/cluster is complaining?
[18:55:31] indeed :) I deleted the snapshot from a few days ago via elastic's snapshot api, then tried to take a new snapshot (with a different name) and it gave this error
[18:55:51] search.svc.eqiad.wmnet:9243, both delete and the new snapshot issued to same cluster (so not a weird thing with deleting from cluster A, and cluster B not seeing it)
[18:56:25] maybe relevant, the delete request didn't actually finish, nginx timed it out
[18:56:32] i'm assuming it still finished in the background
[18:59:14] the error does make me wonder though. The message is suggesting that each cluster should have a dedicated repository and we shouldn't let multiple things mutate the same repo? Can have multiple readers but should have a single writer. unclear
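On the multiple-readers question: Elasticsearch's snapshot docs recommend exactly that split, one writing cluster per repository, with any other cluster registering the same repo with "readonly": true. A rough sketch of what such a registration could look like (the bucket and client values here are placeholders, not the real repo settings):

    # on a cluster that should only ever read the shared repo
    curl -XPUT 'http://localhost:9200/_snapshot/elastic_snaps' \
      -H 'Content-Type: application/json' -d '{
        "type": "s3",
        "settings": {
          "bucket": "elastic-snapshots",
          "client": "default",
          "readonly": true
        }
      }'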
[18:59:58] fwiw other clusters seem to be able to read that repo fine, and they still see the thing i deleted. mhm
[19:00:55] I'm curious what it looks like on the swift side. Maybe if the same index is snapped from different clusters, it goes in the same "directory structure" and gets corrupted?
[19:03:24] iirc there is a metadata file of sorts in the top that says what snapshots are available, and then the individual snapshots in separate directories, but it's been awhile
[19:06:09] ahh, found the delete failure: [cluster:admin/snapshot/delete]]; nested: RepositoryException[[elastic_snaps] concurrent modification of the index-N file, expected current generation [15] but it was not found in the repository]; Caused by: RepositoryException[[elastic_snaps] concurrent modification of the index-N file, expected current generation [15] but it was not found in the
[19:06:11] repository]
[19:06:27] so prior to me taking the snapshot it had already failed in some way, not sure what this one means either :P
[19:10:02] * ebernhardson just does what it says and resets the snapshot, let's hope it doesn't repeat
[19:15:23] ACK, will document
[19:23:33] * ebernhardson is a bit surprised at how long the delete takes, it's still running
[19:24:06] i reset the repo (delete /_snapshot/elastic_snaps and re-submit the same config) then re-issued the delete, this time to localhost:9200 to avoid nginx
[19:26:12] maybe i was impatient, it finished this time and didn't bail out. Going to pretend that was some rare race condition i guess
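Since the reset is meant to get documented: it amounts to dropping and re-creating the repository *registration* only (the snapshot data in swift is untouched), then retrying the snapshot delete. A sketch of the steps described above, with the settings file and snapshot name as placeholders for the real values:

    # unregister the repository definition
    curl -XDELETE 'http://localhost:9200/_snapshot/elastic_snaps'
    # re-register it with the same settings document it was originally created with
    curl -XPUT 'http://localhost:9200/_snapshot/elastic_snaps' \
      -H 'Content-Type: application/json' -d @elastic_snaps_repo_settings.json
    # then re-issue the snapshot delete, against localhost to avoid the nginx timeout
    curl -XDELETE 'http://localhost:9200/_snapshot/elastic_snaps/<snapshot_name>'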
[19:27:07] :q
[19:28:52] gehel ryankemper sorry for late scratch, will not make pairing today. We can do it after 2PM PST
[19:29:25] inflatador: kk
[19:46:18] appointment, back in ~2h
[19:47:41] huh, maybe our problem is related to this: Sleeping for [3m] after modifying repository [elastic_snaps] because it contains snapshots older than version [7.6.0] and therefore is using a backwards compatible metadata format that requires this cooldown period to avoid repository corruption. To get rid of this message and move to the new repository metadata format, either remove all
[19:47:43] snapshots older than version [7.6.0] from the repository or create a new repository at an empty location.
[19:48:52] * dr0ptp4kt Thanks bd808 :)
[20:04:26] Hello, it looks like elasticsearch did a little pummeling of LVS about 30 minutes ago. Is this known? :)
[20:07:21] https://w.wiki/8B32
[20:08:29] elastic1055 seems to be a node involved
[20:54:32] brett: hmm, i ran a snapshot/restore between clusters that should have used swift as an intermediary, checking if it lines up
[20:55:21] for some reason i think that it talks directly to swift frontends though, not through lvs
[20:57:25] snapshot ran 19:25 - 19:51 UTC, lines up pretty well with the spike. That's configured with `"max_snapshot_bytes_per_sec": "20mb"` which should have avoided a spike that big :S
[20:58:02] i suppose it's 20mb * # of partitions (=30)
[20:59:38] brett: is there a reasonable limit i should be considering there? We can change the settings but i'm not sure what's appropriate
[21:00:23] although i suspect the better answer will be to not send most of a TB through LVS
[21:08:17] ebernhardson: I'm not sure! But we're paged when (rate(node_network_receive_bytes_total{device!="lo",cluster="lvs"}[5m]) * 8 / 1024 / 1024) >= 3200
[21:08:54] So maybe below that? :)
[21:09:18] sounds fair :) That's the total bytes received by all lvs instances?
[21:09:37] That's per LVS host
[21:09:59] ok, i'll run some numbers and see what's reasonable
[21:10:00] that's megabits/s
[21:10:07] okay! Thanks for doing that :)
[21:10:13] np
[21:21:09] LMK if you'd like me to create a phab task or something for ya
[21:22:19] back
[21:24:12] inflatador: is the snapshot repo config part of puppet? Poked around but didn't see it anywhere
[21:28:31] ebernhardson no, I think it's just from curl cmds...we should probably add it there though
[21:28:51] sounds like we've got a little math homework too ;{
[21:28:53] on some rough estimates, our largest index has 32 partitions so we have to expect 32*x. Current x=20, or 5Gbps. I suppose we should reduce 20 to about 8 which is 2Gbps
[21:30:03] but better would be to figure out how to have them talk to the swift frontends directly for uploads, it's never going to work great to push TB's of data through LVS. The architecture of that system is for small requests with big responses
[21:30:55] inflatador: ok, i suppose i can run a for loop over clusters to apply the 8mb setting
[21:31:45] partitions == shards?
[21:32:13] inflatador: yea
[21:32:47] inflatador: afaict the max_snapshot_bytes_per_sec is per-snapshotting process, and the snapshotting process runs on a per-shard basis. That also means we don't have an obvious per-server or cluster-wide throttle
[21:33:59] that's not great...maybe there is a way on the LVS side via user-agent, IP space or something
[21:35:28] sadly, notes from a team meeting somewhere in elastic about this setting: There was general agreement that using the same limit for restores and recoveries is a reasonable thing to do since they mainly try to protect disks and not the network with the setting.
[21:35:48] so, they had no intention of using this to limit network traffic, it just happens to do that
[21:38:23] Worst case scenario, I guess we could use a file-based snapshot and scp it around, but ughhh
[21:39:21] i think it'll be fine with limits, we just have to be conscious not to ask for more than ~32 partitions to be snapshotted at a time.
[21:39:36] * ebernhardson wonders what it would do if we ask it to snapshot thousands of shards
[21:39:49] not safe to test here :P
[21:40:45] * brett starts to sweat
[21:44:58] values updated, asking it to take the same snapshot it did before to verify this reduces the network stress
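The for loop over clusters would just re-PUT each repository registration with the lower throttle (re-registering an existing repo updates its settings in place). A sketch along those lines; hostnames, bucket and client names are placeholders, and 8mb per shard works out to roughly 32 shards × 8 MB/s × 8 ≈ 2 Gbit/s for the 32-shard index discussed above:

    # apply the lower per-shard throttle on every cluster that uses the repo
    for host in cluster-a.example.wmnet cluster-b.example.wmnet; do
      curl -XPUT "http://${host}:9200/_snapshot/elastic_snaps" \
        -H 'Content-Type: application/json' -d '{
          "type": "s3",
          "settings": {
            "bucket": "elastic-snapshots",
            "client": "default",
            "max_snapshot_bytes_per_sec": "8mb",
            "max_restore_bytes_per_sec": "8mb"
          }
        }'
    done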
[21:46:42] I'll get a ticket started. brett do you have any suggestion re: rate-limiting elsewhere?
[21:47:00] I'm afraid I don't :(
[21:49:46] oh meh, elasticsearch does incrementals so this verifies nothing :P recovery is still ongoing to relforge so i can't delete the snapshot. Will have to verify tomorrow i suppose
[21:50:35] maybe skipping LVS is the way to go, we can talk it over with Data Persistence and see what they think
[21:51:57] would have to see if elastic can do round robin, we also wouldn't want it to focus on a single swift frontend. The problem is the elastic side thinks it's talking to s3 which doesn't support any of that (it's all backend magic only aws engineers have to worry about :P)
[21:54:36] Been awhile, but I think Swift is like Elastic in that it will forward requests to the node that holds the object
[21:55:01] you still want to clamp down on bandwidth though
[21:55:34] inflatador: yes, that's the swift frontend/backend split. The frontends receive requests and route them to the backends similar to elastic
[21:55:44] the backends are basically big blobs of data iiuc
[21:55:57] indeed
[22:02:06] can we rate-limit thru envoy?
[22:04:56] hmm, i suppose that seems possible. Would be able to put a per-node limit
[22:05:24] not sure if envoy has anything for sharing limits between nodes, never looked close enough, but even just per-node would be reasonable
[22:07:43] looking at docs now
[22:22:25] Popped https://phabricator.wikimedia.org/T351475 for the issue
[22:25:21] school run, back in ~40
[22:43:51] maybe this'd work? Still trying to unpack https://www.envoyproxy.io/docs/envoy/v1.11.2/configuration/http_filters/rate_limit_filter#config-http-filters-rate-limit
[23:03:23] back
[23:03:59] inflatador: yea, global rate limiting sounds like what we'd be looking for
[23:05:09] relforge still restoring....really taking its time
[23:05:45] if we used this, I guess we'd register the s3 repo to the local envoy proxy, like http://localhost:123 ? and then have envoy connect upstream to thanos?
[23:06:01] inflatador: yea, although ideally have envoy connect to swift frontends
[23:06:55] that sounds doable. Wish envoy didn't use entire library paths in its config though ;(
[23:07:01] `'@type': type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager`
[23:07:44] lol
[23:10:07] doesn't look like the swift fe's have envoy :(
[23:10:55] oh I guess it doesn't matter, we could just add the swift fe's to the local envoy upstreams?
[23:11:05] i suppose general rate limiting fixes the concern with LVS, it's not the best solution and will need more restrictive limits, but will get the job done
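To make the per-node envoy idea above concrete, here is a rough sketch of what the wiring could look like. Everything in it is an assumption rather than a tested config: the listener port, the swift-frontend.example.wmnet upstream, and especially the bandwidth_limit filter, which only exists in reasonably recent Envoy releases and whose fields/units should be checked against the deployed version (the global rate limit filter linked above, backed by a shared ratelimit service, would be the cross-node alternative). The repository-s3 client on the elasticsearch side is then pointed at the local listener:

    # elasticsearch.yml on each node: send s3 repository traffic to the local envoy listener
    s3.client.default.endpoint: "127.0.0.1:10610"
    s3.client.default.protocol: http

    # envoy.yaml: per-node proxy that round-robins across swift frontends with a throughput cap
    static_resources:
      listeners:
      - name: s3_snapshot_proxy
        address:
          socket_address: { address: 127.0.0.1, port_value: 10610 }
        filter_chains:
        - filters:
          - name: envoy.filters.network.http_connection_manager
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
              stat_prefix: s3_snapshots
              route_config:
                virtual_hosts:
                - name: swift
                  domains: ["*"]
                  routes:
                  - match: { prefix: "/" }
                    route: { cluster: swift_frontends }
              http_filters:
              # hypothetical throughput cap; verify this filter and its unit in the deployed envoy
              - name: envoy.filters.http.bandwidth_limit
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.bandwidth_limit.v3.BandwidthLimit
                  stat_prefix: snapshot_bandwidth
                  enable_mode: REQUEST_AND_RESPONSE
                  limit_kbps: 262144   # ~256 MiB/s if interpreted as KiB/s; size against the 3200 Mbit/s page threshold
              - name: envoy.filters.http.router
                typed_config:
                  "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
      clusters:
      - name: swift_frontends
        type: STRICT_DNS
        lb_policy: ROUND_ROBIN
        load_assignment:
          cluster_name: swift_frontends
          endpoints:
          - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: swift-frontend.example.wmnet, port_value: 80 }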