[00:02:09] ebernhardson I wrote up the incident here: T356941 . Feel free to add/edit as needed. Sorry for the trouble
[00:02:09] T356941: Stale data/failed queries on wikidatawiki index - https://phabricator.wikimedia.org/T356941
[00:07:12] snapshot is still going...will check on it tomorrow
[08:53:01] errand, have to pick up a replacement car, unsure how much time it'll take
[14:13:33] o/
[15:39:14] looks like the Elastic snapshot repository corrupted last night...maybe due to taking my 2 snapshots at the same time. Should be an easy fix, but we def need to monitor this and maybe have different buckets for CODFW and EQIAD
[15:41:03] I also see some chatter about swift issues, so that could be part of the problem too
[16:01:11] \o
[16:01:51] curious, i know elastic has a limitation that only one cluster is supposed to write to the swift container at a time, but i thought it was fine to be running multiple snapshots from the same cluster
[16:05:38] I think we want to keep the snapshot repo the same between different clusters...just have to be careful about snapshotting 2 at the same time. Or we could register multiple snapshot repos for each cluster, but I dunno if that's worth the effort
[16:06:22] created T357018 to dig in to it, LMK if you have a preference either way
[16:06:22] T357018: Fix/monitor Elastic S3 repository status - https://phabricator.wikimedia.org/T357018
[16:08:05] inflatador: seems reasonable. I suspect based on elastic's docs they expect you to have a container per cluster, but we only do manual snapshotting so it's probably fine
[16:08:52] o/
[17:12:40] lunch, back in ~30
[17:41:44] back
[18:21:51] hello - seeing a lot of new errors for cloudelastic atm https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-deploy-1-7.0.0-1-2024.02.08?id=atPxiY0BXQUFBRtCNFop
[18:22:16] "Received cirrusSearchElasticaWrite job for an unwritable cluster cloudelastic."
[18:22:48] hnowlan: hmm, i'll look into them but it's generally not a concern. In that case unwritable means "I'm not configured to allow writes there" and not "writes should be going there but aren't"
[18:24:13] the question would be how they are getting created when they shouldn't be, something isn't checking properly
[18:24:31] it looks like they're subsiding atm
[18:27:40] curiously they all have the same reqId, suggesting they all came from the same process. That sounds like the saneitizer, and they came in while it was running. We specifically changed that over so it wouldn't create writes, but will check again. probably buggy
[18:27:53] ah, okay
[18:28:19] Should those other errors from last week have subsided by now?
[18:28:54] hnowlan: hmm, which ones?
[18:31:03] ebernhardson: https://phabricator.wikimedia.org/T356526
[18:33:23] hnowlan: hmm, i would have expected them to, yes. Maybe that's related somehow to the current private ip migration? I'll need to look closer
[18:37:13] just finished removing/re-adding the elastic snap repo to all envs (codfw/eqiad/cloudelastic/relforge). I'm going to snap a small wiki to make sure it worked
[18:37:18] or small index, I should say
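For context, re-registering the snapshot repo and snapshotting a single index comes down to two calls against Elasticsearch's snapshot API, roughly as sketched below. The `elastic_snaps` repo name, the `wikiversity` snapshot name, and the eqiad endpoint appear in the log; the bucket, client, and index names here are placeholders rather than the production values.

```
# Re-register the S3/Swift-backed snapshot repository.
# "elastic-snapshots" and "default" are placeholder bucket/client names.
curl -XPUT -H "Content-Type: application/json" \
  "https://search.svc.eqiad.wmnet:9243/_snapshot/elastic_snaps" \
  -d '{"type": "s3", "settings": {"bucket": "elastic-snapshots", "client": "default"}}'

# Snapshot a single small index to confirm the repository is healthy again.
# "enwikiversity_content" is a placeholder index name.
curl -XPUT -H "Content-Type: application/json" \
  "https://search.svc.eqiad.wmnet:9243/_snapshot/elastic_snaps/wikiversity?wait_for_completion=false" \
  -d '{"indices": "enwikiversity_content", "include_global_state": false}'
```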
[18:39:26] ebernhardson: it started late on the 30th of January, and I notice there was something to do with the private IP migration happening around that time in SAL https://wikitech.wikimedia.org/wiki/Server_Admin_Log/Archive_75#2024-01-30
[18:41:32] `curl -XGET -H "Content-Type: application/json" "https://search.svc.eqiad.wmnet:9243/_snapshot/elastic_snaps/wikiversity/_status"` to check the new snap status
[18:45:59] do we know for sure the `Search backend error during get of {indexSuffix}.{docIds} after {tookMs}: {error_message}` came from Elastic? There is a discussion in #sre happening about lower edit rates for wikidata
[18:46:07] err...from Cloudelastic that is
[18:47:01] hnowlan: oh! i remember those now, actually those are a different problem not related to the ip migration. That is still waiting for the train to roll forward. It looks like the blocker was just closed so it might still roll today
[18:48:16] inflatador: that's not related to wikidata at least, those errors have been coming since the 30th
[18:48:20] ebernhardson: ah okay
[18:49:14] looks like the snapshot worked, going to try to restore to relforge
[18:49:32] hnowlan ACK, thanks for clarifying
[18:54:23] re-confirmed the default cluster patch. It's working as expected on testwiki, so should work on the main wikis once the train gets there
[18:55:57] cool, thanks!
[19:09:21] looks like the `wikiversity` snapshot restored OK to relforge
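A restore of that snapshot onto relforge would look roughly like the following. Only the repo and snapshot names come from the log; the relforge endpoint, index name, and rename settings are illustrative placeholders.

```
# Hypothetical restore of the wikiversity snapshot onto relforge.
# Endpoint, index name, and rename values are placeholders.
curl -XPOST -H "Content-Type: application/json" \
  "https://relforge-host.eqiad.wmnet:9243/_snapshot/elastic_snaps/wikiversity/_restore" \
  -d '{"indices": "enwikiversity_content", "include_global_state": false, "rename_pattern": "(.+)", "rename_replacement": "restored_$1"}'
```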
[19:34:26] gehel we're in SRE pairing if you wanna join
[19:59:15] medical appointment, back in ~90
[20:15:46] oops, sorry, I see the ping about the pairing session just now. Next time maybe...
[21:33:02] sorry, been back
[21:34:50] ryankemper hit me up if/when you need a review on https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/961878
[22:01:58] should be good, fixing up CI stuff after lunch
[22:15:06] I'm gonna go ahead and ban the cloudelastic hosts that need to migrate (cloudelastic1005*, cloudelastic1006*, cloudelastic1007*, cloudelastic1008*)...that leaves us with 6 hosts (2 public, 4 private). That should make it a little safer to do these reimages
[22:38:09] ebernhardson I saw the backfill patch was just merged, was it active yesterday? Because I started the ForceSearchUpdate script then...LMK if I need to start it again
[22:39:02] inflatador: hmm, backfilling was enabled yesterday here: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/998559
[22:40:15] ebernhardson ACK, PEBKAC error. Was looking too far back in the scrollback ;(
[22:40:27] i'm a repeat offender there
[22:49:17] cool, the script is still running. Ran it with these args: `--wiki wikidatawiki 2024-02-07T19:00:14Z 2024-02-07T23:00:14Z`
[22:57:20] cloudelastic QPS is falling off a cliff, I'm going to unban nodes and see if that helps https://grafana.wikimedia.org/d/000000460/elasticsearch-node-comparison?orgId=1&var-cluster=cloudelastic&var-exported_cluster=cloudelastic-chi&var-dcA=eqiad%20prometheus%2Fops&var-nodeA=cloudelastic1001&var-dcB=eqiad%20prometheus%2Fops&var-nodeB=cloudelastic1010&from=now-30m&to=now
[22:59:46] well...I guess that's nothing to worry about if I zoom out a bit
[22:59:49] https://grafana.wikimedia.org/d/000000460/elasticsearch-node-comparison?orgId=1&var-cluster=cloudelastic&var-exported_cluster=cloudelastic-chi&var-dcA=eqiad%20prometheus%2Fops&var-nodeA=cloudelastic1001&var-dcB=eqiad%20prometheus%2Fops&var-nodeB=cloudelastic1010&from=now-3h&to=now&viewPanel=23
[23:02:46] anyway, enough twiddling knobs for me...back tomorrow
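For reference, the ban/unban being toggled above (however it is triggered in practice) ultimately maps to a shard-allocation exclusion on the cluster. A rough sketch, using the host patterns from the log and a placeholder cloudelastic endpoint:

```
# Ban: exclude the hosts being reimaged from shard allocation.
# "cloudelastic-host" is a placeholder endpoint; the node patterns come from the log.
curl -XPUT -H "Content-Type: application/json" \
  "https://cloudelastic-host:9243/_cluster/settings" \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": "cloudelastic1005*,cloudelastic1006*,cloudelastic1007*,cloudelastic1008*"}}'

# Unban: clear the exclusion so shards can move back to those hosts.
curl -XPUT -H "Content-Type: application/json" \
  "https://cloudelastic-host:9243/_cluster/settings" \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'
```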