[09:44:28] errand
[10:42:29] Morning Search people! I remember asking you nicely in fall of 2022 if you might be able to give the WMDE team working on Wikibase.cloud a short introduction (like an hour) to "search + search infrastructure" and it sounded like gehel said it might be possible. Would that be something that you might still be able to fit in? Cheers!
[10:47:36] tarrow: yes, I'm sure it's possible
[11:00:36] cool! That would be amazing; how would it be best to plan it? Want me to email someone with manager in their title? Don't want to tread on any toes or take up time you don't have :)
[11:01:48] tarrow: absolutely possible! Do you know more precisely what you would like to cover?
[11:02:15] Send me an email with the list of potential attendees and I'll schedule some time
[11:03:53] tarrow: or if you want to quickly jump in a meet, we can discuss the details
[11:04:08] I have a few minutes before the kids are home for lunch
[11:04:29] gehel: that would be amazing! PM me a meet link?
[11:06:03] meet.google.com/xfr-gjiw-pdw
[11:06:17] dcausse: feel free to join as well if you're interested
[11:28:11] lunch + errands
[11:28:18] lunch 2
[12:37:52] dcausse: Sent quite the essay in email form; if you have any questions at all about our setup, let me know. The individual components as well as the infrastructure as code are all open, but if you want a guide through the mountains please don't hesitate to ask :)
[13:13:57] tarrow: thanks! might be interesting to also share the size of your indices, perhaps the output of _cat/shards?
[13:38:20] dcausse: alert firing about WDQS latency above 10' (see #wikimedia-operations)
[13:38:31] yes saw that
[13:38:38] Looking into the graphs, I find this concerning: https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1&viewPanel=14&from=now-90d&to=now
[13:38:41] something's going on but not sure I understand what
[13:39:33] yes, thanos usage needs to be looked at, but it's probably unrelated to the current status
[13:40:45] the graphs on eqiad and codfw are quite different. not sure why.
[13:41:16] Oh no, I'm just looking at different time frames
[13:42:59] pfischer: the failure modes here might be interesting to you, as they might relate to the failure modes we could see on the Search Pipeline
[13:44:29] wdqs codfw (both internal and external) are lagging by > 45'. Should we depool?
[13:45:02] With the risk of crashing eqiad with the added load (we have new servers on the way to increase capacity, but they're not there yet)
[13:46:03] yes we should depool codfw
[13:46:16] I'm on it
[13:47:12] if I understand correctly we lost a whole row?
[13:49:34] * gehel is still reading the backlog
[13:49:48] looks like both wdqs codfw clusters have already been depooled
[13:50:22] ok so now it's backend "noise"
[13:53:21] all 3 codfw clusters are red so we're likely rejecting updates
[13:54:17] ^ elasticsearch you mean?
[13:54:30] yes
[13:54:39] we're missing 10 nodes
[13:56:01] how come we're going red if we lost a single row? We should always have replicas across multiple rows.
[13:56:08] inflatador: just in time for the party!
[13:56:25] sounds like it ;(
[13:56:50] seems like we just have to wait for the network issue to be resolved :/
[13:56:58] yeah, not much to do
[13:57:14] eqiad is OK though?
[13:58:10] Checking the main ES cluster, the red indices are still the 4 titlesuggest indices that were already red before. Lots of yellow indices, but that's entirely expected if we lose 10 nodes
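The _cat/shards output mentioned earlier is also handy in this situation: it shows per-shard state and on-disk size, so a single call answers both "which shards are unassigned after losing a row" and "how big are the indices". A minimal sketch, assuming direct HTTP access to a cluster; the endpoint below is a placeholder, not the production URL:

    import requests

    # Sketch only: summarise shard state and per-index primary size via the
    # standard _cat/shards API. Replace the placeholder endpoint as needed.
    ES = "http://localhost:9200"

    resp = requests.get(
        f"{ES}/_cat/shards",
        params={"format": "json", "bytes": "b", "h": "index,shard,prirep,state,store"},
        timeout=30,
    )
    resp.raise_for_status()
    shards = resp.json()

    # Shards that are not STARTED (e.g. UNASSIGNED after losing a row of nodes)
    for s in shards:
        if s["state"] != "STARTED":
            print(f"{s['state']:>12}  {s['index']} shard {s['shard']} ({s['prirep']})")

    # Primary on-disk size per index, largest first
    sizes = {}
    for s in shards:
        if s["prirep"] == "p" and s["store"]:
            sizes[s["index"]] = sizes.get(s["index"], 0) + int(s["store"])
    for index, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size / 1024 ** 3:8.2f} GiB  {index}")

The same data is what `_cat/shards?v` prints as a plain-text table; asking for JSON just makes it easier to aggregate.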
[13:58:37] yes only eqiad
[13:58:43] I mean eqiad is fine
[13:59:45] at this point, I'm more worried about WDQS and how it is going to recover from this
[14:00:23] we were red before? Could it be a reindexing/aliasing thing?
[14:00:57] it's not running, most probably stuck in a restart loop, so hopefully it'll recover once all services are back up
[14:02:14] inflatador: yes, some failed indices, I need to clean them up
[14:02:40] the red titlesuggest indices are probably a missing cleanup during a previous reindexing. I checked the red indices on the main cluster and they were all titlesuggest with no active aliases. Not a big concern (still needs to be cleaned up, and maybe have a look at the stability of that update job).
[14:02:41] FWIW, wdqs data reload is ~80% complete in eqiad, 10% complete in codfw
[14:03:30] it's very probable that we failed a bunch of writes in elastic@codfw
[14:04:46] we'll need to run a reindex for that period
[14:08:00] Let's wait until things settle before doing anything.
[14:08:29] inflatador: could you start an etherpad or phab task to track what needs to be addressed once the network is stable again?
[14:09:19] gehel on it
[14:10:42] identified so far:
[14:10:43] * run a reindex for search over the problematic period
[14:10:43] * check that wdqs internal and external are repooled in both DCs
[14:10:43] * check that ES@codfw is fully green and has the expected number of nodes (maybe minus the 4 red indices already identified)
[14:10:54] starting to see recoveries in operations chat
[14:11:11] * check that the WDQS updater has recovered as expected
[14:12:17] dcausse: do you know why the WDQS updater would crash when losing a row? I would expect the k8s pods to be rescheduled on a different host and things to continue working.
[14:13:13] gehel: if it cannot talk to any of its dependent services it'll crash: k8s api, swift, kafka
[14:14:19] I don't have a clear list of what crashed, but I would expect kafka and swift to be resilient to losing a row (I might be naive)
[14:14:28] no idea about the k8s apis
[14:14:54] so your hypothesis is that flink failed as a cascading effect of a failed dependency?
[14:15:00] we can look at the logs but I'm not surprised that it's crashing
[14:15:13] yes
[14:15:58] ok, no action on our part to improve that stability then. We can't expect to have a working updater if it can't talk to kafka.
[14:16:23] Let's see if it recovers on its own
[14:16:49] new task as requested: https://phabricator.wikimedia.org/T327175
[14:17:03] Seeing recoveries for streaming updater in operations
[14:17:35] inflatador: thanks!
[14:21:44] next question, who wants to do what? I'll delete the red indices
[14:26:10] inflatador: yep, deleting those indices will at least reduce alerting noise. Good thing during this kind of outage!
[14:27:20] inflatador: I'm assuming that you and ryankemper should take care of most of the steps in that phab task. With the exception of "Verify WDQS streaming updater is healthy", where David is probably needed.
[14:27:39] Note that the WDQS clusters need to be repooled, not the individual nodes.
[14:28:15] gehel so it's something different than just putting the nodes back into pybal?
[14:29:01] yes. Still in pybal, but pooling the cluster, not the nodes. See https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Remediation for examples
[14:29:45] ah OK, I have done this before. thanks
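Going back to the red-index cleanup a few messages up: the check dcausse described (red titlesuggest indices with no active aliases) is easy to script before deleting anything. A sketch, again with a placeholder endpoint; anything that still carries an alias deserves a closer look first:

    import requests

    # Sketch: list red indices and flag the ones with no active aliases,
    # which are the likely leftovers from an interrupted reindex.
    ES = "http://localhost:9200"

    red = requests.get(
        f"{ES}/_cat/indices",
        params={"format": "json", "health": "red", "h": "index,health,status"},
        timeout=30,
    )
    red.raise_for_status()

    for row in red.json():
        index = row["index"]
        aliases = requests.get(f"{ES}/{index}/_alias", timeout=30).json()
        alias_names = list(aliases.get(index, {}).get("aliases", {}))
        if alias_names:
            print(f"{index}: still has aliases {alias_names} - investigate before deleting")
        else:
            print(f"{index}: red, no active aliases - candidate for cleanup")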
[14:32:21] OK, back to green for all 3 elastic clusters in CODFW
[14:32:40] Cool! Once more, our elasticsearch clusters are super resilient!
[14:37:05] should we wait to repool CODFW WDQS? Still seeing quite a bit of flapping with elastic nodes
[14:38:19] inflatador: yes, we need to wait. Those nodes have not started to recover and are lagging behind by > 1h. https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=8
[14:49:47] ACK, waiting
[14:53:29] we still have higher than usual response times on codfw, but that's expected while shards are recovering: https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&from=now-3h&to=now&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&refresh=1m
[14:54:08] and tbh, the higher response time is only barely visible outside of the 99%-ile, so we're pretty good!
[15:49:48] we should wait before repooling anything
[15:55:33] looks like wdqs has mostly recovered on lag.
[15:55:42] dcausse: do we know if the update pipeline is healthy?
[15:56:13] if it's running it's healthy :)
[15:56:25] :)
[15:56:30] I don't see any harm in waiting, still seeing some flapping
[15:57:10] they still need to reboot a switch
[15:57:24] so there might be some downtime again
[15:59:01] \o
[15:59:07] o/
[15:59:11] looks like codfw is depooled?
[16:07:48] the elastic cluster there is idle
[16:08:02] yeah, you missed the fun. Ongoing switch issues in CODFW
[16:08:06] ebernhardson: there's a massive outage in codfw :)
[16:08:48] wow, that's going on a long time. graphs show idle for almost 30 hours now
[16:09:30] yes, the outage started around 1pm UTC
[16:09:37] 30 hours?
[16:09:51] https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&refresh=1m&from=now-2d&to=now&viewPanel=22
[16:10:02] since 1/16 07:28 UTC
[16:10:19] oh wow that's definitely not expected
[16:11:34] hmm, well i suppose wait till everything else is back online before poking it much
[16:11:40] Red indices wouldn't cause that, right?
[16:11:52] no
[16:11:54] shouldn't
[16:11:59] might mediawiki
[16:12:02] *be
[16:12:37] yea if mediawiki is depooled in that cluster that would do it, otherwise there is a config flag in mw-config that can do it but i doubt anyone else would change that
[16:12:52] codfw is still receiving writes
[16:14:23] yes mw@codfw seems to be depooled https://grafana.wikimedia.org/d/000000550/mediawiki-application-servers?orgId=1&var-site=codfw&var-cluster=appserver&var-node=mw2268&var-php_version=proxy:unix:%2Frun%2Fphp%2Ffpm-www.*&from=now-2d&to=now
[16:14:31] around the same time
[16:15:21] ahh indeed, yea looks like mw itself is depooled
[16:19:15] inflatador: re red indices, you should be able to run `python3 /srv/mediawiki/php/extensions/CirrusSearch/scripts/check_indices.py | jq .` on mwmaint1002 and get a report about what indices are missing/extra across the cluster. I always reference that before deleting anything
[16:19:51] taking over as IC for CODFW so might be out of pocket for a bit
[16:19:52] i ran it just now, it also suggests there is some sort of titlesuggest problem with the new wikis :S
[16:20:02] kk
[16:20:20] inflatador: thanks for stepping up as IC!
[16:22:56] np, most comms happening in mediawiki_security if anyone wants to follow
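On the earlier question of whether the codfw elastic cluster was idle or "still receiving writes": one way to check, independent of the dashboards, is to sample the cluster-wide indexing counter from the indices stats API twice and compare. A sketch, with a placeholder endpoint for whichever cluster is being checked:

    import time

    import requests

    # Sketch: estimate the current write rate from the _stats indexing counter.
    ES = "http://localhost:9200"

    def index_total() -> int:
        stats = requests.get(f"{ES}/_all/_stats/indexing", timeout=30).json()
        return stats["_all"]["total"]["indexing"]["index_total"]

    before = index_total()
    time.sleep(60)
    after = index_total()
    rate = (after - before) / 60
    print(f"~{rate:.1f} docs/s indexed over the last minute "
          f"({'idle' if after == before else 'still receiving writes'})")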
[16:40:48] dcausse: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/880976 should fix the problem with completion indices not cleaning themselves up. At a higher level, the problem is that my adjustment to return Status values from Util means we lose type information and phan can't report failures
[16:41:09] * ebernhardson wants generics :P
[16:41:09] looking
[16:41:12] :)
[16:42:42] phan claims to have "primordial support for generic classes" (https://github.com/phan/phan/wiki/Generic-Types), maybe we'll see if we can use that
[16:43:35] i suspect it won't be advanced enough for this use case, but can check it out
[16:52:05] dinner, back in a while
[16:52:49] Status might be tricky to adapt to this template thing perhaps
[16:54:28] in a quick test it's also not optional, adding the @template makes all call sites that use it report T as unknown. Not going to be able to go out across mediawiki and fix all those
[16:56:15] hm seeing plenty of unrelated phan issues
[16:56:32] phan config might have been updated?
[16:58:38] hmm, that should be set based on the composer.json, although i suppose a new version of phan itself might do it
[17:03:05] we are on latest mediawiki-phan-config, but that was released in october. This reports using phan 5.4.1, same version that passed last week. I can put up another patch to fix these but curious how the results are different
[17:13:45] must be something new though, my DNM patch from last week passed CI, now it fails phan
[17:14:28] oh, Elastica was updated in vendor
[17:14:51] ah yes, I merged a patch yesterday and only ran cindy
[17:24:40] easy enough to fix. The errors around string indices look like only deprecations, the code still accepts strings but complains, and the type annotations don't accept strings anymore
[17:29:40] oh wow, Search::addIndex would have caused massive deprecation spam
[17:31:42] hm, surprised that cindy did not catch https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/880982/1/includes/Search/SearchRequestBuilder.php
[17:33:19] it looks like the code still accepts strings, it just complains, so cindy should be fine
[17:33:44] amusingly all the code does is turn the Index object back into a string :P
[17:38:34] hmm, tests show that didn't all work as expected :S looking
[17:38:53] maybe just a mocking problem
[17:38:55] poor mocks I guess
[17:39:47] yea it's a blank Client mock
[18:05:07] going offline
[18:59:37] inflatador: ack, I'm back around now and catching up on backlog
[19:15:57] sounds good. I declined the pairing session, still working on the incident report
[19:21:49] inflatador: anything we should be following up on during our pairing session (even if you're not there)?
[19:22:21] inflatador: are you doing the incident report for that switch issue?
[19:22:55] yes, report at https://wikitech.wikimedia.org/wiki/Incidents/2023-01-17_asw-b2-codfw_failure_redux
[19:23:13] gehel re: pairing, nothing urgent I can think of
[19:23:34] WDQS data reloads still proceeding as of a few hours ago
[19:24:02] I see that wdqs@codfw has been repooled
[19:24:55] we still have to reindex over the incident period
[19:25:04] we can do that with ryankemper during our pairing
[19:44:39] wrt the codfw reindex, we're going to regenerate the last ~6 hours like so:
[19:44:42] https://www.irccloud.com/pastebin/DfdBEk3u/
[19:45:44] Actually, slight correction on the date ranges: `'2023-01-17T12:00:00Z' '2023-01-17T17:30:00Z'`
[20:17:57] inflatador: (not critical, just for your context) we disabled notifs on the not-yet-in-service wdqs codfw hosts so we won't have to keep extending the downtime while waiting for the data reloads/xfers/etc https://gerrit.wikimedia.org/r/c/operations/puppet/+/881000
[20:25:28] Thanks ryankemper, I think I was supposed to do that Friday but forgot ;(
[20:26:28] We always have more balls to juggle than hands :P
[20:42:43] ryankemper getting reports of lag on wdqs1016 ( https://phabricator.wikimedia.org/T327210 ), LMK if y'all are able to take a look
[20:43:33] we these servers online in https://phabricator.wikimedia.org/T314890 if that helps.
[20:43:43] err..."brought these servers online"
[20:57:07] (fixed now)
[20:57:58] as an aside, I'm very glad that https://phabricator.wikimedia.org/T238751 finally got implemented! felt like that one was hanging over our heads for quite some time
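For lag reports like the one on wdqs1016 (T327210), the usual manual check is to compare the schema:dateModified value that the updater maintains for <http://www.wikidata.org> against the wall clock. A sketch against the public endpoint; with direct access, point it at the individual host instead (the exact internal URL is not shown here and is an assumption):

    from datetime import datetime, timezone

    import requests

    # Sketch: read the last-update timestamp from WDQS and compute the lag.
    ENDPOINT = "https://query.wikidata.org/sparql"
    QUERY = "SELECT ?t WHERE { <http://www.wikidata.org> schema:dateModified ?t }"

    resp = requests.get(
        ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "wdqs-lag-check-sketch/0.1 (example)"},
        timeout=30,
    )
    resp.raise_for_status()
    value = resp.json()["results"]["bindings"][0]["t"]["value"]
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    lag = datetime.now(timezone.utc) - last_update
    print(f"last update {value}, lag {lag.total_seconds():.0f}s")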