[08:36:04] Good morning!
[08:36:23] o/
[09:33:47] errand+lunch
[13:16:09] o/
[14:08:29] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143897 CR to add new masters to EQIAD clusters if anyone has time to look
[14:14:14] ^^ NM, b-tullis took care of it
[14:51:54] getting a lot of `CirrusSearchNodeIndexingNotIncreasing` alerts, I guess because CODFW is depooled and nothing's really changing. I guess we might have to suppress these, but I'm not sure why they didn't fire for EQIAD when it was depooled
[14:54:56] inflatador: codfw should be pooled, and writes should still flow regardless of the pooling state
[14:55:29] dcausse oops, thanks! My coffee needs to be stronger ;)
[14:56:17] unfortunately that means we have a problem
[14:56:38] Could just be that the alerts need to be updated, will check
[14:56:47] pfischer: I see you merged https://gerrit.wikimedia.org/r/1135010, config patches should be deployed right after they're merged (generally during a backport window)
[14:57:26] Oh, so they don’t wait for a train?
[14:57:37] inflatador: I remember seeing some of these while migrating codfw
[14:57:51] pfischer: no, mw-config is not managed by the train
[14:59:16] dcausse: Alright, I’ll roll out my change during the next backport window then.
[14:59:25] sure, thanks!
[15:00:02] I've seen them for at least 3 CODFW hosts, maybe more...so far I haven't noticed any actual problems w/the hosts but I'll take a closer look
[15:01:36] pfischer: we're in https://meet.google.com/eki-rafx-cxi
[15:49:39] there's a question for y'all: https://en.wikipedia.org/wiki/Wikipedia:Village_pump_(technical)#Is_incategory_searching_broken?
[15:54:32] hmm, if the update hasn't hit hours later it probably isn't going to :(
[15:57:06] inflatador: I've started reindexing Czech wikis. I'm just doing two of the smaller ones on eqiad first.. assuming that goes ok, I'll move on to the rest
[15:57:32] Trey314159 thanks for the heads-up, checking for orphans now
[15:57:53] inflatador: LOL it already failed. Ugh
[15:58:09] https://gitlab.wikimedia.org/repos/search-platform/cirrus-toolbox/-/blob/main/cirrus_toolbox/check_indices.py this is the script I run to check for orphans, FWIW
[16:02:18] this incategory issue resembles T331127
[16:02:18] T331127: phantom redirects lingering in incategory searches after page moves - https://phabricator.wikimedia.org/T331127
[16:03:09] for incategory, first example, edit was 13:58, we have a rev_based_update with weighted tags also marked 13:58, then there is also a page_rerender at 14:46. Would have expected the rerender to at least fix it :S
[16:03:34] that does look awfully similar
[16:10:28] curious, codfw streaming updater logs in logstash are consistent up until 13:30, then they just stop.
[16:10:52] grafana agrees, codfw stopped making progress at 13:30, but why?
[16:13:16] pod started 13:30, failed connection to https://meta.wikimedia.org/w/api.php/?format=json&action=streamconfigs&all_settings=true, then has been sitting there and not auto-restarted
[16:17:29] restarted the consumer-search pod. Not sure why it didn't auto-restart, and i guess i would have expected an alert about low indexing rate
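A minimal sketch (not the actual updater code) of a reachability check against the streamconfigs endpoint the pod failed on; the URL is taken from the log above, while the retry count, timeout, and the assumption that the response carries a "streams" object are illustrative.

```python
#!/usr/bin/env python3
"""Sketch: check whether the EventStreamConfig endpoint the consumer-search
pod failed to reach on startup is responding. URL is from the log above;
retry/timeout values are arbitrary assumptions."""
import sys
import time

import requests

STREAMCONFIGS_URL = (
    "https://meta.wikimedia.org/w/api.php"
    "?format=json&action=streamconfigs&all_settings=true"
)


def check(retries: int = 3, timeout: float = 10.0) -> bool:
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(STREAMCONFIGS_URL, timeout=timeout)
            resp.raise_for_status()
            # Assumed shape: a JSON object keyed by stream name under "streams".
            streams = resp.json().get("streams", {})
            print(f"ok: got {len(streams)} stream configs")
            return True
        except (requests.RequestException, ValueError) as exc:
            print(f"attempt {attempt}/{retries} failed: {exc}", file=sys.stderr)
            time.sleep(2)
    return False


if __name__ == "__main__":
    sys.exit(0 if check() else 1)
```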
[16:17:39] weird
[16:18:45] I see no taskmanagers for that period
[16:19:03] now we see "CirrusStreamingUpdaterFlinkJobUnstable"
[16:19:07] it seems like it failed so early in the startup it never got that far
[16:19:26] probably because the newly stood up instance is now submitting data
[16:21:25] the "update rate too low" alert is based on flink metrics so if it was not there it did not fire :/
[16:22:10] :S
[16:25:01] we need something better than flink_jobmanager_job_uptime to check if the job is running...
[16:27:38] maybe we can just impute zeros on the missing data?
[16:29:30] hmm, i dunno...
[16:29:35] not sure how
[16:29:57] and there are no real holes in flink metrics regarding the job itself
[16:30:07] it kept making checkpoints apparently...
[16:32:27] perhaps something with: flink_jobmanager_numRegisteredTaskManagers{release="consumer-search"} > 1
[16:32:34] should have captured the issue
[16:32:50] it's at 0 during the downtime
[16:33:54] hmm, yea that sounds possible
[16:34:21] will add another alert
[16:36:32] thanks
[17:00:47] Dunno if anyone's seen this, but brett showed me how to look at AlertManager alerts in Grafana: https://grafana-rw.wikimedia.org/alerting/list?search=CirrusSearchNodeIndexingNotIncreasing
[17:01:38] nice, have been using logstash for that so far but grafana seems nicer
[17:02:08] do we graph lastBatchMedianLag anywhere?
[17:02:24] we added the estimate of end-to-end lag, but i'm totally missing it in the graphs. can add one
[17:03:51] Grafana's nice because it can take you directly to Explorer
[17:04:20] ebernhardson: here I think: https://grafana.wikimedia.org/d/8xDerelVz/search-update-lag-slo?var-slo_period=7d&orgId=1&from=now-7d&to=now&timezone=utc&var-source=000000018&var-job=search_codfw&var-threshold=600
[17:04:59] but went absent during the outage
[17:05:43] ahh, yea that looks like it
[17:06:27] uploaded https://gerrit.wikimedia.org/r/c/operations/alerts/+/1144617
[17:07:28] lgtm
[17:09:06] https://grafana.wikimedia.org/goto/QzMYo7-Hg?orgId=1 we had a lot of nodes fail the indexing check over the last ~3h or so, not sure why that happened but it seems to have cleared
[17:12:31] inflatador: seems to line up but ignored that completely, thought that was because of the migration, I should have looked closed :/
[17:12:44] s/closed/closer
[17:13:14] i totally thought the same and didn't even look closely at those :S
[17:16:19] My best guess is that metrics stopped being reported for a bit and it triggers based on the `increase` function, but I'm not sure. It happened to enough hosts that we should probably look
[17:16:49] well the general problem was the codfw updater failed to auto-restart, so no updates were flowing after 13:30
[17:17:18] i restarted the updater, david added a new alert that would have noticed the failure to restart. The existing alert had the same problem, no value causing no alert
[17:19:34] Oh yeah, I just saw that CR come through. I'll add a note to our docs to check the streaming updater when we see a bunch of CirrusSearchNodeIndexingNotIncreasing alerts
[17:45:25] odd...the UpdateSuggesterIndex.php log is on mwmaint1002 running against eqiad, but i don't see an eqiad log dir for the job in /var/log/mediawiki
[17:46:28] ebernhardson: yes... was searching for them but I'm afraid they were deleted when we marked that job 'absent' :(
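A rough sketch of the taskmanager check discussed above, queried via the Prometheus HTTP API rather than as an alerting rule; the metric name and selector are from the discussion, while the Prometheus endpoint URL is a placeholder assumption.

```python
#!/usr/bin/env python3
"""Sketch of the check discussed above: treat the consumer-search job as
broken if it has no registered taskmanagers -- or no data at all, which was
the failure mode here. The Prometheus URL is a placeholder, not real."""
import requests

PROMETHEUS = "http://prometheus.example.internal"  # placeholder endpoint
QUERY = 'flink_jobmanager_numRegisteredTaskManagers{release="consumer-search"}'


def taskmanagers_registered() -> bool:
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()
    results = resp.json()["data"]["result"]
    if not results:
        # An absent series was exactly the problem: no value, so no alert.
        print("no data for query -- treat as failure, not as 'no problem'")
        return False
    value = float(results[0]["value"][1])
    print(f"registered taskmanagers: {value}")
    return value >= 1


if __name__ == "__main__":
    raise SystemExit(0 if taskmanagers_registered() else 1)
```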
[17:46:37] oh, yea i suppose that makes sense
[17:46:43] dinner
[17:49:35] reviewing the code, error message feels like a coincidence. it should have had many more retry messages for that part to have failed it
[17:50:55] lunch, back in ~40
[18:38:28] back
[18:54:36] looks like we're gonna have to extend https://gerrit.wikimedia.org/r/c/operations/puppet/+/1140519 to the elastic hosts as well, I should've caught that one earlier
[19:14:25] CR for adding the cluster names (chi-,psi-,omega-) to our cert SANs: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143633
[19:15:27] inflatador: seems plausible to me, +1
[19:15:59] ebernhardson ACK, thanks. Will take a look at your DNS patch today as well...been a little underwater ;)
[19:16:09] no worries
[19:25:11] * ebernhardson knows about that trailing dot in dns...but never remembers when to use it :P
[19:28:55] yeah, I'm just aping what I see in the rest of the repo
[19:35:12] hmm, now I'm starting to get cold feet
[19:36:04] which part?
[19:36:10] I'm not sure we want to remove the DYNA record for `search.discovery.wmnet`
[19:36:36] hmm, is anything using that currently? I guess i was thinking it's unused. We could probably keep it though
[19:37:14] Oh yeah, good point. I guess we aren't using discovery records at all yet
[19:38:45] Except I guess we'll need to change https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/c9823e191d649ddce9ab9dc8f7a5ec1fd85bd5e4
[19:38:46] hmm, i guess it was added back in 2017, maybe safest to simply keep it
[19:39:27] yea we'll need to align things as they go down the chain
[19:40:14] I'll go ahead and merge/run DNS update since I don't see anything in codesearch except the above
[19:40:54] looks like i have https://gerrit.wikimedia.org/r/c/operations/puppet/+/1143622/ prepped for the envoy.yaml change
[19:41:05] although...hmm
[19:41:33] yea that should be it
[19:42:55] I rechecked that patch, let's see what Jenkins thinks
[19:50:38] hmm, I'm getting NXDOMAIN for `search-chi.svc.eqiad.wmnet` from cumin
[19:53:07] hmm
[19:54:11] I just pinged in Traffic. Not seeing anything wrong so far
[19:54:17] inflatador: you might need to do the bit from https://wikitech.wikimedia.org/wiki/LVS#Add_the_DNS_Discovery_Record
[19:54:28] although, i guess that's one step before this. hmm
[19:54:48] https://wikitech.wikimedia.org/wiki/DNS#Changing_records_in_a_zonefile ?
[19:55:59] I did run `authdns-update` which is just like puppet-merge, but I didn't do the dig one-liner
[20:01:16] hmm, this seems to imply that was all that was needed :S
[20:11:24] inflatador: i suspect what we need is https://gerrit.wikimedia.org/r/c/operations/puppet/+/1144647
[20:12:05] but not 100% sure...
[20:22:17] We might need some extra stuff in the puppet plans themselves? Or would that just be for creating a dummy service so that variables and DNS names match?
[20:23:33] ryankemper do we have a ticket for the https://slo.wikimedia.org/?search=wdqs-update-lag alerts yet? I can make one if not
[20:24:02] tbh i'm not entirely sure, it's kinda-sorta a new lvs, but not really.
[20:25:13] yeah, I'm still trying to wrap my head around this too
[20:26:44] I'll talk to Ryan about it at pairing as well. Basically as long as we stick to the original goal of switching traffic between datacenters and don't get too lost
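A small sketch of the sort of post-authdns-update resolution check that would have caught the NXDOMAIN above; search-chi.svc.eqiad.wmnet appears in the log, while the psi and omega names are assumed to follow the same pattern and may not exist.

```python
#!/usr/bin/env python3
"""Sketch: verify the new per-cluster service names resolve after an
authdns-update. Only search-chi.svc.eqiad.wmnet is confirmed by the log;
the other names are assumed to follow the same naming pattern."""
import socket

NAMES = [
    "search-chi.svc.eqiad.wmnet",    # from the log
    "search-psi.svc.eqiad.wmnet",    # assumed pattern
    "search-omega.svc.eqiad.wmnet",  # assumed pattern
]


def main() -> int:
    failures = 0
    for name in NAMES:
        try:
            addrs = sorted({info[4][0] for info in socket.getaddrinfo(name, 443)})
            print(f"{name}: {', '.join(addrs)}")
        except socket.gaierror as exc:
            # NXDOMAIN and similar resolution failures land here.
            print(f"{name}: resolution failed ({exc})")
            failures += 1
    return 1 if failures else 0


if __name__ == "__main__":
    raise SystemExit(main())
```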
[20:27:06] err...speeding up the process of switching traffic between DCs, that is
[20:27:44] i suppose the complication is mixing in the change from search -> search-chi
[20:33:11] Speaking of, I've been getting a lot of errors connecting to search endpoints lately. I opened T393911 to look into it, but have you noticed similar problems?
[20:33:12] T393911: Figure out why OpenSearch operational scripts frequently fail to connect - https://phabricator.wikimedia.org/T393911
[20:33:58] Like I just ran check_indices.py and got a connection error. I never had the problem when we were doing CODFW, but it happens almost every time I run a script against EQIAD
[20:35:55] hmm, can't say i've noticed it
[20:37:31] hmm, but i can reproduce it with a for loop
[20:43:05] hmm, tested search, search-https, and search-omega-https. Those lists from config-master all seem fine...
[20:49:19] without any proof...feels vlan-ish? but i dunno...i can't seem to find a host that can't be communicated with from a few test hosts
[20:49:39] but if i repeat requests for search.svc.eqiad.wmnet enough times (20-50) i will get a `no route to host` error
[20:49:50] Yep, that's exactly what I'm seeing
[20:50:26] but i took both the list of nodes in config-master, and those from /_cat/nodes, everything listed can be talked to from the 3 hosts i tried
[20:52:01] I thought maybe they added some rate-limiting somewhere in the stack, since it never happened during the CODFW maintenance. I guess I can try doing some calls to CODFW and see if I can get it to happen again
[20:53:38] codfw gets stuck at cirrussearch2091.codfw.wmnet
[20:54:12] 2091 has a hardware issue
[20:54:15] inflatador: yeah we do need a ticket for the slo alert
[20:54:38] https://config-master.wikimedia.org/pybal/codfw/search still shows 2091 as pooled
[20:55:21] we don't have answers today, only more questions ;P
[20:56:05] but if the problem is that broken hosts are not automatically getting depooled, at least that's something
[20:58:37] curiously though, i can do 100 requests to https://search.svc.codfw.wmnet:9243 without failure, search.svc.eqiad.wmnet doesn't get that far
[20:58:59] ryankemper I created T393966 for the SLO stuff. It's just a placeholder
[20:59:01] T393966: Update WDQS SLO lag queries to reflect graph split changes - https://phabricator.wikimedia.org/T393966
[22:03:35] still investigating, but it does seem like a VLAN issue again
[22:04:12] inflatador: yea, my best guess would be to get all the ip's and try them with `ip route get fibmatch` that cmooney showed in the prior one
[22:04:22] from an lvs host
[22:04:53] I'm around if I can help
[22:05:11] this is cirrussearch2091?
[22:05:43] topranks: we aren't 100% sure, but this example reliably gets no route to host from lvs: https://phabricator.wikimedia.org/T393911#10814017
[22:06:32] topranks actually no, I'm looking at eqiad for now. I think cirrussearch1123.eqiad.wmnet isn't directly connected to lvs1019, for example
[22:06:34] those messages can mean a variety of things I find (no route to host)
[22:07:05] we just had something page for an lvs, is this separate?
[22:07:13] should be separate
[22:07:13] I fired up our old script https://gitlab.wikimedia.org/repos/search-platform/sre/lvs_l2_checker but it doesn't seem to find it
[22:07:43] sry I'm mixing myself up, what host was the test script in the phab link above run from?
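A sketch of the kind of for loop used to reproduce the failure; the host and port are from the log above, while the request count is arbitrary and TLS verification may need the internal CA bundle rather than the default.

```python
#!/usr/bin/env python3
"""Sketch of the repro loop described above: repeatedly hit the LVS service
endpoint and count connection failures. Host/port are from the log; the
attempt count is arbitrary and TLS settings are left at requests defaults
(the internal CA bundle may be needed for verification)."""
import requests

URL = "https://search.svc.eqiad.wmnet:9243/"
ATTEMPTS = 100


def main() -> None:
    failures = 0
    for i in range(1, ATTEMPTS + 1):
        try:
            requests.get(URL, timeout=5)
        except requests.exceptions.ConnectionError as exc:
            # "No route to host" surfaces here when LVS forwards us to a
            # backend it has no connected route to.
            failures += 1
            print(f"request {i}: {exc}")
    print(f"{failures}/{ATTEMPTS} requests failed")


if __name__ == "__main__":
    main()
```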
[22:07:58] topranks: from mwmaint1002.eqiad.wmnet, but i also tried from stat1008.eqiad.wmnet (different vlans)
[22:08:09] same error
[22:08:49] np, I'm confused too :) . I've just run the `ip route` command from the script on `lvs1019`: https://paste.opendev.org/show/827744/
[22:10:49] The script does return `CRITICAL: Node ms-fe1016.eqiad.wmnet not connected at layer 2` and that device is in the same rack as cirrussearch1123: https://netbox.wikimedia.org/dcim/devices/6256/
[22:11:11] also the script immediately bails out at the first error, so if there are more it won't display them
[22:12:55] yeah lvs1019 is 100% missing that vlan
[22:13:09] private1-f8-eqiad (1061)
[22:14:12] so we fucked up and never added it
[22:14:26] give me a minute or two there guys
[22:14:28] No worries, who among us has not forgotten to trunk down a VLAN at some point or another
[22:14:36] racks F8/E8 went live a few weeks back
[22:14:40] no it's not an error
[22:14:45] it's a new rack
[22:17:00] hey, if nothing else this gives us some justification to polish up that script and try and re-submit it, does that sound OK? I'm pretty sure it would've caught this if it were active
[22:29:23] yeah it would have... also we need to add this to our checklist for go-live for these things
[22:45:39] OK, time for my son's orchestra recital...see y'all tomorrow!
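For reference, a rough sketch of the kind of layer-2 check discussed above (not the actual lvs_l2_checker code), assuming it runs on an LVS host with iproute2 available; the backend hostnames are taken from the log as illustrations, and the "via gateway means not directly connected" test is a simplifying heuristic.

```python
#!/usr/bin/env python3
"""Rough sketch of the layer-2 check discussed above: for each backend,
ask the local FIB (from an LVS host) whether the route is directly
connected or goes via a gateway. Hostnames are illustrative examples."""
import socket
import subprocess

BACKENDS = [
    "cirrussearch1123.eqiad.wmnet",  # from the log
    "ms-fe1016.eqiad.wmnet",         # flagged by the script in the log
]


def fibmatch(ip: str) -> str:
    # `ip route get fibmatch <addr>` prints the matching FIB entry.
    out = subprocess.run(
        ["ip", "route", "get", "fibmatch", ip],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def main() -> int:
    missing = 0
    for host in BACKENDS:
        ip = socket.gethostbyname(host)
        route = fibmatch(ip)
        # Heuristic: a connected route has no "via <gateway>"; falling back
        # to a gateway suggests the LVS host is missing that vlan.
        if "via" in route.split():
            print(f"CRITICAL: {host} ({ip}) not directly connected: {route}")
            missing += 1
        else:
            print(f"OK: {host} ({ip}): {route}")
    return 1 if missing else 0


if __name__ == "__main__":
    raise SystemExit(main())
```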