[08:36:39] errand
[10:47:53] errand
[11:11:17] lunch
[11:48:01] ryankemper: for when you're around: elastic1043 seems to be down. Volans had a look. See -operations
[13:45:16] Errand, back in 20'
[15:57:06] \o
[15:59:27] so i ran the wcqs streaming updater overnight, it didn't generate a lot of fetch-failures but there are perhaps 2/hr
[16:19:04] o/
[16:19:41] ebernhardson: cool, looking
[16:20:53] with MCR filtering they must only be caused by the mysql replication lag
[16:22:02] ebernhardson: did you produce to kafka-jumbo or kafka-main?
[16:24:43] kafka-jumbo apparently :)
[16:27:08] seems like a lot, and MCR filtering does not seem to work
[16:27:45] e.g. M113360905, which does not have the mediainfo slot (https://commons.wikimedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=revisions&pageids=113360905&rvprop=roles%7Cids)
[16:30:04] hopefully it might be because you did not run the latest version of the code, with https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/745463 & https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/742944 ?
[16:30:46] oh, indeed i hadn't looked at the timestamps closely enough, it is producing quite a few
[16:30:59] no, this is the latest version from archiva, which includes that
[16:31:08] 0.3.96
[16:31:28] so something's not working as expected
[16:32:12] i looked at z's last fix. 0% chance i would have found that :P
[16:32:23] still no clue what it's doing or why that fixes anything
[16:33:29] the first filter (on hostname) was forgotten/erased if the MCR filter was applied
[16:35:58] might be some config mess
[16:36:17] i dunno, it's just such a simple change, a few lines outside of tests, but i can't seem to decipher it :P
[16:37:18] well, will ponder what this is doing. might come up with something
[16:40:56] yeah, not sure I understand what's happening here either :/
[16:51:37] hmm, do we need to do something separate to turn on apache commons logging? The taskmanager logs have a bunch of kafka/flink messages but nothing from http
[16:52:07] afaict, we are using log4j (flink defaults) as the backend?
[16:58:09] ebernhardson: yes, it's log4j by default. you can tune conf/log4j-session.properties or conf/log4j.properties (can't remember which one is used with yarn)
[17:04:25] the root logger is INFO by default, so nothing will come out of apache http components at this level if you're trying to debug this
[17:11:31] perhaps this needs a bridge like log4j-jcl, and that's what you're asking actually :)
[17:11:39] in which case I dunno, as I never tried :/
[17:14:06] poking it, still not sure. I suppose i don't even know that http logs would be meaningful; it was just suspicious they were missing. On the mw-oauth-proxy i had to do something odd to make apache logging work, but i'll have to figure out what that was
[17:24:48] elastic1043 has been down for a couple of hours. should we check mgmt or ignore?
[17:26:59] mutante: no clue what's going on, apparently v.olans took a quick look this morning but not sure he got to mgmt to debug deeper
[17:27:23] the cluster itself will stay healthy and is fine with 3 machines out
[17:27:43] it can probably take more than 3, but we take 3 out routinely
[17:27:57] ok, that's good!
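[editor's note] On the logging discussion above: a minimal sketch of what turning on HTTP client logging could look like, assuming the Flink distribution here uses the Log4j 2 properties layout in conf/log4j.properties (or conf/log4j-session.properties) and that the log4j-jcl bridge is on the classpath so Apache HttpComponents' commons-logging output reaches Log4j. The logger key name `httpclient` is arbitrary.
```
# Hypothetical addition to Flink's conf/log4j.properties (Log4j 2 properties syntax).
# Raises verbosity for Apache HttpComponents, which otherwise stays silent because
# the root logger defaults to INFO.
logger.httpclient.name = org.apache.http
logger.httpclient.level = DEBUG

# A Log4j 1.x-style config would use the equivalent single line instead:
# log4j.logger.org.apache.http=DEBUG
```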
[17:28:16] only thing is it was still in "unhandled"
[17:28:32] probably waiting for ryankemper to start, it's still early here
[17:29:14] \o
[17:29:21] Yeah, about to take a look
[17:29:22] o/
[17:29:37] ok, cool :)
[17:30:00] * ebernhardson now remembers, of course mutante is like 200 miles down the road and knows what time it is here :)
[17:54:09] Having trouble getting into the mgmt console. I'm on a new laptop so that's almost certainly why, but haven't quite figured it out yet
[17:54:12] Getting `Unable to negotiate with UNKNOWN port 65535: no matching key exchange method found. Their offer: diffie-hellman-group14-sha1,diffie-hellman-group1-sha1`
[17:54:46] My ssh config should be the same as it was previously, but maybe I'm on a different openssh version
[17:55:13] ryankemper: seems related: T171041
[17:56:21] dcausse: thx, nice find
[17:58:42] ryankemper: -oKexAlgorithms=diffie-hellman-group14-sha1 -oCiphers=aes256-cbc should work
[17:59:00] I say should because it doesn't on my macos, but on linux it should work
[17:59:23] volans: this box is macos actually
[17:59:52] Oh, that still worked though
[18:03:22] Looks like it might be a PSU failure or an unplugged cable or something, because the box is staying in a powered-off state
[18:06:35] (Filing ticket with dcops)
[18:07:46] Hmm, this host will be replaced soon with https://phabricator.wikimedia.org/T279158, probably worth it to just expedite that instead
[18:11:52] Trey314159: do you have any reindex work ongoing by any chance?
[18:13:03] ryankemper: not at the moment!
[18:13:13] (and none planned for this week)
[18:14:18] excellent, i'm gonna work on getting the new eqiad elastic hosts into service over today/tomorrow so we can decom 1043 (as well as the other healthy hosts)
[18:23:08] not all, but it looks like some of the failures on commons are due to a vandalism workflow; it apparently involves editing the page immediately prior to deleting it
[18:27:30] i guess i'll try to write up something to read the fetch-failures topic and try to characterize the failures
[18:27:46] poke the apis and try to explain them or something, i dunno
[18:29:50] if the delete happens right after an edit on a file with structured data attached, there's nothing we can do
[18:30:04] except be faster than the delete
[18:30:37] right, but then that means we can't really use the fetch-failures topic directly as a measure of what's wrong, that's why i was thinking about a process to read the topic and find the actual failures
[18:31:00] the reconciliation code I'm writing will poke at the mw api to double check, but that's not really helping
[18:31:16] * ebernhardson totally forgot about that
[18:31:51] another approach is getting a hint on the 404 response
[18:31:53] all i would be doing is poking the mw api as well :) Basically if the page doesn't exist anymore, assume it's not really a problem
[18:32:30] hmm, maybe something on the 404s directly
[18:32:51] I think we wrote a ticket about that
[18:33:02] but it's not entirely trivial to do on the MW side
[18:33:23] yea, mediawiki has an annoying bifurcation when things are deleted
[18:33:46] * ebernhardson wishes deletes were just another revision on top that "hid" it, but i understand that's actually quite complicated :P
[18:34:08] yes, I wish that too :)
[18:34:14] this is the ticket: T289770
[18:34:15] T289770: Add hints in response headers for 404 responses in Special:EntityData - https://phabricator.wikimedia.org/T289770
[18:35:16] but I was hoping that the edge cases like the one you describe were rare enough that just relying on the reconciliation strategy happening later in a spark job
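[editor's note] The mgmt key-exchange workaround above can also be made persistent in ~/.ssh/config instead of being passed on every invocation. A sketch, assuming the mgmt interfaces match a *.mgmt.*.wmnet hostname pattern (the Host line is a guess; adjust it to the actual naming):
```
# Hypothetical ~/.ssh/config stanza for the mgmt consoles.
Host *.mgmt.eqiad.wmnet *.mgmt.codfw.wmnet
    # Re-enable the legacy algorithms these controllers offer (per the error above),
    # appended to the client's default lists rather than replacing them.
    KexAlgorithms +diffie-hellman-group14-sha1
    Ciphers +aes256-cbc
```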
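[editor's note] For the idea floated above of reading the fetch-failures topic and poking the MW API to separate "page was simply deleted" from real failures, a rough sketch is below. This is not the reconciliation job mentioned in the discussion; the broker address and the event field name are assumptions, since the actual event schema isn't shown in this log, and it relies on MediaInfo ids being "M" plus the Commons page id.
```
#!/usr/bin/env python3
"""Sketch: classify entries in the fetch-failure topic by checking whether the
underlying Commons page still exists. Requires kafka-python and requests."""
import json

import requests
from kafka import KafkaConsumer

BROKERS = ["kafka-jumbo1001.eqiad.wmnet:9092"]        # assumed broker address
TOPIC = "eqiad.rdf-streaming-updater.fetch-failure"   # topic name from the log above
API = "https://commons.wikimedia.org/w/api.php"

consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,  # stop iterating once the topic is exhausted
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for msg in consumer:
    event = msg.value
    # Field name is a guess at the event schema; the log shows entity ids like M54514678.
    entity_id = event.get("entity_id")
    if not entity_id or not entity_id.startswith("M"):
        continue
    page_id = entity_id[1:]
    # Ask the MW API whether the page still exists; a now-missing page most likely
    # means the edit-then-delete pattern, not a real fetch problem.
    resp = requests.get(API, params={
        "action": "query", "format": "json", "pageids": page_id,
    }).json()
    page = resp["query"]["pages"].get(page_id, {})
    if "missing" in page:
        print(f"{entity_id}: page gone (deleted?), probably a false positive")
    else:
        print(f"{entity_id}: page still exists, worth a closer look")
```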
[18:35:28] would be sufficient
[18:35:56] if it's complicated... we can probably ignore them. The only real problem at this point is that the failure log has a bunch of not-really failures
[18:36:43] yes, we should get a sense of the false positives; the one I checked was due to broken MCR filtering
[18:37:09] i've checked 4, 3 were vandalism and the last i can't explain (and it didn't hit the retry loop)
[18:37:40] perhaps jumbo is having some delay
[18:37:59] I mean the retry should happen on events younger than 10sec
[18:39:14] i dunno, in this one the ingestion_dt is 15:48:23, and the failure message is tagged 15:52:29, seems it would have had enough refresh time. The msg here is Cannot fetch entity M54514678 revision 612591819
[18:39:48] but it doesn't include a url, which is different from the others
[18:40:39] can construct the url of course, and https://commons.wikimedia.org/wiki/Special:EntityData/M54514678.ttl?flavor=dump&revision=612591819 works currently
[18:41:17] hm.. looking
[18:41:57] I see: revision 612591819 failed 4 times, abandoning
[18:42:24] hm, fetch_error_type: UNKNOWN
[18:42:32] that's not a 404
[18:43:13] hmm, maybe my log is cut off, on my screen the next line after the revision # is `% Reached end of topic eqiad.rdf-streaming-updater.fetch-failure [0] at offset 590662`. maybe some odd console thing happening
[18:44:14] yes, testing, that's just `less` interacting badly with stdin and stderr
[18:44:20] :(
[18:44:25] :/
[18:44:57] seems i should have figured out shell redirection one of these years :P some day....
[18:45:04] it was backfilling: event time is 2021-12-08T12:37:45Z vs ingest_time at 2021-12-10T16:03:36.279Z
[18:45:16] during backfills we get a lot of false positives
[18:45:32] mostly deletes?
[18:45:53] yes, or "already seen" events, but that's another topic
[18:46:48] well, i guess i'll try to figure out which of these failures, if any, are mcr problems and try to ignore the rest
[18:47:15] are you going through the webproxy or the api-ro endpoint with http routes?
[18:47:33] so far pasting urls into my local browser :)
[18:47:40] I mean flink :)
[18:47:56] ahh, umm, it's set up however we set it up in video chat a month ago. lemme look
[18:48:41] I see http_routes in your conf, so that must be going through the MW app servers directly
[18:48:50] which is good
[18:49:47] still unsure why you've got "UNKNOWN" 4 times
[18:50:04] if it's rare it's probably not too bad
[18:51:37] ok, so maybe things are mostly working
[19:06:02] edit at 2021-12-14T18:58:07Z and delete at 2021-12-14T18:58:41Z; we could hope that running over kafka-main we'd get to the MW api quicker...
[19:11:18] not sure, ingestion time is 2021-12-14T18:58:08.743Z so something might have been buffered
[19:11:52] ah, you might be running without an initial state, so we must be buffering all events
[19:12:57] dinner
[19:15:58] hmm, yea i didn't run the bootstrap process
[19:34:57] Our search logstash instances are on `6.8.21` now, and on the elastic side of things all the rolling restarts are complete (eqiad/codfw/cloudelastic/relforge), so we're all done with the log4j mitigation work
[19:58:24] excellent
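[editor's note] On the `less` confusion above: kafkacat prints its "% Reached end of topic ..." notices on stderr, so when both streams feed the same pager they interleave with the payload. A sketch of keeping them apart, assuming kafkacat was the consumer in use (the broker host is an assumption):
```
# Consume the failure topic and page through only the message payloads;
# kafkacat's status lines on stderr are dropped (or redirect them to a file instead).
kafkacat -C -b kafka-jumbo1001.eqiad.wmnet:9092 \
  -t eqiad.rdf-streaming-updater.fetch-failure \
  -o beginning -e 2>/dev/null | less
```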