[08:33:02] pfischer, dr0ptp4kt: I don't see anything from you in https://etherpad.wikimedia.org/p/search-standup for this week. Could you make sure you update the standup notes by Thursday evening next week?
[08:54:48] Sure, sorry.
[08:57:19] gehel: A lot of my time went into onboarding (myself) for WDQS topics. Should that go into the notes?
[08:59:04] pfischer: yes, it's helpful for me to know what you spent time on, if only to validate that you did not just forget to report on it.
[09:02:07] Done
[09:29:17] pfischer: thanks!
[09:29:42] update published: https://wikitech.wikimedia.org/wiki/Search_Platform/Weekly_Updates/2024-05-03
[09:32:32] wondering if we can drop one icinga check (https://gerrit.wikimedia.org/g/operations/puppet/+/83a2a0a080d67e703b9b2ee351a7db146db5ffcc/modules/profile/manifests/elasticsearch/alerts.pp#77) now that we have https://gerrit.wikimedia.org/r/c/operations/alerts/+/1025453
[09:35:01] unsure they count the same things though; the icinga one depends on a cirrus metric that is able to inspect the JSON response, which might count some responses with HTTP 200 as errors, i.e. partial shard failures maybe?
[09:45:45] lunch
[10:57:58] thanks for the reminder gehel, missed my personal afternoon reminder (I have one additional on top of the team ones). Updated.
[11:48:59] going afk for a bit
[13:19:31] o/
[13:33:36] I think I'm OK with dropping the above icinga check, especially because it uses graphite metrics. I'll get a patch started if that is OK with everyone
[14:59:57] \o
[15:01:54] o/
[15:20:33] Patch for removing the ES backend failure check: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1026950
[15:27:58] going to ignore the WDQS lag alert for the moment, looks like it's moving through the backlog really quickly
[15:28:50] inflatador: I just depooled it, bot edits started hitting the maxlag throttling mechanism
[15:29:18] dcausse ah OK, that works
[16:00:22] workout, back in ~40
[16:02:51] random curiosity of the day: on desktop there are more fulltext search sessions from users with 1k+ edits than from users with 1-999 edits.
[16:07:59] Power users…
[16:50:40] back
[17:12:53] inflatador: I only realised yesterday that Monday is a public holiday here in Ireland, so I won't make it to the incident review Monday after all :(
[17:13:22] if you'd rather I be there, maybe we can push it back to the following review session
[17:14:06] topranks no problem, I think I got it. I have been working on that monitoring script a bit, was gonna ping you to take a look once I have a stable way to get all hosts in all pools
[17:14:23] ok cool
[17:14:38] and yep, by all means ping me on that or point me to what you've got, sounds great :)
[17:15:04] For sure. I'm doing a bit of a hack to work around config-master missing some services, but not too tough overall
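(Editor's sketch: the pool-listing approach mentioned above could look roughly like the following. It assumes the pybal pool files published at https://config-master.wikimedia.org/pybal/<dc>/<service> list one backend per line with 'host' and 'enabled' fields; the service names are illustrative, not the actual list used by the script.)

```bash
# Editor's sketch, not the actual monitoring script. Assumes pybal pool files at
# https://config-master.wikimedia.org/pybal/<dc>/<service> with one backend per
# line, e.g. { 'host': 'wdqs1012.eqiad.wmnet', 'weight': 10, 'enabled': True }.
# The datacenter and service names below are examples only.
for dc in eqiad codfw; do
  for service in wdqs wdqs-internal; do
    curl -sf "https://config-master.wikimedia.org/pybal/${dc}/${service}" |
    while read -r line; do
      host=$(grep -oP "(?<='host': ')[^']+" <<<"$line")
      enabled=$(grep -oP "(?<='enabled': )\w+" <<<"$line")
      [ -n "$host" ] && printf '%s/%s\t%s\tenabled=%s\n' "$dc" "$service" "$host" "$enabled"
    done
  done
done
```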
[17:27:35] lunch, back in ~1h
[18:00:36] back
[18:44:49] inflatador: got a patch up to shift 2023 to a graph split host. PCC running now while I take the dog out: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1027001
[18:45:06] ryankemper :eyes
[18:58:33] errand, back in ~20
[19:12:11] back
[19:28:14] gotta move something heavy... back in ~30
[19:29:41] back
[20:07:49] Alright, looks like I got PCC happy
[20:12:08] I suppose I should make sure first that transfer.py works across datacenters, since I'll be xfering from stat1006 to wdqs2023. If not then I'll just change the patch from 2023 to an eqiad host
[20:12:12] First things first though, grabbing lunch
[20:57:00] just got back, LMK if you need anything
[20:59:52] ryankemper +1'd the graph split patch
[21:15:02] back
[21:18:37] inflatador: does this command look reasonable? (haven't merged the patch itself yet but wanted to pre-compose the transfer.py cmd) `transfer.py --port 9876 --type file --checksum stat1006.eqiad.wmnet:/home/dr0ptp4kt/gzips/nt_wd_schol/ wdqs2023.codfw.wmnet:/home/ryankemper/nt_wd_schol`
[21:18:53] (usage: https://doc.wikimedia.org/transferpy/master/usage.html#)
[21:21:23] ryankemper how big is the file?
[21:21:26] Small worry I have is that --checksum might be very slow
[21:21:43] inflatador: 94G directory
[21:23:39] I would add `--verbose` and make sure port 9876 is open on dest... `iptables -I INPUT 1 -p tcp --dport 9876 -j ACCEPT` will do that. Just make sure you disable puppet on dest to keep the port open, and re-enable when done to close the port
[21:24:15] I'd use `no checksum` or `parallel checksum`
[21:25:19] I'd probably tar up the entire directory and checksum it by hand as a single file on src and dest
[21:26:04] inflatador: the port will be opened via https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/query_service/graph_split.pp so we shouldn't need to one-off it
[21:26:24] but I dunno, you could try with checksum enabled... I remember that being a problem when we were transferring a single very large file. Might not be a problem with a bunch of small files
[21:27:02] I'd say start w/checksum enabled and if you run into problems, tar up the dir and run the checksums manually
[21:27:12] and transfer without checksums
[21:30:05] Sounds like a plan
[22:11:42] Transfer complete. 94G on dest so it probably worked properly. Unfortunately the transfer.py output is very hard to read, and it's doing the cumin thing where it cuts out the part of the output you actually care about:
[22:11:45] `0.0% (0/1) success ratio (< 100.0% threshold) for command: '/bin/bash -c "[ ...ource.md5sum" ]"'. Aborting.`
[23:04:33] First chunk seemed to work fine, kicking off the rest of the reload:
[23:04:38] https://www.irccloud.com/pastebin/SZexnxdp/
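(Editor's sketch of the manual verification suggested above, for the case where --checksum has to be skipped: build an md5 manifest on the source, copy it over, and check it on the destination. The directory paths mirror the ones used in the transfer; the manifest location is an assumption.)

```bash
# Editor's sketch; paths taken from the transfer above, manifest path is an assumption.
# On the source (stat1006):
cd /home/dr0ptp4kt/gzips
find nt_wd_schol -type f -print0 | sort -z | xargs -0 md5sum > /tmp/nt_wd_schol.md5sum

# Copy /tmp/nt_wd_schol.md5sum to the destination, then on wdqs2023:
cd /home/ryankemper
md5sum -c /tmp/nt_wd_schol.md5sum
```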