[07:39:00] sigh... mul stems did not appear to be indexed according to the new mapping on testwikidatawiki...
[07:41:09] ah no... it's me, was not looking at the right index
[07:42:24] surprised to see that testwikidatawiki is quite large with more than 200k items
[07:43:28] cirrus-reindex seems to still be sending helmfile notifications
[07:45:58] ah my fault, this fix was not yet released
[08:08:23] weird, testwikidatawiki.reindex.log was not appended to on the second run for eqiad...
[08:08:49] seems to be opened with "at" so should append...
[08:13:33] actually "t" does not appear to be a valid file open flag in python
[08:16:07] oh wait, moving from 0.3 to 0.4 log files are moved to their own "$cluster_name" folder
[08:20:41] and 't' is actually correct, just not needed since it appears to be the default
[08:33:18] reindexing wikidata@eqiad, it triggered the backfill before the reindex was complete
[08:33:53] "Starting backfill on 1 wiki(s) for 2024-08-08T07:52:52.523870 to 2024-08-08T08:02:49.607243"
[08:34:39] I should have looked at what was inside eqiad/state.json before it ran
[08:39:22] oh actually it's quite smart, it ran the backfill for testwikidatawiki from the previous run (I'm sure it ran tho)
[08:40:13] I think I'm using the script in a weird way, probably, re-running with different --dbnames options but without resetting its state
[08:48:55] seems like it detected that the backfill for testwikidatawiki_content (previous invocation) was not run and scheduled it
[08:49:54] and it seems that it's true, the previous invocation ended with "Completed reindex of testwikidatawiki content on the cloudelastic cluster" and return code 0
[08:51:28] without triggering a backfill for this index type
[10:04:05] lunch
[12:22:16] going to start the reindex on codfw with an empty state this time
[12:45:27] o/
[12:47:09] e-bernhardson do we have bash in the flink container? You can open TCP connections w/bash, re https://who23.github.io/2020/12/03/sockets-in-your-shell.html
[12:53:38] python should be in the flink containers, this is weird
[12:53:55] we use the same image for flink jobs written in python
[12:54:47] I was thinking of rewriting the container with blubber, but since it's java the advantages are a bit less
[12:55:33] ryankemper did you run a maintenance on wdqs codfw last night? Just wondering as we had some alerts come into our IRC channel
[12:59:11] the flink app images are built with blubber no? https://gitlab.wikimedia.org/repos/data-engineering/mediawiki-event-enrichment/-/blob/main/.pipeline/blubber.yaml?ref_type=heads
[12:59:16] or you mean the base one?
[13:00:31] I think Erik wants to enter the one running in prod
[13:01:24] hmmm, maybe I'm thinking of the flink operator image?
[13:01:58] I doubt that he wanted to enter the flink-op image
[13:02:40] agreed, just that I might be misremembering which image uses a dockerfile
[13:04:11] the base images are built with Dockerfile in operations/docker-images/production-images but the images we deploy are built with blubber to include the job artifacts
[13:04:27] I created T371549 the other day in hopes of getting more visibility around this kinda thing
[13:04:27] T371549: Publish more metadata tags with docker registry images - https://phabricator.wikimedia.org/T371549
[13:05:04] sounds nice
[13:23:28] \o
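(A minimal sketch tying back to the 08:08–08:20 exchange about the reindex log's open mode: in Python, "a" appends rather than truncates, and "t" is simply the default text mode, so "at" and "a" behave identically. The file name here is purely illustrative, not the real log location.)

```python
log_path = "testwikidatawiki.reindex.log"  # illustrative name only

with open(log_path, "at") as log:   # "a" = append, "t" = text (already the default), so "at" == "a"
    log.write("first run\n")

with open(log_path, "a") as log:    # reopening in append mode keeps existing contents
    log.write("second run\n")

with open(log_path, "rt") as log:
    print(log.read())               # both lines present: appended, not truncated
```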
[13:24:22] dcausse: re: reindexing, indeed the state.json / output dir expects that a single state is for a single run and that the dbnames are the same each time. If you provide different dbnames i suppose it adds them to the dbnames it already knows. Perhaps it could use more input handling there to instead reject or something
[13:24:36] basically running again picks up where it left off
[13:25:23] the intent was it would run the second time with the exact same cli options as the first time, to re-start after a script failure of some sort
[13:27:31] inflatador: interesting idea re: bash, i suppose i had heard of that before but didn't think of it. Will try it out. Indeed i mostly wanted to see that i got the netmasks for auth right before i deploy the full app and have odd errors because it can't talk to mw for some reason
[13:32:07] o/
[13:32:18] ebernhardson: yes, that's a nice feature
[13:32:55] looks like WDQS codfw hosts are flapping again ;(
[13:35:39] yes seems like there's some mem pressure and possibly that jvmquake is kicking some hosts
[13:36:36] double checked in the taskmanager container, `find / | grep python`, nothing :(
[13:37:21] do you think I should depool CODFW? I'm in pairing ATM so I didn't want to interrupt unless it was a serious issue
[13:37:46] inflatador: I fear that if you depool codfw the bad actor is going to hit eqiad
[13:38:43] dcausse ACK, I can start looking into it now
[13:40:18] following runbook https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Identifying_Abusive_Traffic
[13:43:08] yes I can see "(jvmquake) Excessive GC: notifying killer thread to trigger OOM"
[13:45:32] bash tcp connections worked, cool! But the NetworkSession extension rejected my request :P I guess it was good to test...
[13:49:10] Still looking in superset...I'm super rusty with this
[13:53:22] ebernhardson dcausse if either of y'all have time to look at the WDQS stuff I'm in https://meet.google.com/gex-nyfq-goc . Just wanna sanity-check some stuff before I start banning
[14:15:22] inflatador: sorry, i had an hour earlier but it's the first day of back to school so i'm prepping liam now
[14:58:55] back
[15:12:30] err, that's going to be annoying. The api returns 200 OK for a failed auth :S
[15:17:11] @ryankemper @inflatador @Trey314159 @dcausse @ebernhardson reminders: (1) standup updates on the earlier side today if possible, please; i'm hoping to asana that stuff in my early afternoon. and tomorrow is a global org holiday, i hope you enjoy it!
[15:17:50] * dr0ptp4kt wonders how i forgot the (2) before the second item
[15:18:56] it was implied.
[15:19:37] Thanks for the reminder about the holiday. That one snuck up on me!
[15:20:53] indeed, i wouldn't have remembered. thanks!
[15:21:32] i'm also realizing i don't have any clue how the new mw-on-k8s works. I don't see the mediawiki code anywhere inside the instances :S
[15:21:57] * ebernhardson just wanted to verify the private repo updates made it into the containers
[15:26:04] oh, of course. it's because there are many containers per pod and not all mount the code
[15:29:32] ebernhardson: looks like it works, note the search-omega-eqiad POST /_msearch in there https://trace.wikimedia.org/trace/46908675e23a27dee486ea8b635eec4e :)
[15:32:23] cdanis: nice!
[15:32:34] https://trace.wikimedia.org/trace/65109dd530c201feb6d5d9db557e89ab
[15:33:12] also looks like something needs to run all those mw-api-int requests in parallel, cool to have something to see that
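(A rough illustration of the "run those requests in parallel" point above, sketched in Python with a thread pool; MediaWiki's actual request handling is PHP-side, so this is only a generic picture of overlapping blocking HTTP calls. The URLs are placeholders.)

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

# Placeholder endpoints standing in for the per-wiki mw-api-int calls seen in the trace.
urls = [
    "https://example.org/w/api.php?action=query&meta=siteinfo&format=json",
    "https://example.org/w/api.php?action=query&meta=userinfo&format=json",
]

def fetch(url):
    # One blocking HTTP request; the pool overlaps them instead of running serially.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return url, resp.status

with ThreadPoolExecutor(max_workers=4) as pool:
    for url, status in pool.map(fetch, urls):
        print(status, url)
```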
[15:36:02] second one is curious, shows it had to fetch cross-wiki config for everything. I wonder if that's common and we need better caching
[15:59:11] yes, made https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CirrusSearch/+/1051841 when I saw this trace, I'm 99% sure that's what Aaron intended to do in the patch linked from the comment
[16:19:56] oh interesting, i guess i forgot about that patch. Will review again and can probably merge today
[16:34:24] ryankemper inflatador able to hop on a meet for a little bit to catch up on wdqs graph split? david and i are available, just finished fortnightly wdqs graph split community call
[16:44:20] dr0ptp4kt unfortunately no, working on the WDQS abuse issue ATM
[16:44:53] 😵‍💫
[16:45:03] online now
[16:45:04] inflatador: did you still need help with that? I'm available now
[16:49:30] ryankemper able to hop on with david and me? looks like ebernhardson can help with inflatador fighting wdqs abuse
[16:50:05] ebernhardson not ATM, but if this requestctl rule doesn't work I might need another look
[16:51:14] dr0ptp4kt: yeah, sounds good. can you send the meet link? i'll be there in 3m
[17:04:48] dcausse are you still running the "bad query" on wdqs1020? Just wondering as old QC is still high...LMK if so and if you're OK w/me killing the query
[17:06:46] inflatador: no I'm not testing anything there
[17:08:03] dcausse ACK, restarted BG
[17:08:28] specifically the wdqs-blazegraph.service on 1020
[17:18:17] inflatador: was it causing issues?
[17:22:15] dcausse not sure, I just restarted purely based on the increasing GC which started at about 1535 UTC, seems to match the time we were pairing? Regardless, doesn't seem to have made a difference
[17:31:50] finally realized i can get mediawiki to tell me my ip address via Special:MyContributions as an anon. It turns out this particular request is seen as coming from 10.64.0.222, even though the container shows its ip as 10.67.149.11
[17:33:06] which falls in the private1-a-eqiad vlan, i was expecting it to come from a wikikube vlan like 10.67.128.0/18
[17:35:27] interesting
[17:35:30] reverse dns finds mw1460, maybe some artifact of mw* hosts being added into the k8s cluster as part of the k8s migration?
[17:35:46] but still, i would have thought the request would come from the container ip, not the host ip
[17:36:53] agreed, is that IP the host's main IP or is it part of the CNI (calico pool) maybe?
[17:37:18] I guess that's hard to tell unless you have host access
[17:38:03] yea, i'm not sure. I only have a bash shell inside the container
[17:38:16] with quite limited utilities (no traceroute or whatnot to see if it's on the path)
[17:38:46] do you know which host it's on? I can take a quick look if you like
[17:38:57] inflatador: probably mw1460, that's what the reverse dns finds
[17:39:23] doing a reverse traceroute, from deploy host to the container ip, it gets to that same ip and then stops responding
[17:39:55] at the end of the day, i guess what i need is to figure out the appropriate netmasks to limit these requests to. I might just add each row of the eqiad and codfw datacenters? Was hoping for something a bit more narrowly tailored
[17:40:20] oddly enough that does seem to be the main IP for the host, bound to eno1
[17:40:55] weird, so somehow the container requests are seen as coming from the host ip address. In this case the source and destination containers are both inside the same wikikube k8s cluster
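(For the netmask question above, a small sketch using Python's stdlib ipaddress module to check whether an observed source address falls inside a set of allowed ranges. The ranges listed are examples only, not the real allow-list.)

```python
import ipaddress

# Example ranges only -- stand-ins for whatever vlans/rows end up on the allow-list.
allowed_networks = [
    ipaddress.ip_network("10.64.0.0/22"),    # e.g. a private1-a-eqiad style range (assumed)
    ipaddress.ip_network("10.67.128.0/18"),  # e.g. the wikikube pod range mentioned above
]

def is_allowed(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in allowed_networks)

print(is_allowed("10.64.0.222"))   # True: the host IP the request appeared to come from
print(is_allowed("10.67.149.11"))  # True: the container IP
print(is_allowed("192.0.2.10"))    # False
```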
[17:41:08] inflatador: various hosts in https://config-master.wikimedia.org/pybal/codfw/search are still depooled. we prob forgot to repool them after some of the previous maintenances. any reason for me not to repool the 6 depooled hosts?
[17:42:15] hmm, wish i could see the http headers as received by mediawiki, maybe something wonky is going on with the x-forwarded-for or something...
[17:59:48] ryankemper thanks for finding that...nah, feel free to repool
[18:00:00] random other thing i found, flink is exporting an `endpoint` resource for the rest api's. Wonder if that's useful for anything
[18:10:12] https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service needs updating eh?!
[18:11:41] i'm looking for WDQS streaming updater docs
[18:12:07] ottomata: https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater
[18:12:33] well i'll be!
[18:12:38] link from main WDQS page?
[18:12:39] ty!
[18:12:53] i dunno, for some reason i usually use the search functionality :P
[18:13:05] i searched but of course just was directed to the main page
[18:13:05] thank you
[18:13:10] no worries
[18:13:19] ottomata ACK, I ran into the same issue looking for the WDQS runbook. I need to update docs ;(
[18:13:33] the pipeline image on the main page is still the pre-flink pipeline
[18:14:40] that's...not ideal
[18:33:05] inflatador: will be at pairing in 3 mins
[18:33:16] ryankemper ACK
[18:34:44] FYI I'm using the network probes dashboard to see if the WDQS mitigation worked https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fcustom&var-module=All&orgId=1&viewPanel=3&from=now-3h&to=now
[18:36:20] Since we've throttled to 1 connection (1700 on the graph) it looks like we've only had 1 flap. Seems to work, but probably want to wait a while to be sure
[18:52:15] sweet, /me crossing fingers
[20:16:52] * ebernhardson gives up on understanding networks and puts the generic ip ranges in the list...
[21:12:04] hmm, How_to_deploy_code still references scap sync-file, but i have a suspicion that is not really a thing anymore
[21:12:27] it's still a target scap can invoke though
[21:16:01] ebernhardson: `scap backport` and `scap train` are the most commonly used commands these days despite what the wiki may still mention.
[21:16:18] bd808: hmm, it's neither of those though. I'm updating PrivateSettings.php
[21:17:06] things need to build docker containers and deploy them today to change code or config, so sync-file is pretty useless
[21:17:23] scap backport is perfect for config changes
[21:17:45] hmm, i'll have to check if it can backport without a patch, or i guess i can make a no-op patch
[21:18:03] uh... right
[21:18:47] ok you have blown out of my drive-by jerk knowledge into other realms. ping brennan or tyler :)
[21:19:03] bd808: sure :) I can easily make a patch with no changes and backport that, not a big deal
[21:20:35] sync-file *probably* does the right things, but better to ask an active scap internals nerd
[21:27:11] looks like cloudelastic-chi is red, probably an orphaned alias...checking it out now
[21:27:19] oops nm, back to yellow
[21:45:54] ryankemper headed out for the day, I restarted relforge and cloudelastic. Also created T372113 to restart logstash processes in the cookbook
[21:45:54] T372113: Enhancement: Support restarting logstash processes during elastic rolling operation - https://phabricator.wikimedia.org/T372113
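(Related to the cloudelastic red/yellow check at the end: a hedged sketch of polling Elasticsearch's _cluster/health and _cat/aliases endpoints with stdlib urllib. The base URL is a placeholder and the real clusters sit behind auth/LVS, so this is illustrative only, not how the team's tooling does it.)

```python
import json
import urllib.request

BASE = "https://cloudelastic.example:9243"  # placeholder, not the real endpoint

def get_json(path):
    with urllib.request.urlopen(BASE + path, timeout=10) as resp:
        return json.load(resp)

# Overall cluster status: green / yellow / red, plus how many shards are unassigned.
health = get_json("/_cluster/health")
print(health["status"], "-", health.get("unassigned_shards", 0), "unassigned shards")

# List aliases to eyeball anything orphaned or pointing at an unexpected index.
for row in get_json("/_cat/aliases?format=json"):
    print(row["alias"], "->", row["index"])
```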