[06:58:12] inflatador: I just had time to read the email. Good one!
[06:58:25] Should we save it as a template as part of the WDQS runbook?
[13:09:26] \o
[13:16:19] gehel np, dr0ptp4kt and ryankemper helped as well
[13:24:15] ebernhardson LMK if/when you need that cirrus secret added
[13:25:04] inflatador: still trying to debug why it doesn't work :(
[13:25:14] my usual tooling doesn't work in k8s :P
[13:26:22] maybe we can look at it at pairing today. I need to learn how to do the whole "debugging sidecar" thing too... haven't exactly been looking fwd to it either ;P
[13:27:44] i read a little into it to understand what it does, but didn't really figure out how to start one in our env
[13:28:04] it seems it's mostly about having a second container with all the same cgroup conf / namespaces / etc, so it basically "sees" the same stuff
[13:31:22] inflatador: I'll be 2 minutes late
[13:31:32] gehel ACK, np
[13:35:02] and maybe a few more
[13:35:22] np
[13:40:47] I found `scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values --set wmfdebug_enabled=true stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging]` in the SAL, maybe adding that `--set wmfdebug_enabled=true` would do the trick?
[13:41:00] in the helmfile, that is
[13:41:15] or as an additional var at the cmdline?
[13:41:37] hmm, so --set isn't going to do anything on its own, it's going to make some variable available to helmfile templating. The question would be where that gets used
[13:42:17] I don't see that string in deployment-charts
[13:42:20] `wmfdebug_enabled`, that is
[13:42:53] me either :S i do see mention of it in `git log -Swmfdebug` but the references are old and are removing it
[13:45:27] actually looks like there are a few explicit references to wmfdebug in eventgate and eventstreams
[13:46:53] for eventstreams that sets a number of rather arbitrary things, probably related to making debugging possible
[14:17:07] meh, the summary is filesystems are mounted nosuid, so sudo can't work, meaning tcpdump can't work :S
[14:17:48] maybe the wmfdebug image can start as root and stay that way instead of running as a lower-perm user? I dunno... i guess i can try it out
[14:21:07] i suspect threading that through the flink chart is going to be more tedious :P
[14:22:29] Hello. We are deploying your airflow-dags to an-airflow1005.eqiad.wmnet due to changes required by airflow 2.9.3.
[14:23:01] btullis: thanks! i had pondered doing the same but figured you all would get to it after things are stable
[14:23:33] ebernhardson: Ack, apologies for the slight bumpiness.
[15:06:46] i'll poke around at the other options first. if we can have a generic debuggable environment that would be excellent, but if it's not going anywhere, dropping down a level or two in abstractions usually works (but needs more permissions)
[15:43:11] hmm, gerrit not responding :S
[15:44:30] might be comcast :S mtr shows significant packet drop between sfba.comcast.net and Chicago3.net.lumen.tech :(
[15:45:51] wonder if i can force gerrit connections to route through codfw, that seems to work fine
[16:12:56] i suspect https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1062438 does what we need, maybe it needs a runAsUser: 0 in securityContext to enable tcpdump, not sure. Anyone willing to +1 and we can try it out?
[16:17:38] dr0ptp4kt: want to come to sre pairing today in a little over 2 hours? we can go over data reload validation and whatnot
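(As a rough sketch of the runAsUser idea ebernhardson floats at 16:12:56 above: enabling a debug sidecar plus root in a staging values override might look like the following. The key names `debug.enabled` and the securityContext placement are assumptions for illustration, not taken from the actual flink-app chart or the linked patch.)

```
# Hypothetical staging values override; key names are illustrative only.
cat >> staging-values.yaml <<'EOF'
debug:
  enabled: true      # start a wmfdebug-style sidecar next to flink-main-container
securityContext:
  runAsUser: 0       # run as root so tcpdump isn't defeated by nosuid mounts
EOF

# Then deploy with the usual WMF workflow from the service's helmfile directory:
helmfile -e staging -i apply
```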
[16:21:38] ebernhardson just +1'd, working out but will be back in ~40 if you wanna pair on it. Otherwise we can wait till our normal 11:30 PDT
[16:57:31] well, guess today is learn-more-about-k8s day. Enabling the debug sidecar in staging causes the current pods to shut down, and then nothing :P
[16:59:37] what does `kubectl get events` have to say?
[17:00:24] oh interesting, learning more already :) It says Error: cannot find volume "flink-config-volume" to mount into container "flink-main-container"
[17:00:30] which suggests i misformatted something i suppose
[17:00:55] or, actually, also here -- https://logstash.wikimedia.org/goto/bc563888ba8ef91e6e9e455d41a1fe1f
[17:01:27] ah, and there's also uh this
[17:01:40] https://logstash.wikimedia.org/app/discover#/doc/logstash-*/logstash-k8s-1-7.0.0-1-2024.08.13?id=fG6rTJEBDD1VxFBSFo6Y
[17:02:52] ahh, i had wondered if that was going to play nice, i was hoping the restrictions stayed on the main container and didn't affect the -debug container
[17:03:21] have some threads to pull on now, that's a big help!
[17:04:32] not 100% sure i need ptrace, but was thinking if tcpdump didn't play nice due to some networking shenanigans, strace -e trace=network would be able to see the request/response cycle and would need SYS_PTRACE
[17:05:40] I forget, are you a root?
[17:05:53] cdanis: only on our servers, not in general
[17:05:59] hmmmm
[17:06:31] i wonder if we could extend that to the staging cluster k8s worker nodes as well, that would let you do things like, as root on the machine, launch tcpdump in the netns of the pod (... I think)
[17:06:41] maybe i should try... have pondered it before and suspect i'm trusted enough that it might go through, but it's also nice to not be blameable :P
[17:06:44] haha
[17:07:03] btw there's also #wikimedia-k8s-sig for when you get stuck on something like this
[17:07:13] +1 to cdanis' suggestion, I'd never be able to helm chart w/out that approach ;)
[17:08:15] root on the staging cluster would perhaps be viable, and would make this whole thing much easier. I wonder if that could also apply to david and peter. Part of the reason i was digging more into this than having brian do it during pairing was to make it a generally available debugging ability
[17:09:42] what's the larger context behind what you're debugging?
[17:09:55] oh lol and I just saw this earlier post of yours 11:45:52 <+ebernhardson> wonder if i can force gerrit connections to route through codfw, that seems to work fine
[17:10:23] there is an http request/response cycle between our app and mediawiki. In my manual tests from inside the container using bash tcp it works, but when the app does it it fails. I wanted to see the raw request/response to see which side is doing it wrong (i wrote both sides)
[17:10:40] https://wikitech.wikimedia.org/wiki/Wmf-laptop this package includes this script "tunnelencabulator" https://gerrit.wikimedia.org/r/plugins/gitiles/operations/debs/wmf-sre-laptop/+/refs/heads/master/scripts/tunnelencabulator
[17:11:00] which you can also just install from right there, which will dns-re-route you to another datacenter
[17:11:39] ok, I guess one other good option for debugging might be envoy logs? (is traffic both ways going via envoy?)
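(cdanis's node-level approach at 17:06:31 would look something like the sketch below, run as root on the k8s worker node hosting the pod. It assumes a containerd runtime reachable via crictl; the container name and port 6500 are taken from elsewhere in this log, and the inspect template is worth double-checking against the local crictl version.)

```
# Find the app container and its host-side PID (names are illustrative).
CID=$(crictl ps --name flink-main-container -q | head -n1)
PID=$(crictl inspect -o go-template --template '{{.info.pid}}' "$CID")

# Run tcpdump from the host inside the pod's network namespace; the pod
# itself never needs elevated privileges or a debug sidecar for this.
nsenter -t "$PID" -n -- tcpdump -i any -nn port 6500
```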
[17:11:40] Reading those comments always cheers me up ;P
[17:12:33] we talked a bit about making some more user-friendly, non-root-requiring debugging tools at DPE SRE stand-up today... first step is taking inventory of what we already have
[17:12:35] yea it does go through envoy, i had seen that it might be possible but was wary of logging because the parts i'm debugging are mediawiki auth. Although i can always change the tokens after debugging
[17:13:10] essentially this is a new auth scheme that lets our app have read access to all the private wiki content that's not supposed to be read
[17:13:13] right
[17:14:53] what fqdn is your app calling mediawiki at?
[17:15:14] localhost:6500, which goes to mw-api-int-async iirc. double checking
[17:16:18] looks like it's been renamed, now mw-api-int, via mw-api-int.discovery.wmnet
[17:17:37] hm
[17:17:39] for a quick hack, in staging, you could perhaps re-point it at the public address, and add an x-wikimedia-debug header so it lands on an mwdebug bare-metal host, and dump the request/response there
[17:18:20] dr0ptp4kt: yeah 35 mins later works. adding u to invite
[17:19:25] thx
[17:19:40] cdanis: hmm, yea that does seem possible. Btw do you know where the `PodSecurity "restricted:latest"` comes from? Not seeing that anywhere in the flink-app chart (but custom policies are found in other charts)
[17:20:50] suspecting i could elide base.helper.restrictedSecurityContext when debug is enabled, but not clear if that's the only bit
[17:21:21] yeah, I think that comes from cluster-wide settings
[17:22:05] https://wikitech.wikimedia.org/wiki/User:JMeybohm/PSP_Replacement i think is probably the best 'starting point' if you are interested in a lot of background context
[17:23:30] and I'm not sure if there's any control mechanism around asking pods to run as 'privileged' security context, or not
[17:23:37] aside from knowing that we do that for mediawiki at present
[17:24:43] hmm, makes sense. I'll see what i can do through envoy then, it seems like trying to work around the security restrictions isn't going to go very far
[17:27:37] ryankemper I just started a rolling reboot of elastic eqiad via cumin2002
[17:28:41] ack
[17:40:22] I'm seeing the elastic health checks go into red a bit too often, dialing back to 2 hosts at a time
[17:41:43] nah, that didn't help. Must be a hanging alias
[17:49:25] didn't find a hanging alias w/curl... hmm
[17:55:45] Correction: I'm doing restarts of the elastic processes as opposed to reboots. The main cluster is going into red about 30 seconds/restart. Not sure if this is worth pursuing but it seems like more red than I'm used to seeing
[17:57:29] that does seem odd, red means all 3 shards of something are missing
[17:58:02] eqiad or codfw
[17:59:00] eqiad
[17:59:07] what it could be is a reindex that didn't get cleaned up, they have 0 replicas while being created
[17:59:52] I think that must be it, I haven't seen it happen over the last few batches
[17:59:53] looks like eswiki_content and commonswiki_file both have 0 replicas
[18:00:06] checking what's live
[18:00:51] we're red again
[18:01:19] yea that's probably it, neither index is live (should always be the case for 0-replica indices) but they are generally populated. The live index looks to be newer than the 0-replica index in both cases
[18:01:23] is check_indices.py the best way to test this, or should I just `cat/indices` and compare to `cat/aliases`?
[18:01:51] hmm, check_indices.py should find it but will take a little while. I suppose worth checking since it does more checks
[18:02:00] right now i just looked at `curl https://search.svc.eqiad.wmnet:9243/_cat/indices | awk '$6 == 0 { print $0 }'` and repeated for 9443 and 9643
[18:02:21] then `curl https://search.svc.eqiad.wmnet:9243/_cat/aliases/eswiki_content` to see what's live
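(Those two checks can be combined into one loop over the three cluster ports, so any 0-replica index that no alias points at, i.e. a leftover reindex, stands out. A sketch assuming the default `_cat` column layout used above: column 3 of `_cat/indices` is the index name, column 6 the replica count, and column 2 of `_cat/aliases` is the concrete index an alias points at.)

```
# Flag 0-replica indices that no alias points at (likely abandoned reindexes).
for port in 9243 9443 9643; do
  aliases=$(curl -s "https://search.svc.eqiad.wmnet:${port}/_cat/aliases")
  curl -s "https://search.svc.eqiad.wmnet:${port}/_cat/indices" \
    | awk '$6 == 0 { print $3 }' \
    | while read -r idx; do
        # -w avoids matching eswiki_content against eswiki_content_1234567890
        if ! grep -qw "$idx" <<<"$aliases"; then
          echo "not live (no alias): ${idx} on port ${port}"
        fi
      done
done
```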
[18:03:06] * inflatador should really write this down, pretty sure I've done this a few times
[18:03:59] oops, need to eat lunch before pairing... back in a few
[18:06:38] wonder why the reindex didn't get cleaned up... we added bits to cirrus that are supposed to make this not happen :( Also these are surprisingly old, eswiki_content is june 1st, commonswiki_file is may 29th. I suppose that was probably near when i was writing the newer reindexing scripts
[18:07:13] check_indices.py reports a number of failed reindexes across the clusters, so yea something not working right in the cleanup stage
[18:27:56] WDQS documentation on Wikidata says Blazegraph is a "Complete implementation [of SPARQL 1.1 with] no significant deviations". Isn't this strictly not true because Blazegraph extends the standard with custom features?
[18:29:16] harej: hmm, probably. It sounds like something that was written in the early days as the intent
[18:29:43] ryankemper inflatador i did the gcal 'propose new time' thing for the pairing to start in 35.5 minutes from now, hoping that works. gonna hop on another call in a minute
[18:31:01] harej: a good observation. the spec may have a notion of extension (i recall reading about this, but forget the details). but it's definitely the case that not every other engine supports their extensions (they might supersede the need for some of their things, or they may not, but it depends)
[18:32:27] back
[18:32:37] Would "complete implementation of SPARQL 1.1 with custom extensions" be more "correct"?
[18:36:56] oh ryankemper inflatador i see MrG is the meeting owner and it's marked declined and so okay if we just meet in 30 minutes? i think that's what the backscroll suggests, but wanted to make sure that was the case
[18:37:58] dr0ptp4kt: yeah that's fine w me
[18:39:00] invitation on the way
[18:42:34] dr0ptp4kt already in pairing w/Erik, probably won't make y'all's pairing
[18:42:45] thx inflatador for heads up
[19:08:52] dr0ptp4kt: had to rejoin a few times, in now
[19:52:22] * ebernhardson realizes we could have envoy inject the auth header instead of doing it at the application level
[19:59:31] that might (?) make it easier to troubleshoot
[20:01:05] not sure, i'm still poking around to try and understand how to get envoy logging the outgoing connections, so far i'm only seeing it log the incoming connections from prometheus
[20:09:41] ebernhardson: there's a few different listener configs
[20:10:37] cdanis: indeed, i've just noticed that poking around, incoming ends up in /var/log/envoy, others to stdout. Many layers to this templating :)
[20:10:42] yeah
[20:10:44] :)
[20:11:36] inflatador: fyi i think we still have an ipblock active wrt requestctl. this one: `expression: ipblock@abuse/wdqs AND pattern@sites/wdqs` in commit `f0a6b1654682cc887279494d9ee2a7c5ca8c75b0`
[20:14:09] ryankemper ACK, we should remove it, but since we had some issues yesterday let's wait until tomorrow when e-lukey is around
[20:14:19] kk
[20:15:22] inflatador: I'm gonna go ahead and merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/976308. it's removing nginx bans from a couple years ago. i figure now's a good time to test if stuff blows up without those old blocks :)
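(The requestctl cleanup agreed on here would look roughly like the following on a puppetserver/cumin host. The action name is hypothetical, patterned on the expression ryankemper quoted at 20:11:36; the exact CLI verbs are worth verifying against https://wikitech.wikimedia.org/wiki/Requestctl.)

```
# List defined actions and find the stale wdqs block.
sudo requestctl get action

# Stop enforcing it (action path is illustrative), then commit so the
# change is actually rolled out to the edge caches.
sudo requestctl disable cache-text/block_abuse_wdqs
sudo requestctl commit
```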
[20:16:21] ryankemper yeah, that makes sense
[20:16:26] added my +1
[20:18:29] thx. puppet running now in 3 batches. so if we see wonkiness starting ~10 mins from now then we know what the cause is
[20:18:38] we do need to clean up our requestctl rules, looks like I never removed the previous throttling rule
[20:31:58] yeah, let's clean all that up tomorrow morning
[20:32:20] and also add a page to our documentation that explains how to do requestctl rules for ip block, ua block, ip+ua block
[20:34:59] maybe just link to https://wikitech.wikimedia.org/wiki/Requestctl unless there is something missing there?
[20:46:16] I'm thinking just something a little more handcrafted for our use case, and then a link to the general doc as well
[20:46:26] cuz we're only ever gonna use it for 3 things: ip block, ua block, ip+ua block
[21:02:20] pfischer: i don't know why chrome is having connectivity issues but other UAs are okay, but anyway: nice talking with you! have a good night
[21:03:13] dr0ptp4kt: Alright, thank you for your time!
[21:03:20] likewise!
[21:22:02] everyone's probably aware of this already, but there appears to be "a full editing outage" happening on wikipedia ATM. I'm also seeing a bunch of WDQS hosts in eqiad becoming unresponsive
[21:25:46] 🫣
[21:32:40] curious that wdqs fails, those things should be totally unrelated
[21:32:49] yeah, me too
[21:32:51] maybe the updater would fail due to not getting things, but blazegraph should be fine
[21:33:46] I'm gonna go ahead and failover to codfw, might not change anything but we're down 6 hosts ATM
[21:34:36] ebernhardson or anyone else, if you have concerns about DC failover speak now ;)
[21:34:52] ryankemper ^^
[21:34:59] well, worst case it's our friend with the killer queries, and it kills eqiad too :P
[21:35:11] (with bad timing that happens to coincide)
[21:35:13] but it's probably fine
[21:35:53] My own WDQS are updating quite easily because of the lack of edits
[21:36:27] lol
[21:37:04] Not sure it makes sense to compare my setup to yours though because you have the special updater and I don't
[21:38:13] ebernhardson ooh, good point re: bad queries. Let me try rebooting these hosts from OOB first and we'll failover if that doesn't help
[21:40:25] harej: it makes sense, if there aren't edits coming through then the updater should just be idle
[21:40:53] Makes me wonder though why yours are failing
[21:41:56] we've also had a problem with killer queries recently, someone found a bug in blazegraph we suspect. We sent them an email and asked them to stop, because they kept changing their UA after we banned them
[21:42:00] so maybe just a coincidence
[21:42:35] their email comes from taiwan, where it's 5:42am, but hard to say where they are located
[21:42:52] is it because of backend mwapi? looking at wdqs-blazegraph.log on wdqs1019
[21:43:18] dr0ptp4kt: hmm, can wdqs talk to mwapi directly? I thought it couldn't but there might be some custom service i'm unfamiliar with
[21:43:56] i have to do a school run now though, back in ~20min or so
[21:44:37] dr0ptp4kt hmm, I didn't realize that it talks to mwapi either, but the outage absolutely lines up with the bigger outage across WMF
[21:44:40] i do see more of what looks like an altered ua string on wdqs2007.codfw.wmnet with similar behavior as before, but i'm guessing that's unrelated
[21:45:40] we see patterns in all things, so who knows 😵‍💫 , but i'm thinking maybe https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual/MWAPI
[21:47:14] again, maybe that itself is also coincidental!
[21:49:19] wdqs101[3-5],1018,1020 are back up after I rebooted from DRAC
[22:06:17] dr0ptp4kt did you find anything re: wdqs/mwapi connections? I'm out for today/tomorrow but maybe we can kick around on Thurs? I started T372442 for this
[22:06:17] T372442: Determine if WDQS was affected by wikipedia editing outage/consider protections in similiar future scenarios - https://phabricator.wikimedia.org/T372442
[22:14:20] inflatador: not sure. yeah, i'd be interested to discuss thursday
[22:22:05] still finishing up errand, back in 20m to help out
[22:29:27] gotta run errands and prepare dinner - txt me if extra hot potatoes
[22:59:53] wdqs still looking healthy
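(The MWAPI federation dr0ptp4kt links at 21:45:40 does let a WDQS query call the MediaWiki API from inside Blazegraph via a SERVICE clause. A minimal example against the public endpoint, mirroring the documented EntitySearch usage rather than anything specific to this outage:)

```
# Ask WDQS to call the MediaWiki API through the wikibase:mwapi SERVICE.
curl -s 'https://query.wikidata.org/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=
    SELECT ?item WHERE {
      SERVICE wikibase:mwapi {
        bd:serviceParam wikibase:endpoint "www.wikidata.org";
                        wikibase:api "EntitySearch";
                        mwapi:search "cat";
                        mwapi:language "en".
        ?item wikibase:apiOutputItem mwapi:item.
      }
    } LIMIT 5'
```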