[08:47:41] A short GitLab maintenance will happen at 11:30 UTC
[11:36:53] GitLab maintenance finished
[14:16:37] does anyone have time to help me ID an abusive query for WDQS? Haven't had to do this in ~6mo and I'm a bit rusty
[14:16:58] * inflatador is looking thru superset and turnilo per https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Identifying_Abusive_Traffic
[14:26:25] the reimage cookbook failed at sync-netbox-hiera.py, is it a known issue?
[14:26:53] should not be, no
[14:26:56] what's the error?
[14:27:16] wmflib.interactive.InputError: Too many invalid answers
[14:27:26] and before that?
[14:27:39] it asked for manual confirmation, I don't remember it doing that the last time I used it
[14:27:49] can you paste the full output?
[14:27:52] sure
[14:28:10] which cumin host were you running this on? I can just tail the logs
[14:29:14] https://phabricator.wikimedia.org/P67253
[14:29:16] cumin1002
[14:29:40] yeah, your user response is "", it should be "go"
[14:29:49] that diff looks reasonable to me
[14:29:51] like a literal "go", not just a blank Enter
[14:29:51] I probably just pressed Enter without noticing
[14:29:54] yeah
[14:29:56] but when did it start asking that?
[14:30:11] it's quite standard output for netbox changes
[14:30:16] so it should have been there for a while
[14:30:21] I reimaged other hosts and I don't remember ever typing "go"
[14:30:37] probably because you were reimaging existing hosts that had the netbox status set
[14:30:37] maybe the netbox state was different?
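The confirmation flow described above — a literal "go" required, with a bare Enter counting as an invalid answer until an InputError is raised — can be sketched roughly like this (a minimal illustration of the behavior, not the actual wmflib implementation; the function name, signature, and attempt limit are assumptions):

```python
class InputError(Exception):
    """Raised when the operator gives too many invalid answers."""


def ask_confirmation(prompt, reader=input, max_attempts=3):
    """Ask the operator to type the literal word 'go' (or 'abort').

    A bare Enter counts as an invalid answer; after max_attempts invalid
    answers, InputError("Too many invalid answers") is raised -- matching
    the failure seen in the paste above.
    """
    for _ in range(max_attempts):
        answer = reader(f"{prompt} > ").strip()
        if answer == "go":
            return
        if answer == "abort":
            raise InputError("Confirmation manually aborted")
        print("Invalid answer, please type 'go' or 'abort'")
    raise InputError("Too many invalid answers")
```

This is why pressing Enter without noticing produces the `InputError` rather than silently proceeding: the empty string never matches the literal "go".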
[14:30:40] whereas this is a new one
[14:30:43] yeah
[14:30:45] not a new one though
[14:30:50] so it's strange it was set to "failed"
[14:30:59] I just upgraded it to bookworm, but it was in service
[14:31:16] XioNoX was fixing some issue with that script earlier today, but I'm not sure what
[14:32:15] I'll try running "sre.puppet.sync-netbox-hiera" manually
[14:32:34] https://netbox.wikimedia.org/extras/changelog/183322/
[14:33:01] it shows that the change was made though, hmm
[14:33:18] dhinus: oh, looks like you set it to 'failed' back in June? https://netbox.wikimedia.org/extras/changelog/176037/
[14:33:32] yep, just seen that
[14:33:42] it did have some issue back then I guess
[14:33:56] and I probably forgot to revert that
[14:34:01] that explains it then
[14:34:17] now I'm confused about why the change today is appearing, even though I didn't type "go"
[14:34:29] yeah that's a weird one, it should have failed to commit the changes
[14:34:39] unless someone else ran the cookbook in the meantime or synced the changes
[14:35:07] I don't see any other cookbook runs for sync-netbox-hiera in SAL though
[14:35:27] I see other reimages for other hosts in SAL
[14:35:48] I think sync-netbox-hiera is not logged when it's called from the reimage cookbook
[14:35:55] there are ml-serve ones but no independently-logged sync-netbox-hiera runs
[14:36:26] no, you're right, it's logged even when called from the other one
[14:36:34] seems like it is: https://sal.toolforge.org/production?p=0&q=puppet.sync-netbox-hiera+&d= (follows a reimage)
[14:36:41] so I don't know...
[14:37:24] I think we found the abusive IP, do I just add the IPs to `blocked_nets.yaml` and run `requestctl sync`?
[14:39:26] inflatador: this is for?
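Conceptually, an entry in `blocked_nets.yaml` is a CIDR range checked against the client address. A minimal sketch of that matching logic (an illustration only, not the actual requestctl code; the ranges in the usage are made-up documentation prefixes):

```python
import ipaddress


def is_blocked(client_ip, blocked_nets):
    """Return True if client_ip falls inside any of the blocked CIDR ranges.

    Works for both IPv4 and IPv6; membership tests across address
    families simply evaluate to False.
    """
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(net) for net in blocked_nets)
```

For example, `is_blocked("192.0.2.7", ["192.0.2.0/24"])` is true while an address outside every listed range is not.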
[14:40:12] sukhe: someone is hammering WDQS in a way that doesn't show up in query logs
[14:40:25] causing WDQS codfw to flap, more details in https://etherpad.wikimedia.org/p/wdqs-2024-08-08
[14:41:21] inflatador: that will block them from all Wikimedia services, which is probably not what you want
[14:42:09] ^
[14:43:25] dhinus: the reimage cookbook sets the status to active once it ran properly
[14:43:46] dhinus: and the hiera "go" is standard and comes up each time there is a status change in Netbox
[14:44:22] cdanis: ACK, what is the recommended strategy? I guess I could block them at the nginx level?
[14:44:46] inflatador: that is what wdqs has generally done in the past AIUI. you could use requestctl if you wanted, but it would be more moving parts
[14:45:03] as in, the nginx running on individual WDQS hosts? We haven't found a smoking gun in turnilo/superset so we're more at the "ban this and see what happens" phase
[14:45:15] roughly following the steps at https://wikitech.wikimedia.org/wiki/Requestctl#Quick_start:_adding_a_new_rule but adding instead an ipblock for the one attacker IP, and a pattern to match WDQS traffic (probably just by hostname)
[14:45:53] XioNoX: yep, it all makes sense now. the only thing I'm not sure about is what updated the status, given I didn't type "go". the changelog in netbox says it was "sre_bot"
[14:46:09] inflatador: I'm not at all up-to-date on WDQS but as of a few years ago it was common to put that directly in the nginx config, I think
[14:46:16] dhinus: "the reimage cookbook sets the status to active once it ran properly" :)
[14:46:21] dhinus: from the logs, it seems that step is performed by the cookbook, and given it's the only change, it's accepted
[14:47:25] cdanis: there are some user-agent blocks in the wdqs config, although they're pretty old
[14:48:32] if you want an easy option now, just live-hack it into the nginx config on the host
[14:48:46] dhinus: https://github.com/wikimedia/operations-cookbooks/blob/master/cookbooks/sre/hosts/reimage.py#L720
[14:49:33] XioNoX: sukhe: ok thanks, I see it in the logs now
[14:49:37] dhinus: typing the 'go' is asking for confirmation to merge the netbox state back into the hiera files, where it affects production -- not to make the write to netbox in the first place
[14:49:53] so I still need to run sre.puppet.sync-netbox-hiera manually
[14:50:20] dhinus: doesn't hurt to run it and see if there is an outstanding diff
[14:50:24] dhinus: yes, or the next person running the cookbook can merge your changes, but either way
[14:50:53] better to leave it clean for the next person :)
[14:50:54] cdanis: requestctl is easier since it gets all WDQS hosts. Is it common/acceptable to use requestctl in a trial-and-error situation? If not I'll hack something together to hit nginx
[14:51:15] inflatador: totally fine, just remember it affects all usages of the CDN for any affected queries
[14:52:49] XioNoX: sukhe: done! thanks for your help :)
[14:53:16] thanks for fixing it :)
[14:53:34] inflatador: if you aren't sure, you can always use `log_matching` and wait a few minutes, then look at superset to see what actually matched
[14:55:52] cdanis: ACK. What about global rate-limiting by UA? envoy/nginx on the WDQS config side?
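A per-User-Agent rate limit of the kind asked about above is typically some token-bucket variant: each UA gets a bucket that refills at a steady rate and allows short bursts. A minimal, self-contained sketch (an illustration only; the class name, parameters, and injectable clock are assumptions, not any envoy/nginx/requestctl API):

```python
import time
from collections import defaultdict


class UARateLimiter:
    """Token bucket per User-Agent: allow `rate` requests/second on
    average, with bursts of up to `burst` requests."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = defaultdict(lambda: burst)  # each UA starts full
        self.last = {}

    def allow(self, user_agent):
        """Return True if this request is within the UA's budget."""
        now = self.clock()
        elapsed = now - self.last.get(user_agent, now)
        self.last[user_agent] = now
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens[user_agent] = min(
            self.burst, self.tokens[user_agent] + elapsed * self.rate
        )
        if self.tokens[user_agent] >= 1:
            self.tokens[user_agent] -= 1
            return True
        return False
```

One design note: keying only on the UA string is easy to evade (UAs are client-controlled), which is part of why the discussion above also considers IP-based blocks.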
[14:56:24] either that or requestctl are fine for that
[14:59:35] ah, I see now, thanks again
[15:00:04] to be clear, requestctl is also fine for your original question
[15:01:18] inflatador: what cdanis is saying: we can also do requestctl based on client IP and hostname, not just client IP
[15:01:29] ^
[15:04:52] ACK, we're still poking at the data. Will keep y'all updated. Thanks again for the advice
[15:07:05] inflatador: happy to do the requestctl rule fwiw (I am on-call)
[15:11:43] np. This apparently has been happening a bit longer than we thought, still going down the rabbit hole ;)
[15:39:45] godog: I'm looking at Grafana Cloud to see if there's any clue in their docs or pricing about what they might do (store all raw data? charge for longer raw retention? force creation of separate aggregation rules? some kind of longer retention at reduced intervals?)
[15:41:47] godog: it seems to mostly talk about raw data and retention thereof, with a new "in preview" feature that drops or combines unused labels, and then they mention their own Mimir service as a way to do long-term storage in a way that's allegedly "fast"
[15:44:12] Mimir is kind of like Thanos, but with a different architecture, and has some special compaction mode that claims to be able to support very high cardinality
[15:44:16] AIUI
[15:46:00] indeed, that's my understanding as well
[15:48:37] I wonder if there's something clever they do about rate_interval or otherwise to make dashboards more likely to work transparently on both recent data, "last year" at once in a way that's fast, and e.g. looking at a week two years ago and have it work at all
[15:49:10] I guess maybe they don't go above a 5m interval, or do so only when the 1h and 5m resolutions have the same retention
[15:49:54] good questions yeah
[15:50:24] * godog logging (hah!) off for the day
[15:50:41] It still seems odd to me that rate_interval works for them, given that in practice it's 4x the scrape interval of the source, which is often between 15s and 1min, so at most 4min -- but that's below 5min
[16:13:26] Could somebody help me get logged into https://wikitech.wikimedia.org/ ? I've tried all the passwords I think it might be, but haven't been able to find the right combination. I tried going through the recovery flow at https://idm.wikimedia.org/wikimedia/password/ but can't get that to work either.
[16:14:04] I do have an email telling me I've made 5 attempts. Did that lock out my account?
[16:19:52] Am I supposed to be using the same credentials I use on https://toolsadmin.wikimedia.org/?
[16:51:34] roy649: wikitech wiki is in the process of becoming a normal SUL wiki (so same login as on other wikis), but FOR NOW it's still the developer (LDAP) account. I just confirmed the login still works for me with that. what happens on idm.wikimedia.org?
[16:54:28] roy649: maybe it's not the password but the user name that's wrong? what user name did you try?
[17:27:06] [cross-post from -serviceops] Can anyone here help me unravel the mystery of a helmfile lint failure that seems unrelated to the change under test? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1060899 https://integration.wikimedia.org/ci/job/helm-lint/19680/console
[18:10:59] mutante: sukhe: just a heads up in case it goes over the paging threshold -- mw-web in eqiad is running quite hot since 15:45, dunno why https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&refresh=1m&from=now-24h&to=now&viewPanel=84
[18:11:19] cdanis: ouch, ok. thanks for sharing!
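For reference on the rate_interval point above: as I understand Grafana's documented heuristic, `$__rate_interval` evaluates to max(query interval + scrape interval, 4 × scrape interval), which is why a 15s–1min scrape interval yields roughly 1–4 minutes on short dashboard ranges. A small sketch of that arithmetic (assuming that formula; verify against Grafana's own docs):

```python
def rate_interval(query_interval_s, scrape_interval_s):
    """Approximation of Grafana's $__rate_interval, in seconds:
    max(query interval + scrape interval, 4 * scrape interval).
    For short ranges the 4x term dominates; for long ranges the
    query interval does, so the window widens automatically."""
    return max(query_interval_s + scrape_interval_s, 4 * scrape_interval_s)
```

With a 15s scrape and a short range this gives 60s (4 × 15s); with a 1h query interval it gives 3615s, so old, widely-spaced data still yields at least one sample pair per window.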
[18:11:52] if it self-resolves as it has in the past, great; if not, we will see :)
[18:15:32] the magic dust worked so far
[18:16:02] I mean the resolve came in, but it doesn't seem to be getting any better
[20:08:16] jhathaway: can you give https://gerrit.wikimedia.org/r/c/operations/puppet/+/1060919 a look? Do we run puppetserver-deploy-code as root anywhere?
[22:22:48] How do I join the SREs?
[22:25:55] Justman10000: you can volunteer; almost all the code and tasks are public. also, check https://wikimediafoundation.org/about/jobs/#section-8 every once in a while
[22:27:25] Justman10000: https://phabricator.wikimedia.org/tag/sre/
[22:32:53] mutante: And how do I volunteer?
[22:34:36] Justman10000: one way would be to find an open ticket that you want to work on and/or upload a patch that fixes something
[22:36:49] a good chunk of the code that manages servers/services is in this repo: https://gerrit.wikimedia.org/r/q/project:operations/puppet
[22:40:31] What if I want to work on the configuration of the permissions? Or install and configure new extensions/skins? Optimize existing configurations?
[22:50:25] mutante
[22:51:23] Justman10000: the repo for mediawiki config is at https://gerrit.wikimedia.org/r/q/project:operations/mediawiki-config
[22:52:11] Justman10000: you can also use the cloud environment for testing and setting up tools: https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_introduction
[22:54:07] mutante: And how would I join the SRE team? So, with SSH access and such?
[22:55:28] Justman10000: you would create an access request ticket in Phabricator, have a reason for it, and then sign a volunteer NDA
[22:56:17] but I would suggest you start with the cloud stuff described in the link above
[22:56:30] you can get a shell on a VM that way
[22:57:38] see https://wikitech.wikimedia.org/wiki/Help:Cloud_VPS
[22:57:46] I gotta run, cu later