[06:48:37] We're in the middle of a big spike in WDQS queries. Spike started 20 mins after the wdqs deploy finished so I think it's unrelated to the deploy
[06:57:14] https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&from=1743481823320&to=1743490608846 here's the qps graph from around the time the deploy concluded
[06:59:33] I'm also confused why maxlag is firing because I've depooled the only two hosts that are experiencing high lag (wdqs2016 / wdqs2017)
[07:01:48] Oh nevermind, it's resolved right now. Weird that it fired again though
[07:04:47] ryankemper: go get some sleep, we'll keep an eye on things
[07:05:01] ack!
[07:05:01] Thanks for having a look!
[07:45:21] sigh... we nearly tripled the query rate on wdqs, from 10k/min to 30k/min
[07:45:55] it looks like there is an NPE on 2016-2017, which seems to block the updater.
[07:46:12] I'm opening a phab task, but I suspect we'll need you, dcausse, to have a look
[07:48:35] sure
[07:49:46] dcausse: T390665
[07:49:47] T390665: wdqs2016 and 2017 not consuming updates - https://phabricator.wikimedia.org/T390665
[07:53:25] looking
[07:54:36] we might need help from brouberol / stevemunene to identify this additional traffic and maybe block it.
[07:55:03] how can I do that?
[07:56:54] We have some documentation in https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Identifying_Abusive_Traffic. Basically, using Turnilo to see if we can identify an offender.
[07:57:30] I suspect that a 3x increase in traffic in such a short timeframe comes from a single user (maybe from multiple IPs).
[07:57:50] Depending on what we learn, we might use requestctl to block that traffic.
[07:58:05] brouberol: ping dcausse for additional help if needed.
[07:58:30] This isn't UBN yet, the problematic servers have been depooled, and the rest of the cluster seems to be able to handle the load.
[07:58:39] ack thanks
[07:58:45] gehel, brouberol: could we get https://gerrit.wikimedia.org/r/c/operations/puppet/+/1131940 merged? With depooled hosts not being updated we should not trigger wikidata max lag, but we do at the moment
[08:00:28] AS number 14618 seems to be over-represented in the hits
[08:00:33] dcausse looking
[08:01:57] so you want me to merge this patch but avoid deploying it on wdqs201{6,7} ?
[08:02:50] brouberol: no it's fine deploying everything, esp on wdqs201[67] :)
[08:03:29] blazegraph is in bad shape there...
[08:03:44] ok, I'm going to trust you on the patch and deploy
[08:04:08] it should just touch the nginx config
[08:05:10] are we good with running
[08:05:10] brouberol@cumin2002:~$ sudo cumin 'wdqs2*.codfw.wmnet' run-puppet-agent
[08:05:10] ?
[08:05:23] aka deploying the patch to the whole codfw wdqs fleet?
[08:06:25] brouberol: yes
[08:06:36] :spinner:
[08:07:18] what we want to see is this graph https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=41&from=now-3h&to=now being flat
[08:08:28] I'm seeing multiple IPs from AS 14618 hitting wdqs starting ~midnight
[08:09:41] brouberol: the bump started around ~5am: https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&refresh=1m&viewPanel=44&from=now-24h&to=now
[08:10:45] and it should be from IPs hitting codfw only
[08:11:05] Let's ban AWS!
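A rough illustration of the kind of check the runbook above describes. This is not the Turnilo workflow actually used during this incident, just a minimal shell sketch run directly on a wdqs host, assuming nginx writes the default combined access log to /var/log/nginx/access.log (both the path and the log format are assumptions):

```
# Count the top user agents in the last 100k requests to spot a single dominant client.
# With the default combined log format, field 6 when splitting on double quotes is the UA.
sudo tail -n 100000 /var/log/nginx/access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head -20
```

The same pipeline with a different field picks out the top client IPs; Turnilo over the webrequest data gives the same answer across the whole fleet once you know where to look.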
[08:13:18] oof indeed, that is AWS
[08:13:41] > and it should be from IPs hitting codfw only
[08:13:41] trying to figure out how to do this in Turnilo, which is
[08:14:30] We might have a user agent for those requests, and we might be able to ban from that user agent.
[08:17:17] I'm seeing the UA `axios/0.22.0` being the source of many requests, starting from 0:00 UTC and peaking again between 5 and 6 AM UTC
[08:18:25] brouberol: see in #sec, I found a UA mentioning AWS
[08:18:46] let's move to #sec
[08:43:52] going to revert what was deployed last night
[08:44:49] the new deletion query appears to be way too slow
[08:45:40] running scap
[08:56:17] o/
[08:56:21] o/
[08:57:15] Not urgent, but I got a bit lost in the thread (just back from a long weekend :P). Are we good with the mjolnir internal range as implemented in https://gitlab.wikimedia.org/repos/data-engineering/airflow-dags/-/merge_requests/1186. Can I merge and test?
[08:58:21] gmodena: sure, but that might need a rebase, we've fixed a bunch of fixtures while fixing execution_date in unit tests
[08:58:52] brouberol: I might need your help still, could you run "systemctl restart wdqs-updater.service" on the whole wdqs fleet?
[08:58:57] dcausse ack. I saw there are a few changes.
[08:59:49] dcausse: I can do that
[08:59:53] one by one I take it?
[09:00:03] with a bit of delay between each host?
[09:00:29] brouberol: no not really, you can batch multiple ones at once with no delay
[09:01:48] alright then
[09:03:20] dcausse: 1:1 or are you on the WDQS issues?
[09:03:29] dcausse: brouberol@cumin2002:~$ sudo cumin -b 3 -s 15 'wdqs*.wmnet' 'systemctl restart wdqs-updater.service'
[09:03:36] gehel: I can join
[09:03:47] brouberol: +1
[09:04:59] merging/rebasing in this mono repo always feels a bit sketchy :|. Let's see what CI says.
[09:15:18] dcausse: the rolling restart is all done
[09:15:44] brouberol: thanks!
[09:15:51] np
[10:03:31] lunch
[10:19:33] an extra restart of blazegraph was required...
[10:20:20] we should be back to normal operations in a couple of minutes
[10:23:39] ok, all alerts should resolve
[10:24:19] going back to lunch
[11:56:11] wow, there's a big mistake in the new query to delete lexemes... could have deleted all lexeme data...
[11:56:30] explains why blazegraph got stuck...
[13:13:28] o/
[13:14:46] quick errand
[13:14:50] Sounds like WDQS is having some issues...LMK if I can do anything to help
[13:52:41] inflatador: yes, some trouble today: a serious bug I introduced caused the updates to stall, and a bot from Amazon tripled the query rate. It was hard to figure out what was going on, but it's back to normal now
[14:03:26] Trey314159 got a minute to help with a regex? ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/1132772/3/hieradata/regex.yaml#2
[14:09:07] not urgent BTW
[14:10:24] \o
[14:10:40] going to test a snapshot build on wdqs2025 (test machine) to make sure I got that delete query right this time
[14:10:42] o/
[14:10:54] It works according to https://regexp.online/ but just wanted to make sure
[14:24:59] inflatador: that regex looks straightforward enough. Matches 6 elastic hosts and 1 cirrus host. The parens around 055 aren't strictly required, but they make sense to have if you expect to possibly add more cirrus hosts in the future. Looks good!
[14:25:21] Trey314159 excellent, thanks for taking a look!
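The patch under review isn't quoted in the log, so as a purely hypothetical illustration of the point about the parens (hostnames and pattern are made up, not taken from the actual change): a group around a single alternative does nothing today, but makes the pattern trivial to extend when more hosts are added.

```
# Made-up hostnames and pattern, for illustration only.
printf '%s\n' cirrussearch2055.codfw.wmnet cirrussearch2056.codfw.wmnet \
  | grep -E '^cirrussearch2(055)\.codfw\.wmnet$'         # matches only ...2055
printf '%s\n' cirrussearch2055.codfw.wmnet cirrussearch2056.codfw.wmnet \
  | grep -E '^cirrussearch2(055|056)\.codfw\.wmnet$'     # extended: matches both
```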
[14:32:55] deployed a test build to wdqs2025:/srv/deployment/wdqs/wdqs/lib/streaming-updater-consumer-0.3.150-jar-with-dependencies.jar
[14:58:30] ok, seems to work ok this time
[15:25:08] ryankemper can you join our pairing?
[15:25:20] yeah
[15:36:50] random thought: is there any harm in always running Exec[nginx-reload] on a puppet run, instead of only when refreshed?
[15:37:12] thinking about the TLS cert that expired, and how the Exec[nginx-reload] wasn't run because of a transient error. Ideally it seems that would have been run in a future run of puppet
[15:37:51] Otherwise i suppose something could try and compare the TLS cert nginx responds with against the one on disk and refresh if there is a mismatch, but always reloading seems easier
[15:41:09] I have no clue... what was this transient error?
[15:41:19] Failed to open TCP connection to puppetserver1003.eqiad.wmnet:8140 (Connection refused - connect(2) for "puppetserver1003.eqiad.wmnet" port 8140)
[15:41:44] basically while it was running the nginx module there was a temporary problem connecting to puppetserver, which caused the Exec[nginx-reload] to be unscheduled due to dependencies
[15:41:54] so the cert got replaced, but the reload never ran
[15:42:35] weird... should puppet have failed altogether and marked the host as unhealthy?
[15:43:00] puppet did fail, but the next puppet run was fine. The problem is the cert was already replaced by then, so it didn't know it needed to reload nginx on future runs
[15:43:13] Basically cert replaced -> puppet failed -> no nginx-reload
[15:43:39] sampling of logs: https://phabricator.wikimedia.org/T390599#10695916
[15:44:32] ah, puppet being non-transactional, it did not revert to the old cert so did not detect that change on the second run
[15:45:06] i suppose in my ideal world, puppet would have collected all the file metadata with the initial catalog and wouldn't change anything on the server until it has all the things it needs, but apparently it doesn't work that way :P
[15:46:25] I have no clue how bad it'd be to run nginx-reload every 30 mins? the few times I used it it was pretty fast
[15:46:26] yea, non-transactional is a good description of the problem
[15:47:00] but it was on hosts with what I'd consider low traffic, like a wdqs node
[15:47:29] in theory nginx-reload should be a noop most of the time, no?
[15:48:10] i think it will still gracefully restart regardless, IIRC it essentially stops accepting new work then restarts itself while keeping the sockets open and passing them on to the new process
[15:48:31] (at least, that's how servers did it 25 years ago, i might be out of date :P)
[15:48:39] but nginx is also 20-ish years old
[15:49:05] i dunno, maybe not worth worrying about...but seems like something that should be solvable with automation
[15:51:51] reading the reload doc, you're right that it does all this gracefully but it involves closing connections
[15:53:16] maybe the right answer then is something that compares the tls cert from nginx against the tls cert on disk...will see if that's hard
[15:53:42] but might require extra knowledge about how the certs are used, likely the cert renewal doesn't know what port for example
[15:54:19] all the liveness sensors are targeting the plain http port?
[15:54:50] not sure, i'll have to poke around puppet. I never fully understood how the tls bits interact
[15:55:41] we do have cert expiration alerts BTW...my team got them on Mar 27th, we just didn't act in time
[15:57:26] inflatador: sure, but i don't like having to react to alerts, it's failure prone :) I've found a command that will give us a fingerprint of the running cert based on port + certname, can maybe do something with that. A small bash script that takes port+name+cert path and exits with 0 or 1
[16:23:52] workout, back in ~45
[16:51:40] have a script that does the comparison now, but not sure where it fits in puppet :P (a rough sketch of the idea is at the end of this log)
[16:53:21] dinner
[17:31:51] quick patch for fixing up the cirrus profile if anyone has a chance to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133191
[17:34:40] lunch, back in 45
[18:20:35] back
[18:42:52] one more puppet patch for fixing hiera if anyone has time to look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133211
[18:42:56] probably won't be the last ;(
[18:44:11] inflatador: looks to match the role name, +1
[18:46:23] ebernhardson thanks for the +1, let's see how far we get this time ;)
[19:22:21] One more! https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133220
[19:24:01] lol
[19:24:39] fun, but i suppose there is no easy way around that
[19:30:07] yeah, I guess the relative lack of problems with cloudelastic/relforge made me a bit complacent
[19:30:59] and yeah, I agree re: your comment on the CR. We'll have to migrate at some point...I started to look at it until i realized it was touching LVS ;P
[19:34:26] migrating live things is always fun...will see :)
[20:06:12] One more that should hopefully get us to the next error: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133234 cc ryankemper
[20:06:32] +1'd
[20:11:12] ACK, thanks!
[20:17:45] * ebernhardson notices that ttmserver is configured to only create a single replica
[20:18:09] although curiously, the prod indices have 2
[20:18:25] but only in eqiad, and not codfw...fun :P
[20:21:19] whoops! We should probably bump that up
[20:24:27] BTW, we're getting puppet errors like `Function lookup() did not find a value for the name 'profile::opensearch::cirrus::enable_remote_search'` now...it looks like we're going to have some problems if we use `role::cirrus::opensearch` with `profile::opensearch::cirrus`
[20:24:53] yea i'll make a patch, i'm noticing various config things while looking over configs to migrate cirrus read traffic to search.discovery.wmnet. Probably worth a ticket to review multiple things (unused config vars, low replicas in ttmserver, maybe more)
[20:26:54] inflatador: isn't that just missing the appropriate hiera config in the role file? seems like it should work
[20:30:48] ebernhardson you might be right...checking. So this is the role file https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/role/manifests/cirrus/opensearch.pp, which pulls in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/profile/manifests/opensearch/cirrus/server.pp
[20:32:08] which points to a typo: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/hieradata/role/common/cirrus/opensearch.yaml#76
[20:32:33] ahh, yea that will do it
[20:35:39] CR to fix this: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133242 cc: ryankemper
[20:36:43] * ebernhardson is finding the deployment-prep vs prod configuration variations in wmf-config tedious...
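For double-checking that kind of hiera fix once it is merged, `puppet lookup` on a puppetserver can confirm that the key resolves for an affected node. A minimal sketch; the node name is just an example taken from later in this log, and any environment-specific flags are omitted:

```
# Run on a puppetserver after merging the fix; a non-empty result means the key
# now resolves in that host's hiera hierarchy instead of raising a lookup() error.
sudo puppet lookup --node cirrussearch2055.codfw.wmnet \
  profile::opensearch::cirrus::enable_remote_search
```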
[20:51:52] Ibid. https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133243
[22:08:26] we got the puppet catalog to compile
[22:17:56] ryankemper cirrussearch2055 is reachable through SSH if you wanna check it out. Looks like opensearch isn't able to start
[22:18:19] ack, will look
[22:18:45] cool, headed out for the day. keep us posted!
[22:28:39] whole bunch of errors. hard to tell which are independent versus occurring because other stuff is broken https://usercontent.irccloud-cdn.com/file/JFHIoKUV/opensearch_errors.log
[22:31:18] Starting with the first error `OpenSearchJsonLayout contains invalid attributes "compact", "complete"`. Might be related to this config: `modules/opensearch/templates/log4j2_1.properties.erb:appender.ship_to_logstash.layout.compact=true`
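Circling back to the nginx cert/reload thread from earlier in the afternoon (~15:57 and 16:51): a minimal sketch of the script described there, which compares the cert nginx is actually serving on a port with the cert file on disk and exits 0/1. Paths, argument handling, and where it would hook into puppet are all left open; this illustrates the approach, it is not the script that was actually written.

```
#!/bin/bash
# Does the cert nginx serves on a given port match the cert file puppet manages on disk?
# Exit 0 if the fingerprints match, 1 if they differ (i.e. a reload was probably missed).
set -euo pipefail
port="$1"       # e.g. 443
name="$2"       # SNI/servername to request
certfile="$3"   # path to the on-disk certificate

served=$(echo | openssl s_client -connect "localhost:${port}" -servername "${name}" 2>/dev/null \
           | openssl x509 -noout -fingerprint -sha256)
ondisk=$(openssl x509 -noout -fingerprint -sha256 -in "${certfile}")

[ "${served}" = "${ondisk}" ]
```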