[08:30:39] hmmm I got a failure on the puppet reimage of lvs6001 due to a failure of a puppet run on cumin1002 and that triggered that the cookbook revoked lvs6001 puppet cert if I'm properly reading the cookbook output [08:31:08] https://www.irccloud.com/pastebin/yeMNMSp8/ [08:31:39] volans: is this expected behavior? the host was basically ok and now I need to reimage it again :_) [08:32:13] vgutierrez: doens't mean forcely that you have to do it [08:36:00] * vgutierrez wondering why that puppet-run failed [08:36:22] vgutierrez: I'm checking between two options what's the best for you [08:37:33] vgutierrez: I don't see a failure on puppetboard so I think run-puppet-agent timed out while waiting for the lock due to another puppet runnng [08:37:48] in 3 seconds? [08:38:10] `00:03<00:00, 3.20s/hosts` [08:38:57] yeah weird [08:39:01] rc=1 [08:39:07] so no ssh problem [08:39:37] started 08:20:09,353 failed at 08:20:12,540 [08:40:11] Mar 13 08:20:12 cumin1002 puppet-agent[3742273]: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists) [08:40:15] race condition? [08:40:59] most likely between the 'wait_for_puppet' check in run-puppet-agent and when it actually runs puppet [08:41:05] lucky you! :D [08:41:46] lovely, so the reimage cookbook stopped before rebooting the host [08:41:52] so for the reimage, two options, one is to re-run it with --no-pxe (and I think you need --new) and it should work unless it fails some pre-condition. Other option [08:43:57] sre.puppet.renew-cert with "D{lvs6001.drmrs.wmnet}" (direct backend to bypass the fact it's not in puppetdb already) [08:45:03] and then a manual reboot a manual check it run puppet at boot successfully and a manual run of the netbox script https://netbox.wikimedia.org/extras/scripts/2/ [08:45:31] I'll send a fix for the reimage to not remove it from puppetdb past the first puppet run [08:46:58] trying with --no-pxe [08:47:08] I had to manually skip the downtime of the host but nothing too bad so far [08:47:19] finger crossed [08:48:40] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1127460 [09:04:13] hmm host seems good, not so sure about netbox [09:04:30] https://www.irccloud.com/pastebin/UJz7xxvE/ [09:06:12] volans: hmm actually we got the same problem on lvs6002 and in lvs6003 [09:06:23] 301 topranks [09:06:26] netbox has the names of the old interfaces (bullseye) rather than the new ones (bookworm) [09:06:43] it's probably because of the vlans specific to LVS hosts [09:06:47] but I don't remember seeing this error before [09:09:33] hmm thanks [09:09:40] :) [09:09:45] * vgutierrez hides [09:09:47] I thought we fixed this already [09:10:00] let me have a quick look [09:14:00] at least all lvs@drmrs are impacted [09:14:20] dunno about ulsfo, eqsin and magru but those could be affected as well [09:18:00] yeah fwiw the issue is because they have the interface relations properly modelled in Netbox [09:18:08] I assume more recently imported than the others [09:18:11] https://usercontent.irccloud-cdn.com/file/ELFiK5Vi/image.png [09:18:36] the script is trying to delete enp175s0f0np0, but the db is throwing an error as the vlan interfaces are children of it [09:18:55] so we should first create the new interface, move the vlan there and then delete? [09:19:01] or just rname the existing interface? 
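Going back to the puppet failure at the top of the log: the race is between run-puppet-agent's wait_for_puppet check (which saw no agent running) and the moment it actually starts puppet, by which point another agent run had grabbed /var/lib/puppet/state/agent_catalog_run.lock. The real fix went into the reimage cookbook (the gerrit change above); the sketch below is only a generic illustration of tolerating that race with a retry, invoking run-puppet-agent over plain ssh rather than through cumin/spicerack as the cookbook does.

```python
import subprocess
import time

LOCK_MSG = "Run of Puppet configuration client already in progress"

def run_puppet_with_retry(host: str, attempts: int = 5, delay: int = 30) -> None:
    """Run puppet on a host, retrying when another agent run holds the lock."""
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["ssh", host, "run-puppet-agent"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return
        output = proc.stdout + proc.stderr
        if LOCK_MSG in output and attempt < attempts:
            # Another run (e.g. the regular puppet timer) won the race:
            # wait for it to finish instead of failing the whole reimage.
            time.sleep(delay)
            continue
        raise RuntimeError(f"puppet run on {host} failed: {output.strip()}")
```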
[09:19:03] I'll have to fix up the script, it won't take long but need a short amount of time [09:19:26] I think it's easier to delete & re-create, as it covers all potential changes in the setup of vlans on a host [09:19:36] which can even be completely new hardware for instance [09:20:12] if we try to design it to migrate between two "well understood" interface setups that we expect it's probably more trouble to maintain, if we ever have something else down the road [09:21:34] topranks: if we rename those 3 hosts's main interface manually and then re-run the script would it work? [09:21:52] I'll fix the script cos it needs to happen [09:21:59] ok [09:22:05] if you're blocked now yep, we can just delete all the interfaces on the host (bar mgmt) in Netbox [09:22:19] well.... what stage are you at now, is the reimage ongoing? [09:22:24] or does it need to be restarted? [09:22:26] it's finished [09:22:43] I think [09:22:48] netbox sync is the last bit [09:23:17] ok, so we only need to re-run the puppetdb import script? [09:23:58] yes when it's ready with the fix [09:24:06] v.alentin can confirm though [09:24:24] ok.... fwiw I moved to fast and already deleted the secondary ints on lvs6001 [09:24:37] but yep I'll prep the fix and test with netbox-next which is the same [09:25:13] vgutierrez: I also deleted the secondary ints on lvs6002 and lvs6003 manually, so you won't get the issue with them even if before the fix is in place [09:35:04] ack [09:47:29] topranks: that's the first part of the fix BTW, right now in netbox we got the old ifaces [09:48:02] the script just imports what's in puppetdb, so that's where the problem is if it's pulling in the old names [09:48:10] if it's not been run yet that's expected [09:48:31] enp175s0f0np0 instead of ens3f0np0 [09:48:35] hmm so I need to run it manually? [09:51:03] nah I'll tidy that up when I have the new script don't worry about that [09:51:14] I only deleted the stuff manually to unblock you if it was needed [09:51:24] btw - I don't see any vlan sub-interfaces on lvs6001 right now? [09:51:43] (in my testing they weren't re-created properly, but it's because they are not in puppetdb, because they are not configured on the server) [09:53:05] topranks: no vlans needed.. IPIP magic [09:53:24] I think I mentioned it to you the other day that liberica roles aren't creating the vlans [09:53:27] ok... well there you go, better to delete them than rename cos they don't exist anymore [09:53:32] ok [09:53:40] that's good [09:53:42] as it's a requirement for liberica to use IPIP [09:53:51] (so we can switch back and forth between IPVS and katran) [09:57:02] actually that is what the difference is here [09:57:17] the script already catered for the scenario where the parent_int of the vlans changed [09:57:32] it would process all of those normally before it got to deleting non-existant interfaces [09:57:50] but here - as there are no longer any vlan ints on the box - the old vlan ints were left untouched [09:58:07] and then when it tried to delete the old physical int (with old wrong name), it failed [10:31:51] topranks: so that's my fault, please increase your beer counter [10:32:26] <_joe_> oh you have beer counters? 
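A rough sketch of the delete-and-recreate ordering just described, with pynetbox: drop the stale VLAN sub-interfaces before the physical interface they hang off, so the database does not reject the delete the way it did for enp175s0f0np0. This is not the actual Netbox import script (which runs as a Netbox custom script); the URL, token and device name are placeholders, and it assumes a NetBox version where a sub-interface references its parent via the `parent` field.

```python
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")  # placeholders

def delete_stale_interfaces(device_name: str) -> None:
    """Delete all non-mgmt interfaces of a device, VLAN sub-interfaces first."""
    stale = [
        iface for iface in nb.dcim.interfaces.filter(device=device_name)
        if not iface.mgmt_only
    ]
    # Children first: deleting a parent that still has sub-interfaces is
    # exactly what raised the error in the log, so order matters.
    for iface in sorted(stale, key=lambda i: i.parent is None):
        iface.delete()
    # The import step would then recreate interfaces from the PuppetDB facts
    # (the new bookworm names such as ens3f0np0) and re-attach their IPs.

delete_stale_interfaces("lvs6001")
```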
[10:40:59] _joe_: https://en.wikipedia.org/wiki/Two_pound_coin#/media/File:British_two_pound_coin_2016_obverse.png is what they look like :) [10:41:23] * Emperor definitely showing their age here, it's been years since you could actually buy a pint with a beer token [11:05:14] <_joe_> I was about to ask [11:36:53] so we have another bad rollout that is related with yesterday's incident turns out [11:37:05] I set up an alert on the status page errors still ongoing [11:37:17] just open search had lag [11:37:25] Amir1: pinging you as we are discussing it [11:37:38] scap has rolled back and we have restarted jobqueue [11:37:53] joe suspects a misbehaving job [11:38:29] so I will make another rollout attempt by leaving out the jobrunners [11:38:43] some correlation as regards jobs fairly significant p99 spikes for most jobs when the deploy starts - https://grafana.wikimedia.org/goto/3Rze5O2NR?orgId=1 [11:38:47] jynus: status? [11:38:57] <_joe_> hnowlan: that's a consequence yes [11:39:10] effie: the same as yesterday [11:39:19] the initial peak was bad, but errors still ongoing [11:39:30] jynus: is it safe to assume that it will recover eventually ? [11:39:39] not at the moment [11:39:46] shite [11:40:26] es only is up because it kills queries, which is no good [11:40:30] <_joe_> effie: mw-apiint in codfw is sufering a lot [11:40:38] <_joe_> maybe we should raise the number of pods? [11:40:44] _joe_: yes thank you just saw it [11:41:13] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127495 [11:41:15] <_joe_> at the very least a roll restart [11:41:16] Amir1: So yes, let's try that [11:41:18] were the pods restarted? [11:41:36] jynus: some yes [11:41:49] Amir1: I can merge and deploy [11:41:55] effie: on it [11:41:58] I was in a meeting and I have anothe rone in 4 minutes [11:42:06] yeah, if anything, things are getting worse [11:42:07] ok [11:42:16] volans: that is ok, we are enough on it atm [11:42:29] didn't get any page, saw the page now on -ops [11:42:36] Amir1: What happens if we disable that cron entirely? [11:42:51] <_joe_> effie: mw-api-int in codfw needs to be roll-restarted, should I do it? [11:42:53] Cause that's what we will have to do in the end, if it gets fixed by decreasing it all the time [11:42:56] volans: It started when I pinged you, no worries [11:43:04] _joe_: I am already there, can do it [11:43:17] <_joe_> marostegui: the problem is triggered by switching $something to php 8.1 [11:43:22] jynus: I got no ping :( [11:43:23] <_joe_> we'll have to find out what [11:43:41] _joe_: But why is reducing that cronjob fixing it? [11:43:49] <_joe_> marostegui: unclear [11:43:54] got it [11:43:56] _joe_: it could also be something triggered by the deploy, and not necesarilly the restart [11:44:02] <_joe_> but the problem clearly gets triggered by the deploy [11:44:07] I don't think disabling it will cause too much disruption, it doesn't serve a massive important feature [11:44:10] *not necesarilly the upgrade I mean [11:44:15] but I want to make sure it's this [11:44:21] <_joe_> jynus: the roll restart is because when php workers stay at 100% busy they hardly ever recover well by themselves [11:44:23] just a jobque restart or something [11:44:34] I will start by rolling back the patches I deployed [11:44:42] yes, no problem with that [11:44:51] what es host is overwhelmed? [11:44:59] Amir1: all shards [11:45:00] es2037 for instance [11:45:06] <_joe_> effie: wait you didn't revert already? 
[11:45:07] Amir1: https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=1741763693093&to=1741785293093 [11:45:15] mostly es 6 and es7, but all affected [11:45:31] <_joe_> effie: then yes, please [11:46:06] Amir: https://grafana.wikimedia.org/goto/SF6hpOhNR?orgId=1 [11:46:48] gone to staging, going to eqiad [11:47:52] Amir1: ok [11:48:00] Amir1: ping when it is deployed so we can monitor if it gets better [11:48:28] it's deployed in codfw too now [11:49:07] checking impact [11:49:31] <_joe_> positive I'd say [11:49:34] <_joe_> wth [11:49:49] es2037 is not recovering yet [11:49:57] <_joe_> I'm looking at mediawiki [11:49:59] yeah, no es improvement yet [11:50:06] _joe_: mediawiki errors has lag [11:50:12] <_joe_> not errors [11:50:15] <_joe_> latencies [11:50:30] I will run helmfile on the affected releases [11:50:51] <_joe_> effie: wait, they seem to be doing better rn [11:50:52] logstash could take some time to reflect last minute errors be aware [11:51:02] <_joe_> uhm not really [11:51:03] _joe_: it is essentially the reverts [11:51:11] <_joe_> effie: ok [11:51:14] so it should not affect things much [11:51:29] on es2037 when I ran show processlist, unlike yesterday, I can't find jobrunner IPs [11:51:55] yesterday it was all jobrunner IPs right now, it's mw-ext or web [11:52:00] marostegui: you keep monitoring es if you can [11:52:06] jynus: I am [11:52:12] <_joe_> notthing is fixed at [11:52:15] <_joe_> *atm [11:52:26] <_joe_> effie: can you please start from mw-api-int? [11:52:35] <_joe_> in codfw even [11:52:44] 5XX to clients rising [11:52:55] it seems that the helmfile rolling restart never completed [11:53:03] I am looking into it [11:53:25] <_joe_> effie: which restart? [11:53:34] the rolling pod restart [11:53:41] for jobqueue? [11:53:42] so nothing is reverted yet? [11:53:43] <_joe_> of which namespace/release? [11:53:51] <_joe_> marostegui: apparently [11:53:54] now it's job runners [11:54:00] _joe_: mw-api-int [11:54:15] Amir1: es still the same [11:54:16] it takes some time for jobs to finish [11:54:27] (so no new one being queued) [11:54:45] I'm fairly certain this job has a pathological bug [11:54:47] Amir1: but I belive last time recovery was rather quick? [11:54:58] that's not my recollection [11:54:58] jynus: Not that quick, it took a few minutes [11:55:08] ok, I stay corrected [11:55:52] one single job making this many queries: https://trace.wikimedia.org/trace/49b139c793c347fe58fc9b414e52f9d6 [11:55:57] Running helmfile on mw-api-int, to pick up the rollback [11:56:07] I err on the side of actually fully disabling it if it doesn't recover [11:57:12] Let's try one thing at a time [11:57:18] effie: let us know when it's finished [11:57:20] so, to sumarize [11:57:34] a patch to reduce concurrecny was deployed? Amir ? 
[11:57:41] jynus: yes [11:57:45] ongoing rollback by effie [11:57:49] waiting to complete [11:58:05] and once we see that, we will try something else, unless someone disagrees [11:58:18] feel free to prepare but not deploy nothing else, however [11:58:41] the rollback was technically not a rollback given that helm had rolled back already [11:58:43] please speak up if I said somethin incorrect or want to change something [11:59:14] I would like to manually bump mw-api-int workers in codfw whenever suits [11:59:19] parsoid paged, acked both pages [11:59:21] I will run helmfile fo jobrunners and parsoid, [11:59:40] the second one was 5xx on wikifeeds_cluster [12:00:14] those are secondary failures cased from the primary issue, I think [12:00:21] hnowlan: prep the patch [12:00:33] go [12:00:40] hnowlan: and I can deploy [12:01:21] I'm just going to manually edit on the deploy server to get us headroom and deploy now [12:01:43] we found some really terrible bug [12:01:49] unless there's an objection [12:01:55] please coordinate on the deploy, hnowlan and effie [12:02:06] hnowlan: lets prep the patch [12:02:06] as long as you don't step on each other [12:02:10] https://trace.wikimedia.org/trace/55e2f0c6045326438c389c979a1244cd?uiFind=28e4c205b217b32d [12:02:18] this query doesn't look right [12:02:23] Amir1: give some context [12:02:26] Amir1: right [12:02:35] it should have more conditions [12:02:52] with the current set, it picks a lot of revisions to parse [12:03:00] if you are not debugging, deploying or monitoring, please help documenting the steps others are taking [12:03:03] Amir1: but that query was there yesterday too, right? [12:03:27] yeah, I think we have some "pathological" jobs [12:03:32] but not all [12:03:38] (becuase of the condition) [12:04:00] but I'll debug further, I might miss something obvious [12:04:23] pages resolving for wikifeeds and parsoid [12:04:24] lmk how I can help [12:04:45] volans: help me put steps takeon into the doc, please [12:04:48] parsoid as well [12:04:55] sure [12:04:57] hnowlan: I will merge and deoloy [12:05:03] volans: so others are up to date (based on IRC chat) [12:06:22] effie: if you approve I have the command ready to go [12:07:06] effie: approve? [12:07:07] mw-web-ro errors resolved as well [12:07:39] effie: or are you merging? [12:07:41] hnowlan: go [12:07:51] jynus: hugh and I are on it [12:07:51] please do hnowlan [12:08:05] ok, leave you on your own, please update when done [12:09:24] done [12:09:45] any change on errors/ improvement, etc [12:09:49] ? [12:10:08] jynus: nothing on the es front [12:10:14] jynus: pages resolved for wikifeeds, parsoid, mw-web-ro [12:10:21] es graphs still high [12:10:21] (obviously I don't expect seeing changes imediatelly) [12:10:26] claime: that's good [12:10:52] sorry, mw-web-ro still firing [12:11:07] any ideas, should we stop/kill/restart jobqueue in any way? [12:11:15] or any of the jobs? [12:11:51] I can see this is jobs, I ran a show processlit and put it there [12:11:54] then cat show_processlist_es2037 | awk '{print $3 }' | cut -d':' -f1 | xargs -I{} dig -x {} | grep -A1 "ANSWER SECTION:" > res_es2037 [12:12:02] then cat res_es2037 | sort | uniq -c [12:12:12] https://www.irccloud.com/pastebin/zaEvBSur/ [12:12:16] Amir1: I assume what you deployed first was what worked yesterday first, right? 
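The awk | dig -x | sort | uniq -c pipeline above can also be done with the Python standard library; a small sketch, assuming a SHOW PROCESSLIST dump whose third column is host:port as in the paste:

```python
import socket
from collections import Counter

def count_clients(processlist_path: str) -> Counter:
    """Reverse-resolve client IPs from a SHOW PROCESSLIST dump and count them."""
    counts: Counter = Counter()
    with open(processlist_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3 or ":" not in fields[2]:
                continue  # header or malformed line
            ip = fields[2].split(":")[0]
            try:
                name = socket.gethostbyaddr(ip)[0]
            except OSError:
                name = ip  # no PTR record, keep the raw IP
            counts[name] += 1
    return counts

for name, hits in count_clients("show_processlist_es2037").most_common(20):
    print(f"{hits:6d} {name}")
```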
[12:12:36] jynus: It worked yesterday, but not today [12:12:41] We can´t kill a targeted job, best we could do is reduce concurrency, then redeploy cp-jq, then roll-restart mw-jobrunner (which would kill all jobs) [12:12:43] any other ideas to poke at the the job queue? [12:12:56] it hasn't worked yet. I think I want to either bring it down and fully disable it for now [12:13:04] Amir1: let's do that [12:13:40] claime: I know, but it seems massaging it worked yesterday, was asking other ideas to massage it today to at least mitigate the ongoing issues [12:14:40] marostegui: if looking at graphs, could you also have a look at edit rate and http errors to see how that is going? [12:14:48] yep [12:14:53] so you can update us on impact [12:15:17] I actually don't know whether setting to enabled to false would actually disable the job [12:15:17] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127500 [12:15:25] logstash mw errors look realatively low? [12:15:47] Amir1: it should [12:15:54] jynus: edit rate is barely impacted [12:16:07] shall we try this? [12:16:11] marostegui: that's good, I use it normally as an indicator of how bad uncached requests are [12:16:19] <_joe_> can I suggest instead to move mw-jobrunner back to php 7.4 entirely? [12:16:30] jynus: 500 and 503 are still very affected [12:16:47] ok, so mw errors logstash may be unreliable now [12:16:48] it's not a super important job, it puts RC entries about categories that have been added or removed to pages [12:17:02] <_joe_> Amir1: yeah but bear with me [12:17:21] <_joe_> if restarting all on php 7.4 fixes the issue, we might have found where the problem lies [12:17:56] the errors are coming from 7.4 [12:18:11] I mean, from both [12:18:17] <_joe_> it's due to es7 being overloaded [12:18:21] <_joe_> which affects everything [12:18:39] I can roll them all to 7.4, it is an easy one anyway [12:18:41] which of the 2 options suggested will be faster? [12:18:47] page again, acked [12:18:58] I'd try to decide for you by doing the faster first [12:19:06] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad [12:19:12] so we are not in analysis paralysis [12:19:16] still related right? [12:19:26] <_joe_> volans: no [12:20:00] sigh, checking [12:20:02] <_joe_> effie: let's try to move everything to 7.4 [12:20:05] we also have KubernetesDeploymentUnavailableReplicas [12:20:18] for mw-parsoid.codfw.main [12:20:24] <_joe_> sigh [12:20:54] I can look at the gateway [12:21:05] thx hnowlan [12:21:35] shouldn't be related [12:21:39] <_joe_> ok so [12:21:48] <_joe_> looks like parsoid in codfw is struggling [12:21:52] yes [12:22:04] volans: can you take over IC for me? [12:22:22] jynus: sure,what are the current assignments of people doing what? [12:22:28] re: parsoid, that could be caused solely by es being overloaded, right? [12:22:35] hnowlan: is looking at he gateway [12:22:35] <_joe_> claime: I think so yes [12:22:47] manuel is looking graphs for ES an overal impact [12:22:48] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/112750 hnowlan +1 ? [12:22:59] Amir is looking to disable a job [12:23:02] I'm just popping off ideas [12:23:25] <_joe_> effie: bad paste? 
[12:23:27] effi and joe are checking to restart some pods [12:23:30] understanding what's cause and what effect would help a lot to exclude focus our efforts on the effect-things [12:23:47] thx jynus [12:23:57] _joe_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127498 yes [12:23:58] * volans becomes IC [12:24:01] <_joe_> volans: something is overloading es, I would like to know which IPs are flooding it [12:24:31] _joe_: I can give you the list of IPs if you want [12:24:41] <_joe_> marostegui: or hostnames if you get them [12:24:42] I produced that list already [12:24:48] <_joe_> Amir1: where is it? [12:24:49] even dig I them [12:24:53] cumin1002 [12:24:54] my home [12:24:58] <_joe_> effie: that's not the right patch either [12:24:58] Deploying mw-jobrunner [12:24:59] show_processlist_es2037 [12:25:10] <_joe_> anyways, thanks [12:25:10] result of dig is this [12:25:23] https://www.irccloud.com/pastebin/zaEvBSur/ [12:25:27] <_joe_> Amir1: use phaste next time :) [12:25:35] I will! [12:26:29] _joe_: yeah :/ [12:26:30] https://phabricator.wikimedia.org/P74219 [12:27:25] thx added to the doc [12:27:39] _joe_: a new list: https://phabricator.wikimedia.org/P74220 [12:29:48] <_joe_> are we seeing any improvements? [12:29:49] api-ext might be traffic patterns? [12:29:52] Jobrunners are all reverted to 7.4 [12:29:54] I'm restarting prometheus mysqld exporter [12:29:59] on es2040 [12:30:06] _joe_: no on the es [12:30:17] <_joe_> I see parsoid is back to "working as expected" [12:30:28] <_joe_> no errors and latencies down [12:30:42] _joe_: still serving lots of 500 and 503 [12:30:43] it has been slowly going down for twenty minutes so it's hard to say the wind is blowing over or the changes are helping [12:30:43] <_joe_> same for mw-api-int [12:30:58] <_joe_> marostegui: where are you looking? [12:31:07] _joe_: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1 [12:31:14] https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=now-3h&to=now&viewPanel=5 [12:31:21] <_joe_> ah yes that's slightly slow to catch up [12:31:32] (slowly subsiding starting 12:15 UTC) [12:31:37] _joe_: And regarding es: https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=now-6h&to=now&viewPanel=3 [12:31:48] volans: do I update status page to monitoring? 
[12:32:07] let's wait a sec [12:32:17] <_joe_> yeah let's wait 3-5 minutes [12:32:44] <_joe_> so my hypothesis is that some job - maybe that maybe another [12:32:53] <_joe_> when running on php 8.1 does cause this effect [12:33:14] <_joe_> yeah errors are gone from the backends in mediawiki [12:34:29] when we looked at the traces (sampled) that are jobs and query es2037 during the outage, only that showed up, maybe it's just because it's high traffic job that also query this db [12:34:43] es2037 still not showing recovery [12:34:46] <_joe_> yeha that's possible [12:35:11] https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=SqlBlobStore_blob&orgId=1 - the miss ratio tracks with the outages here, so either we're requesting blobs for revisions that aren't in the cache, or we mangle the cache key somehow [12:35:40] but I haven't seen any other job doing any query to the es hosts we checked [12:35:41] <_joe_> mszabo: I suspect the latter [12:36:44] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1127017 would be neat here since currently we can't really attribute traces to a specific job unless they've done us the favor of issuing an attributed DB query from run() [12:36:53] SqlBlobStore_blob is memcached? [12:36:57] 1M cache miss hit per minute, lovely [12:36:59] volans: yes [12:39:53] <_joe_> now I don't know where we take the data here https://grafana.wikimedia.org/d/a97c66ff-0e10-4d2a-b9e1-37b96b7b4d35/parser-cache-misses?orgId=1&viewPanel=1&from=now-3h&to=now seems that it's "miss-redirect" in Parsercache [12:40:00] <_joe_> whatever that might mean [12:40:14] _joe_: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1 recovered [12:40:32] marostegui: monitoring then now? [12:40:32] <_joe_> marostegui: yeah as I told you, it's just slightly lagged [12:40:38] jynus: still bad [12:40:47] oh [12:40:49] <_joe_> marostegui: not user-facing bad though [12:40:49] jynus: es ^ [12:41:00] <_joe_> so I agree with jynus, we're monitoring [12:41:07] _joe_: Yeah, but es still way out of its normal values [12:41:18] this is just for the status page for users [12:41:20] <_joe_> marostegui: hence "monitoring" from the prespective of users [12:41:26] I think we should keep the ticket UBN open [12:41:28] yep I agree [12:41:33] +1 for monitoring [12:41:33] <_joe_> yeah 100% [12:41:40] we're not actively doing anything at this point [12:41:52] <_joe_> volans: well we might soon :) [12:42:06] also if you have some work to do, talk to managers so they priorize this [12:42:32] (some other work scheduled, I mean) [12:42:41] are we ok to update the status page with monitoring? [12:42:57] I'm gonna tag out, sorry folks [12:43:58] GatewayBackendErrorsHigh: api-gateway: resolved [12:44:07] <_joe_> volans: yes we are [12:44:07] hnowlan: did you do anything specifically? [12:44:30] * volans doing [12:45:03] still degraded performance or can I put operational? [12:45:15] es2037 seems to be doing better, still wait out of its normal values, but seems to be doing much better now [12:45:55] as an actionable, who could we ask to have a look at logstash (observability?) [12:46:05] so at a glance the cache keys don't seem off, this is a sample from ~3mins ago: "global:SqlBlobStore-blob:frwiki:es%3ADB%3A//cluster25/19854695?flags=utf-8,gzip" [12:46:22] <_joe_> so looking at parservache hit rate, it's back to normal for wikitext according to the dashboard. 
The only cache that seems to be affected is "parsoid_pcache" [12:46:27] volans: no, the spike isn't abnormal for the service but it being sustained caused the page I think. [12:46:38] volans: parsoid is still serving some errors, I am looking into it [12:47:06] <_joe_> but ES is still high usage right? [12:47:12] _joe_: yes [12:47:13] <_joe_> Amir1: let's try to kill that job [12:47:23] 🗡️ [12:47:51] merging [12:48:04] which one are youkilling now? [12:48:05] <_joe_> we won't know if that helps though [12:48:13] categorymembership [12:48:14] <_joe_> I see cache hit ratios reccovering [12:48:17] thx [12:48:47] we still have high signal with the ES load, to some extent [12:49:03] it is not trending down [12:49:03] <_joe_> jynus: is it slowly recovering? [12:49:09] see my last comment [12:49:21] <_joe_> yeah we wrote at the same time [12:49:28] https://grafana.wikimedia.org/goto/kyWZ1d2NR?orgId=1 [12:49:34] _joe_: we have 1 datapoint per minute in the mariadb dashboard [12:49:37] so give time :) [12:49:52] <_joe_> volans: not really, no [12:49:57] <_joe_> Amir1: let's try your patch [12:50:02] but I think mw stack recovered 10 minutes ago [12:50:08] going forward [12:50:25] I'm not saying give it time to recover, but give it time after a change to see the effect ;) [12:50:27] eqiad ongoing, codfw next [12:50:41] volans: sorry, I missunderstood, sure! [12:51:24] I was referring to 5xx -> ES graph [12:52:33] deployed [12:54:43] thx [12:55:01] waiting for grafana :D [12:55:17] yeah :( [12:55:44] <_joe_> grafana struggles at times yes [12:55:53] <_joe_> maybe we can ask to look into it [12:56:04] <_joe_> (I think it's thanos rather than grafana, but still) [12:56:23] I'd love to have like 15s datapoints [12:56:25] prometheus latency I think is normal, it groups sometimes with 5m aggregation [12:56:39] so that's ok [12:57:52] but I wonder if logstash got overloaded, of the errors caused extra strain on the app servers for logs to take more time to appear on logstash [12:57:59] *or [12:58:28] I don't see effects yet on the es2037 graphs [12:59:03] yeah, still the same [12:59:10] show processlist is much smaller now [12:59:30] another possibility is that there is one script that's overwhelming everything [12:59:39] (one connection) [12:59:51] Amir1: would that connection use wikiuser or wikiadmin? [13:00:07] I'm not seeing any wikiadmin in show processlist [13:00:27] quick question as I haven’t been following this discussion – is this deployment-blocking? [13:00:29] if it's something that opens and closes the connectiosn very fast we might not see it right? [13:00:30] yeah, I didn't see any yesterday. But as you said script, I was wondering if that'd use wikiadmin [13:00:31] I have a script that is actually checking blob of every revision of all wikis up to 2007 but that should be wikiadmin [13:00:59] (and definitely not commons or wikidata) [13:01:04] (sorry, nevermind, deployment window is in an hour not now) [13:01:50] the script is actually over now [13:01:59] (for days I guess) [13:02:51] [14:00:55] <+icinga-wm> PROBLEM - SSH on gerrit2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:54] is that something used? [13:03:04] used? [13:03:12] like is that host in use? [13:04:15] dunno, has role gerrit in puppet, we have 1003,2002,2003 [13:04:36] git works for me so not the active one I'd say [13:04:47] I will create a task [13:05:10] still nothing on the various graphs [13:05:13] what are next steps? [13:05:17] ideas? 
[13:06:33] I have none volans [13:07:38] <_joe_> please restart that job if it didn't help [13:07:51] <_joe_> I fear the problem will self-solve in a couple hours [13:08:01] why? [13:08:23] The only thing I see standing out, is parsoid stil serving some errors [13:08:47] might be a set of pages that are particularly hard to parse or load from ES? [13:09:03] parsoid mw errors: https://logstash.wikimedia.org/goto/ec07723df325378da6b2b4c48d59c1d8 [13:09:31] dcausse: not on all sections at the same time [13:09:49] <_joe_> anyways, I have to step away from the incident, sorry [13:10:06] <_joe_> marostegui: I fear the problem has been a swath of cache invalidations [13:10:26] Looking at it, it is the circuit breaking which is doing its job I reckon [13:10:28] <_joe_> in any case, ping me if nothing improves in 1-2 hours [13:10:35] thanks joe [13:10:44] there's also a spike in gadgets-definitions cache misses in the same timeframe: https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=gadgets_definition&orgId=1 [13:10:54] parsing triggers a lookup for the gadgets list because of the math extension [13:11:18] interestingly though, this should be a low cardinality key [13:11:26] I tried this https://phabricator.wikimedia.org/P74221 [13:11:56] ignoring LoadMonitor [13:11:57] <_joe_> mszabo: the issue seems to be widespread https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=backlinks&orgId=1&from=now-2d&to=now [13:12:00] Amir1: in greek please? [13:12:11] wikidata or commons seems to be doing a lot of queries there [13:12:21] thank you <3 [13:12:53] I don't have a baseline of normal right now to see if anything stands out [13:16:10] are the public facing errors low enough to set the status page to operational while still the incident is in monitoring? [13:16:28] At no time in the last 6 months, ES were so loaded as during those spikes [13:17:02] I was hoping to see a smaller version of that on other deploys [13:17:04] or something [13:17:11] I take a break [13:17:31] I will too [13:25:26] _joe_: there's a spike in memcached errors which seems to explain it - https://logstash.wikimedia.org/goto/2f7eb9e7a3437283b1e4ac423087de17 [13:27:22] if noone objects I'll set the status page to operational (inciident still open in monitoring) as the user facing errors AFAICT recovered [13:28:49] volans: deployers are asking if they can continue. I would defer that decision to serviceops. [13:29:12] I concur [13:30:07] mszabo: I don't know enough, but a mass memcache error leading to misses and producing es overload would fit [13:31:01] it looks quite likely since this total hits to memcached is stable and hasn't changed, if something new triggers this, it should add up at least [13:31:12] (the total should go up) [13:31:37] could the upgrade, or just a deploy alter the memcache config or behaviour to make memcache requests fail? [13:31:50] but this hasn't happened, probably means memcached decided to fail [13:32:13] yeah, same question, I don't know how to correlate it [13:36:07] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-6h&to=now&viewPanel=40 [13:36:13] can I get a +/-1 on the above intention for the status page please? 
[13:36:22] +1 [13:36:42] volans: +1 from me [13:37:28] thx [13:39:16] +1 [13:39:31] I need to prepare for the interview, will catch up later [13:40:47] The graph I linked says calls to memcached has reduced, that would explain a lot [13:40:56] it doesn't even try to read from the cache [13:40:58] mszabo: so I have been wondering for quite some time why the amount of errors from the mediawiki pov, is not reflecred on mcrouter's pov [13:41:41] I think I found why, it seems like mw was trying to connect to the default memecached address [13:42:02] memecache is deliberate [13:42:53] and not the mw-mcrouter address, which means that the pods created, something was wrong with mounting the file containing that info [13:43:21] I've put memcache on top of the doc, with an interrogation [13:43:28] once memcached is fixed, we should revert disabling the job [13:43:35] https://logstash.wikimedia.org/goto/a22269d0d2848fb7bddb067945249e15 [13:43:41] Amir1: it is fixed [13:45:28] deploying the revert then [13:45:29] in https://logstash.wikimedia.org/goto/a22269d0d2848fb7bddb067945249e15 , we see mw-web and mw-api-ext having their normal amount of errors [13:45:38] Amir1: let me finish please [13:45:55] sorry, stopped [13:45:59] on a more mw-php-side, not sure if it is ok to treat an error like a miss, which I would suppose it is what is happening here (?) [13:46:32] jynus: please let me finish [13:47:51] so there were many errors due to mediawiki trying to connect to 127.0.0.1 [13:50:11] For now I am looking at this pattern of parsoid failures https://logstash.wikimedia.org/goto/ff769b28684358cbd696cfc8a6e7296b, which seems to have stopped [13:51:29] I will dig deeper, but looking gat the memcached traffic dropping during the outage, and the amount of errors, we were flying without memcached [13:51:38] effie: are you suggesting that the 8.1 version has this address bug and the old one not? [13:52:23] so when deploying we start flying basically without memcached and all this happens? [13:52:52] first lets say that memcached and PHP 8.1 in mw-web and mw-api-ext are doing great, unless there is somethinhg I am not seeing, so please let me know if you see otherwise [13:53:40] yeah, it seems to be only these three cluster: jobrunner, api-int, parsoid [13:54:17] secondly, this is not a php8.1 and memcached issue, to by understanding so far, but take this with a grain of salt [13:54:56] I have to go to an interview, I will hold off reverting the job change for now since that could affect us [13:54:57] I'm trying to understand in your scenario 1) what triggers the change and 2) if we're back in a good state why ES doesn't recover, like it has to re-create all the misses back into the caches [13:55:04] the variable holding the memcached address is in a file tha we mount on the container [13:56:26] volans: let me check a couple of things and I will get back to you on that [13:57:10] k [13:58:25] Until when are we ok waiting to see if ES hosts recover? [13:58:46] We should probably try to set a deadline and start a plan of action if they've not recovered by that time [13:58:53] es2037 shows no trace of recovery [14:01:02] \o [14:01:10] marostegui: but has any user-facing impact? 
[14:01:58] <_joe_> mszabo: "item too big", UHHHH [14:01:58] or can hold it for a while like that, we're just with less room for more traffic [14:02:10] _joe_: yeah but it's a bit low volume [14:02:36] volans: It doesn't no, but there's clearly something that has radically changed and I don't think it can be ignored [14:02:39] I see memcached failures only on parsoid on codfw [14:02:49] so I will restart those pods if there are no objections [14:02:50] FYI backport+config window has started in #wikimedia-operations, let me know if we shouldn’t be deploying right now… [14:02:51] sry meeting [14:02:58] absolutely, I'm just trying to nderstand if few hours is acceptable or not [14:03:14] volans: I think so yeah, but it is easy to forget [14:03:17] Lucas_WMDE: We would like you to hold on for now [14:03:28] ack [14:03:38] _joe_: any objctions? [14:04:00] <_joe_> effie: no I agree [14:04:16] alright, volans I am restarting the pods on parsoid [14:04:22] k [14:04:26] (FYI nothing happened from my side, you stopped me well before the change had merged and scap started doing stuff) [14:04:43] effie: codfw only? [14:05:48] yes [14:08:01] I think that is the problem [14:08:03] https://usercontent.irccloud-cdn.com/file/1P8Er5c9/image.png [14:08:49] <_joe_> effie: ? [14:09:07] I see some recovery on ES cluster, both on the specific graphs and the aggregate ones [14:09:25] <_joe_> effie: have you done anything? [14:09:25] volans: correct [14:09:31] effie: there's an important fix scheduled for this deploy window (visual editor is broken), do we think we'll be able to deploy later today? [14:09:33] but we got GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster [14:09:40] hnowlan: did you find anything earlier? [14:09:49] kamila_: hopefully yes [14:09:55] thanks volans <3 [14:09:57] but not "right now" [14:10:00] ack [14:10:00] <_joe_> kamila_: not for effie, but the answer is "we'll see" [14:10:35] es2037 has recovered [14:11:08] <_joe_> logstash keeps timing out [14:11:16] _joe_: my first guess is that, we started rolling our php7.4 images mounting the wrong vars file [14:11:16] <_joe_> it's a bit hard to debug stuff this way [14:11:26] anyone from o11y here? [14:11:30] <_joe_> effie: uhhh why? [14:12:14] _joe_: complete guesswork, either the original patches for the rollout have an error we have missed [14:12:43] <_joe_> effie: in any case, please let's try re-rolling out your changes one by one, and maybe in more steps [14:13:09] I will have another go with scott during the late window [14:13:38] I suspect there is also something not ok with merging the fuckton of yaml [14:15:07] I think that also means that the incident report for yesterday and today, will have as many pages as crime and punishment [14:15:11] <_joe_> effie: that does not look probable to me [14:15:26] _joe_: I said complete guesswork didnt I [14:17:05] marostegui: is ES ok now? [14:17:24] fully recovered [14:17:39] effie: yes [14:17:41] <_joe_> did it recover as a consequence of something we did? [14:17:47] <_joe_> or just recovered by itself? [14:17:49] yes restart of parsoid pods on codfw [14:18:07] _joe_: I have no idea [14:18:12] it was the parsoid restart [14:18:14] <_joe_> didn't we *already* do it as I recommended for every deployment? [14:20:51] this is on me, going up the SAL, parsoid was never restarted because we also added more pods on parsoid [14:21:38] :/ [14:27:36] <_joe_> can we let people deploy? 
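What "flying without memcached" means mechanically: if the file that carries the mcrouter address is mounted at a path the code does not read (the guess above is a wrong vars-file mount), a naive loader falls back to a default address nothing listens on, every cache call errors out and is treated as a miss, and every read lands on the databases. A simplified illustration, not MediaWiki's actual configuration code; the file path and the fallback value are assumptions.

```python
import os

MCROUTER_ADDRESS_FILE = "/etc/mediawiki/mcrouter_address"  # assumed path, not the real one

def memcached_address() -> str:
    """Return the cache address, silently falling back to localhost if the mount is missing."""
    # If the image expects the file at one path but the pod mounts it at
    # another, this branch is never taken...
    if os.path.exists(MCROUTER_ADDRESS_FILE):
        with open(MCROUTER_ADDRESS_FILE) as fh:
            return fh.read().strip()
    # ...so every client connects here, where nothing is listening: each lookup
    # becomes an error handled as a miss, and the load shifts to the ES hosts.
    return "127.0.0.1:11211"
```

A stricter loader that refuses to start (or at least emits a metric) when the expected file is absent would have surfaced this at deploy time instead of as database overload.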
[14:27:44] dotting ts [14:27:50] and crossing is [14:28:20] <_joe_> yeah we're on the clock with letting them deploy though [14:28:30] <_joe_> volans: aren't you IC? [14:28:59] _joe_: yes and we deferred to serviceops earlier above to decide for the deploy ok/ko [14:29:08] from what I can see I think we're ok to let the deploy go [14:29:10] <_joe_> ok so let's go. [14:29:16] and I would also resolve the incident [14:29:34] y [14:29:39] yes [14:30:07] <_joe_> yes please [14:30:23] done [14:30:45] thanks y'all [14:31:14] effie: can I leave it to you to write some lines at the top of the document to summarize the last discoveries? [14:31:24] volans: I will write the whole thing [14:31:37] <3 [14:31:56] I will come back to and just put a really short version, but I will write the report and all next week [14:32:39] I think 4-5 lines at the top are enough for now to clarify the source of the issue yes [14:32:52] no need to write a book :D [14:37:54] <_joe_> did we restart the job amir disabled? [14:38:28] no amir didn't restart it because he had to go into an interview was planning to restart as soon as he's back [14:38:55] and we still have: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad that didn't recover [14:39:06] hnowlan: any finding on that from before? [14:40:17] volans: it deserves a post morten doesnt it [14:41:34] I said "for now" [14:41:38] ;) [14:48:40] volans: nothing yet, service is very quiet. I'll try to get something out of the logs [14:48:50] it's still functioning so it's at worst noise [14:49:00] paging noise :) [14:49:11] I was trying to follow wikitech advice but I can't only see some notice logs [14:49:30] if I filter out the notices I get nothing [14:49:32] oh, yeah, shouldn't page at all :| [14:50:05] sadly with envoy the options are between base access log or debug firehose [14:50:26] weird that it's coinciding with these issues though, can't see any issues with redis right now although I do see a drop in connections when the initial issues hit [14:51:31] we're actually seeing timeouts hitting the rate limit service itself [14:51:35] the dashboars says 5xx, can't we filter by 5xx on logstash? [14:51:44] I'm failing to see the field there [14:51:48] yes [14:51:51] it's in the page message [14:51:53] rate_limit_cluster [14:54:17] I'm going to roll_restart the cluster [14:54:34] ack [14:54:34] just to stop the pages hopefully [14:54:43] <3 [14:57:37] inflatador: when you get a chance, for your eyes https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126486 [14:59:06] godog: FYI there were some questions for o11y in the backlog during the incident [14:59:23] volans: hah! thank you, reading now [14:59:42] I somewhat suspect these aren't timeouts to the rate limit service at all, they're timeouts caused when trying to connect to a service backend that uses rate limiting [15:00:22] and we don't get the error at the service level bur the rate limiter? 
[15:01:27] there are errors at the service level also, they trend almost exactly like the rate limiter which is what's making me suspicious [15:02:01] https://grafana.wikimedia.org/goto/RIUYUdhNg?orgId=1 [15:02:27] you think the lw_inference_reference_need_cluster [15:02:31] yeah [15:02:34] I've brought it up with ml [15:02:57] if I pick 2 days they have fairly different graphs [15:03:57] (page just resolved fwiw) [15:04:00] it's a little weird that the alert is for rate_limit_cluster but the errors are higher for lw_inference_reference_need_cluster [15:04:08] I'll silence it if it comes back [15:04:25] hnowlan: thealert has [x3] or [x2] that means more than one firing but aggregated [15:04:34] and AM picks opne of the names [15:04:47] yes it's confusing [15:04:56] the wikifeeds alert from earlier was legit btw [15:10:31] <_joe_> wikifeeds was a side-effect of the mw outage [15:10:54] yeah [15:11:46] Luca has bumped resources on the affected lw cluster, rate limit errors are down (which... I don't quite get but I'll take) [15:12:08] need to get better metrics out of that thing, they've added prometheus support [15:15:15] FWIW re: grafana/thanos overload, related task is T385693 and there's now enough data in the missing recording rule that we can begin swapping them in dashboards [15:15:15] T385693: thanos-query overload due to heavy queries - https://phabricator.wikimedia.org/T385693 [15:15:26] hnowlan: I don't have a lot of context but the ml-serve clusters do have some per-istio-sidecar rate-limit in place, it should be relatively high but not sure if it played a role (I haven't touched it in a while) [15:15:45] godog ACK, just added +1 [15:15:48] re: logstash timeout, WFM now though I'd imagine possibly some heavy queries there too? not 100% sure [15:15:51] inflatador: thank you! [15:16:16] np [15:17:49] would y'all find it useful to expose memcached errors encountered by mediawiki as a metric? [15:18:41] elukey: in this case we're seeing the errors from the ratelimit service inside of the api-gateway, which is a little baffling as it should just be querying redis via nutcracker [15:18:57] ahhh okok sorry then I'll shut up [15:19:09] nono, useful info! [15:21:19] both gateway service errors trending in a good direction, phew [15:21:36] ticketed work to get better insight/a more recent version of the rate limit service [15:40:42] back from the meeting. Sorry, kamila_ [15:41:12] hnowlan: paged again FYI, this time just wikifeeds_cluster [15:41:22] can I help? [15:41:30] and we have a big spike like earlier today [15:42:58] np jynus [15:44:11] status: e.ffie++ figured out the es overload problem [15:44:40] kamila_: what was it? [15:44:47] looking volans - probably a real problem [15:45:00] https://grafana.wikimedia.org/goto/9aJMudhNR?orgId=1 [15:45:05] So I can update the doc [15:45:57] /etc/php8.1 vs /etc/php7.4 in the path in the image; e.ffie said she'd update the doc next week [15:46:09] ok, good that it was found [15:46:33] thanks, just trying to catch up with the latest developemnts, you can go back to your normal work [15:46:37] taking back oncall [15:47:09] looks like there was a spike in wikifeeds errors on the service also [15:48:01] and a big spike in requests [15:48:35] I've put the latest update on top and removed my comment [15:49:41] if the error rate follows the request rate than there is nothing "broken" more than usual, just more traffic hence more errors [15:49:56] are we alerting on absolute values or rate of errros? 
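On the absolute-versus-ratio question: one quick way to see how a threshold behaves is to pull both numbers from Prometheus/Thanos and compare them. A rough sketch against the standard HTTP query API, with a placeholder endpoint; it reuses the envoy upstream-request metric for these gateways, so the exact label values may need adjusting.

```python
import requests

PROM = "http://thanos-query.example.org"  # placeholder endpoint

def instant(query: str) -> float:
    """Run an instant query and return the first value, or 0.0 if the result is empty."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

SELECTOR = 'kubernetes_namespace=~"(api|rest)-gateway"'
errors = instant(f'sum(rate(envoy_cluster_upstream_rq{{{SELECTOR},envoy_response_code=~"5[0-9]+"}}[5m]))')
total = instant(f'sum(rate(envoy_cluster_upstream_rq{{{SELECTOR}}}[5m]))')
if total:
    print(f"5xx/s: {errors:.2f}  error ratio: {errors / total:.2%}")
else:
    print("no traffic seen")
```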
[15:50:31] <_joe_> mszabo: we used to have that metric and to alert on it [15:51:30] hnowlan: yeah the traffic and error rate follow the same pattern, so I guess we alert on absolute 5xx/s and not in % of the requests hence the page. [15:52:05] those 5xx is wikifeeds? [15:52:12] yes [15:52:22] thanks, I think I finally landed [15:53:07] https://grafana.wikimedia.org/goto/Tdbbud2Ng?orgId=1 upstream "row" (close the other) first 2 graphs, select wikifeeds_cluster in both [15:54:14] the alert uses `sum(irate(envoy_cluster_upstream_rq{kubernetes_namespace=~"(api|rest)-gateway", envoy_response_code=~"5[0-9]+"}[5m]))`, but the threshold is 5 which is a little low I guess [15:56:34] yeah, looking at the graphs, the error spikes should be enough to "notify the app owner" but not enough to p* us, it is just 5 errors per second [15:56:50] IMHO [15:58:19] yeah [16:18:21] FYI, I believe we've figured out what the issue was during the earlier 8.1 migration attempt, and are planning to try again at some point in the next 1h40m [16:18:39] cwhite: urandom: FYI ^ [16:18:48] e.ffie and I will keep you posted [16:18:57] or the pages will :D [16:19:02] *pager [16:19:03] lol [16:19:14] hopefully not [16:19:18] :) [16:19:41] :D [16:34:19] cwhite: logshash has been quite slow today, but even after the incident, are you aware of anything going on ? [16:41:01] Puppet is broken since three hours on the deployment servers, does that ring a bell to anyone? [16:41:03] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /srv/puppet_code/environments/production/modules/profile/functions/kubernetes/deployment_server/elasticsearch_external_services_config.pp, line: 32, column: 21) on node deploy2002.codfw.wmnet [16:41:48] moritzm sounds like it's my fault, let me check [16:44:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125234 removed the last host with role(elasticsearch::cloudelastic)...that may have affected the external services list. brouberol does that seem plausible? [16:45:24] I think it's ok. I don't think that external service is being used anywhere [16:45:32] let me check [16:47:11] a quick search in deployment-charts yielded no match for ES external services [16:51:35] I'm just wondering if the puppet code matched on cloudelastic (which still has some hieradata related to elasticsearch), but then returned an empty list for cluster members since there are no elastic hosts anymore. Maybe something like that is confounding Puppet [16:55:55] brouberol would you mind if we remove elasticsearch_external_services_config.pp, since no one is using it? We can replace it with an opensearch config once our migration is done [16:59:45] sure [17:00:20] cool, let me get a patch started [17:06:14] inflatador: please sort it out asap, we are about to deploy [17:07:46] it is blocking us [17:08:12] effie I get that. I whipped up a patch, LMK if it looks good https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127560 [17:09:25] looks like it needs more work. revising... [17:11:11] inflatador: can we revert the original patch ? [17:11:50] effie: we're still trying to consume the queue from the incident earlier today. millions of copies of `Memcached error: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY` [17:13:45] looks like eqiad finished consuming the backlog a few minutes ago. 
searches will still be a bit slower due to the larger-than-normal indexes [17:13:57] alright thank you very much for the update [17:14:30] looking at the rate limit page :( [17:16:05] effie I'd prefer to roll forward, as this code needs to be removed regardless or it'll be a time bomb. PCC is happy now, I can merge if that works for y'all [17:16:31] inflatador: the PCC diff shows *a lot* being removed [17:16:57] hnowlan: thanks! lmk what you find out in case it fires after-hours [17:16:57] as scott said, the diff is overwhelming [17:17:07] swfrench-wmf yeah, it looks like it's removing every single IP for every single Elastic hosts...which is ~120 servers [17:17:22] It looks OK to me [17:18:08] just to confirm, this is because no k8s service actually uses these? is that correct? [17:18:15] keep in mind that no one is actually using this [17:18:50] I think the problem arose because we migrated 100% of cloudelastic to the opensearch role earlier today, and the query used by puppet started returning empty for cloudelastic [17:19:20] we **can** rollback if y'all prefer, but this is just going to happen again as we continue our migration [17:19:29] inflatador: it's the "no one is actually using this" part that I'm trying to confirm :) [17:20:17] swfrench-wmf Good point. I think a rollback might be more prudent [17:20:22] Let me get that started [17:20:22] inflatador: we would be very grateful if you would please revert, which would allow us to move forward with our scheduled migration [17:20:43] and we promise, we will be delighted if you'd pick this up afterwards :) [17:21:58] effie ACK, reverted via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127565 [17:22:28] ah, I think this is what brouberol was saying above - i.e., no use of these external services endpoints lists by any existing k8s service [17:22:59] swfrench-wmf MW itself doesn't use this, does it? [17:24:36] it sounds like no, but that's the only service or services I'd really worry about [17:24:54] AFAICT, no [17:25:12] same here, I'd expect it to use the LVS endpoints [17:25:46] mostly as an artifact of the external_services not yet being adopted for network policies in MW [17:25:50] :) [17:26:01] LOL [17:26:33] also, elastic is one of the last (maybe the last) not to use Envoy for TLS termination. We'll probably start work on that once we're on Opensearch [17:28:35] inflatador: you have puppet-merged that revert, correcT? [17:28:49] inflatador: I will run it ok [17:28:54] ? [17:29:04] inflatador: ? [17:29:48] I merged it [17:30:38] swfrench-wmf effie sorry, forgot to puppet-merge ;( [17:32:08] inflatador: it is still producing an error I am afraid [17:32:26] I am running puppet again, but still [17:34:23] effie is it the same error? [17:35:01] yes, may needs to run puppet on other hosts? [17:35:52] inflatador: I'm wondering if, given the way the PQL in the external services "builder" logic works, implies that cloudelastic1012 needs to have a puppet run [17:35:56] effie I can try, but it will definitely err when I run against cloudelastic1012 [17:36:09] I am running puppet there now [17:36:18] I just started it [17:36:41] it will try and install elasticsearch on top of opensearch, which should be...interesting ;P [17:37:29] don't foresee any problems besides a puppet failure, but I'm banning it just in case [17:43:00] effie how's it going? I'd expect Puppet to fail on cloudelastic1012, but maybe just running it will create the structure needed for Puppet to succeed on the deployment server? 
[17:43:11] inflatador: running puppet on 1012 moved the deployments puppet to move forward [17:43:41] effie excellent, hit me up when it's done and we'll work on getting rid of that code permanently [17:44:41] inflatador: beats me [17:44:43] https://usercontent.irccloud-cdn.com/file/n6Cl9OzX/image.png [17:45:49] WOW! that ran without any errors? Was not expecting that [17:46:21] Still gonna have to reimage, as it will be a disaster whenever a service tries to restart, but LOL [18:26:50] inflatador: deployment is done, so I think you're good to move ahead with retiring the external services support for elastic search, and then the role switch / reimage [18:27:41] swfrench-wmf thanks for the update, will move fwd and keep y'all posted [18:27:52] ack, thanks! [18:27:58] status update on the PHP 8.1 front, this is now done - no issues encountered this time around [18:28:20] I'll post a summary to the incident doc shortly [18:30:06] merged/running puppet on deploy2002 now [18:40:12] deploy2002 puppet run says `Error: Could not send report: Error 500 on SERVER: Server Error: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null` . I'll run again, but if anyone has ideas LMK [18:40:59] this is "expected" https://phabricator.wikimedia.org/T388629 [18:41:10] independent of any issues you might be seeing, that is [18:41:41] Ah, thanks sukhe . In that case, everything seems good [19:56:47] For those of y'all who did the appserver-to-wikikube reimages, how often did y'all run into reimage failures? I ask because we're about to reimage ~110 hosts, and out of the 8 we've done so far, 4 have had issues that seemingly went away when I updated the firmware. Just wondering if it's worth the effort to proactively update firmware on all the hosts [19:58:25] I didn't do any of the wikikube reimages per se, but I've run into my fair share that failed until I was on an updated firmware [19:59:14] you're at 50%, that sounds about right :) [20:01:25] dear SREs, has the root cause for today's incident fully addressed? I want to re-enable the job I disabled today since this will have impact on editors [20:01:30] I had to downgrade firmware for 80 of our hosts a couple years back, so I might dust off this playbook and see if I can get it to work again https://gitlab.wikimedia.org/repos/search-platform/sre/stage-firmware-update/-/tree/main [20:28:41] Amir1: yes do so please, the root cause was a deployment that basically had api-int, parsoid, and jonrunnings running without memcached [20:29:22] yeah, I wanted to make sure that is fixed for good before adding load [20:29:23] the memcached errors alerts didnt alert us today either [20:30:39] I promised volan.s that I would right something on the doc, but by the time scott and I were done with the 8.1 rollout, it was 8pm here [20:31:46] Amir1: it has been fixed since 14.20 UTC or so [20:32:14] I am off [20:32:33] enjoy your evening [20:50:43] inflatador: I don't recall having firmware version issues, though I had others [20:52:27] kamila_ ACK, I've had repeated problems with the host hanging at the PXE boot screen, did you have any problems like that? FWiW I've been reimaging R450s so far [20:53:17] only with 10G NICs, not on any of the "normal" appservers [20:56:45] Amir1: scott added a tl;dr :) [21:28:40] All our hosts have 10G NICs :S
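On the firmware question at the end: before deciding whether to proactively update ~110 hosts, it can help to first inventory what each machine is actually running, via the management controllers. A rough sketch against the standard Redfish firmware-inventory endpoint, assuming Redfish-capable BMCs (e.g. iDRAC) and placeholder credentials; the linked playbook or the usual fleet tooling would be the proper home for something like this.

```python
import requests

def firmware_inventory(bmc: str, user: str, password: str) -> dict[str, str]:
    """Return {component name: version} as reported by a Redfish BMC."""
    session = requests.Session()
    session.auth = (user, password)
    session.verify = False  # mgmt-network BMCs often use self-signed certs
    base = f"https://{bmc}/redfish/v1/UpdateService/FirmwareInventory"
    inventory: dict[str, str] = {}
    for member in session.get(base, timeout=30).json().get("Members", []):
        item = session.get(f"https://{bmc}{member['@odata.id']}", timeout=30).json()
        inventory[item.get("Name", "unknown")] = item.get("Version", "unknown")
    return inventory

# Example: compare the BIOS entry across the fleet and only schedule updates
# for hosts that differ from a version known to reimage cleanly.
```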