[08:30:39] hmmm I got a failure on the puppet reimage of lvs6001 due to a failure of a puppet run on cumin1002 and that triggered that the cookbook revoked lvs6001 puppet cert if I'm properly reading the cookbook output [08:31:08] https://www.irccloud.com/pastebin/yeMNMSp8/ [08:31:39] volans: is this expected behavior? the host was basically ok and now I need to reimage it again :_) [08:32:13] vgutierrez: doens't mean forcely that you have to do it [08:36:00] * vgutierrez wondering why that puppet-run failed [08:36:22] vgutierrez: I'm checking between two options what's the best for you [08:37:33] vgutierrez: I don't see a failure on puppetboard so I think run-puppet-agent timed out while waiting for the lock due to another puppet runnng [08:37:48] in 3 seconds? [08:38:10] `00:03<00:00, 3.20s/hosts` [08:38:57] yeah weird [08:39:01] rc=1 [08:39:07] so no ssh problem [08:39:37] started 08:20:09,353 failed at 08:20:12,540 [08:40:11] Mar 13 08:20:12 cumin1002 puppet-agent[3742273]: Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists) [08:40:15] race condition? [08:40:59] most likely between the 'wait_for_puppet' check in run-puppet-agent and when it actually runs puppet [08:41:05] lucky you! :D [08:41:46] lovely, so the reimage cookbook stopped before rebooting the host [08:41:52] so for the reimage, two options, one is to re-run it with --no-pxe (and I think you need --new) and it should work unless it fails some pre-condition. Other option [08:43:57] sre.puppet.renew-cert with "D{lvs6001.drmrs.wmnet}" (direct backend to bypass the fact it's not in puppetdb already) [08:45:03] and then a manual reboot a manual check it run puppet at boot successfully and a manual run of the netbox script https://netbox.wikimedia.org/extras/scripts/2/ [08:45:31] I'll send a fix for the reimage to not remove it from puppetdb past the first puppet run [08:46:58] trying with --no-pxe [08:47:08] I had to manually skip the downtime of the host but nothing too bad so far [08:47:19] finger crossed [08:48:40] https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/1127460 [09:04:13] hmm host seems good, not so sure about netbox [09:04:30] https://www.irccloud.com/pastebin/UJz7xxvE/ [09:06:12] volans: hmm actually we got the same problem on lvs6002 and in lvs6003 [09:06:23] 301 topranks [09:06:26] netbox has the names of the old interfaces (bullseye) rather than the new ones (bookworm) [09:06:43] it's probably because of the vlans specific to LVS hosts [09:06:47] but I don't remember seeing this error before [09:09:33] hmm thanks [09:09:40] :) [09:09:45] * vgutierrez hides [09:09:47] I thought we fixed this already [09:10:00] let me have a quick look [09:14:00] at least all lvs@drmrs are impacted [09:14:20] dunno about ulsfo, eqsin and magru but those could be affected as well [09:18:00] yeah fwiw the issue is because they have the interface relations properly modelled in Netbox [09:18:08] I assume more recently imported than the others [09:18:11] https://usercontent.irccloud-cdn.com/file/ELFiK5Vi/image.png [09:18:36] the script is trying to delete enp175s0f0np0, but the db is throwing an error as the vlan interfaces are children of it [09:18:55] so we should first create the new interface, move the vlan there and then delete? [09:19:01] or just rname the existing interface? 
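Going back to the puppet failure at the top of the log: the race is between run-puppet-agent's wait_for_puppet check (which saw no agent running) and the moment it actually starts puppet, by which point another agent run had grabbed /var/lib/puppet/state/agent_catalog_run.lock. The real fix went into the reimage cookbook (the gerrit change above); the sketch below is only a generic illustration of tolerating that race with a retry, invoking run-puppet-agent over plain ssh rather than through cumin/spicerack as the cookbook does.

```python
import subprocess
import time

LOCK_MSG = "Run of Puppet configuration client already in progress"

def run_puppet_with_retry(host: str, attempts: int = 5, delay: int = 30) -> None:
    """Run puppet on a host, retrying when another agent run holds the lock."""
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(
            ["ssh", host, "run-puppet-agent"],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return
        output = proc.stdout + proc.stderr
        if LOCK_MSG in output and attempt < attempts:
            # Another run (e.g. the regular puppet timer) won the race:
            # wait for it to finish instead of failing the whole reimage.
            time.sleep(delay)
            continue
        raise RuntimeError(f"puppet run on {host} failed: {output.strip()}")
```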
[09:19:03] I'll have to fix up the script, it won't take long but need a short amount of time [09:19:26] I think it's easier to delete & re-create, as it covers all potential changes in the setup of vlans on a host [09:19:36] which can even be completely new hardware for instance [09:20:12] if we try to design it to migrate between two "well understood" interface setups that we expect it's probably more trouble to maintain, if we ever have something else down the road [09:21:34] topranks: if we rename those 3 hosts's main interface manually and then re-run the script would it work? [09:21:52] I'll fix the script cos it needs to happen [09:21:59] ok [09:22:05] if you're blocked now yep, we can just delete all the interfaces on the host (bar mgmt) in Netbox [09:22:19] well.... what stage are you at now, is the reimage ongoing? [09:22:24] or does it need to be restarted? [09:22:26] it's finished [09:22:43] I think [09:22:48] netbox sync is the last bit [09:23:17] ok, so we only need to re-run the puppetdb import script? [09:23:58] yes when it's ready with the fix [09:24:06] v.alentin can confirm though [09:24:24] ok.... fwiw I moved to fast and already deleted the secondary ints on lvs6001 [09:24:37] but yep I'll prep the fix and test with netbox-next which is the same [09:25:13] vgutierrez: I also deleted the secondary ints on lvs6002 and lvs6003 manually, so you won't get the issue with them even if before the fix is in place [09:35:04] ack [09:47:29] topranks: that's the first part of the fix BTW, right now in netbox we got the old ifaces [09:48:02] the script just imports what's in puppetdb, so that's where the problem is if it's pulling in the old names [09:48:10] if it's not been run yet that's expected [09:48:31] enp175s0f0np0 instead of ens3f0np0 [09:48:35] hmm so I need to run it manually? [09:51:03] nah I'll tidy that up when I have the new script don't worry about that [09:51:14] I only deleted the stuff manually to unblock you if it was needed [09:51:24] btw - I don't see any vlan sub-interfaces on lvs6001 right now? [09:51:43] (in my testing they weren't re-created properly, but it's because they are not in puppetdb, because they are not configured on the server) [09:53:05] topranks: no vlans needed.. IPIP magic [09:53:24] I think I mentioned it to you the other day that liberica roles aren't creating the vlans [09:53:27] ok... well there you go, better to delete them than rename cos they don't exist anymore [09:53:32] ok [09:53:40] that's good [09:53:42] as it's a requirement for liberica to use IPIP [09:53:51] (so we can switch back and forth between IPVS and katran) [09:57:02] actually that is what the difference is here [09:57:17] the script already catered for the scenario where the parent_int of the vlans changed [09:57:32] it would process all of those normally before it got to deleting non-existant interfaces [09:57:50] but here - as there are no longer any vlan ints on the box - the old vlan ints were left untouched [09:58:07] and then when it tried to delete the old physical int (with old wrong name), it failed [10:31:51] topranks: so that's my fault, please increase your beer counter [10:32:26] <_joe_> oh you have beer counters? 
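A rough sketch of the delete-and-recreate ordering just described, with pynetbox: drop the stale VLAN sub-interfaces before the physical interface they hang off, so the database does not reject the delete the way it did for enp175s0f0np0. This is not the actual Netbox import script (which runs as a Netbox custom script); the URL, token and device name are placeholders, and it assumes a NetBox version where a sub-interface references its parent via the `parent` field.

```python
import pynetbox

nb = pynetbox.api("https://netbox.example.org", token="REDACTED")  # placeholders

def delete_stale_interfaces(device_name: str) -> None:
    """Delete all non-mgmt interfaces of a device, VLAN sub-interfaces first."""
    stale = [
        iface for iface in nb.dcim.interfaces.filter(device=device_name)
        if not iface.mgmt_only
    ]
    # Children first: deleting a parent that still has sub-interfaces is
    # exactly what raised the error in the log, so order matters.
    for iface in sorted(stale, key=lambda i: i.parent is None):
        iface.delete()
    # The import step would then recreate interfaces from the PuppetDB facts
    # (the new bookworm names such as ens3f0np0) and re-attach their IPs.

delete_stale_interfaces("lvs6001")
```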
[10:40:59] _joe_: https://en.wikipedia.org/wiki/Two_pound_coin#/media/File:British_two_pound_coin_2016_obverse.png is what they look like :) [10:41:23] * Emperor definitely showing their age here, it's been years since you could actually buy a pint with a beer token [11:05:14] <_joe_> I was about to ask [11:36:53] so we have another bad rollout that is related with yesterday's incident turns out [11:37:05] I set up an alert on the status page errors still ongoing [11:37:17] just open search had lag [11:37:25] Amir1: pinging you as we are discussing it [11:37:38] scap has rolled back and we have restarted jobqueue [11:37:53] joe suspects a misbehaving job [11:38:29] so I will make another rollout attempt by leaving out the jobrunners [11:38:43] some correlation as regards jobs fairly significant p99 spikes for most jobs when the deploy starts - https://grafana.wikimedia.org/goto/3Rze5O2NR?orgId=1 [11:38:47] jynus: status? [11:38:57] <_joe_> hnowlan: that's a consequence yes [11:39:10] effie: the same as yesterday [11:39:19] the initial peak was bad, but errors still ongoing [11:39:30] jynus: is it safe to assume that it will recover eventually ? [11:39:39] not at the moment [11:39:46] shite [11:40:26] es only is up because it kills queries, which is no good [11:40:30] <_joe_> effie: mw-apiint in codfw is sufering a lot [11:40:38] <_joe_> maybe we should raise the number of pods? [11:40:44] _joe_: yes thank you just saw it [11:41:13] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127495 [11:41:15] <_joe_> at the very least a roll restart [11:41:16] Amir1: So yes, let's try that [11:41:18] were the pods restarted? [11:41:36] jynus: some yes [11:41:49] Amir1: I can merge and deploy [11:41:55] effie: on it [11:41:58] I was in a meeting and I have anothe rone in 4 minutes [11:42:06] yeah, if anything, things are getting worse [11:42:07] ok [11:42:16] volans: that is ok, we are enough on it atm [11:42:29] didn't get any page, saw the page now on -ops [11:42:36] Amir1: What happens if we disable that cron entirely? [11:42:51] <_joe_> effie: mw-api-int in codfw needs to be roll-restarted, should I do it? [11:42:53] Cause that's what we will have to do in the end, if it gets fixed by decreasing it all the time [11:42:56] volans: It started when I pinged you, no worries [11:43:04] _joe_: I am already there, can do it [11:43:17] <_joe_> marostegui: the problem is triggered by switching $something to php 8.1 [11:43:22] jynus: I got no ping :( [11:43:23] <_joe_> we'll have to find out what [11:43:41] _joe_: But why is reducing that cronjob fixing it? [11:43:49] <_joe_> marostegui: unclear [11:43:54] got it [11:43:56] _joe_: it could also be something triggered by the deploy, and not necesarilly the restart [11:44:02] <_joe_> but the problem clearly gets triggered by the deploy [11:44:07] I don't think disabling it will cause too much disruption, it doesn't serve a massive important feature [11:44:10] *not necesarilly the upgrade I mean [11:44:15] but I want to make sure it's this [11:44:21] <_joe_> jynus: the roll restart is because when php workers stay at 100% busy they hardly ever recover well by themselves [11:44:23] just a jobque restart or something [11:44:34] I will start by rolling back the patches I deployed [11:44:42] yes, no problem with that [11:44:51] what es host is overwhelmed? [11:44:59] Amir1: all shards [11:45:00] es2037 for instance [11:45:06] <_joe_> effie: wait you didn't revert already? 
[11:45:07] Amir1: https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=1741763693093&to=1741785293093 [11:45:15] mostly es 6 and es7, but all affected [11:45:31] <_joe_> effie: then yes, please [11:46:06] Amir: https://grafana.wikimedia.org/goto/SF6hpOhNR?orgId=1 [11:46:48] gone to staging, going to eqiad [11:47:52] Amir1: ok [11:48:00] Amir1: ping when it is deployed so we can monitor if it gets better [11:48:28] it's deployed in codfw too now [11:49:07] checking impact [11:49:31] <_joe_> positive I'd say [11:49:34] <_joe_> wth [11:49:49] es2037 is not recovering yet [11:49:57] <_joe_> I'm looking at mediawiki [11:49:59] yeah, no es improvement yet [11:50:06] _joe_: mediawiki errors has lag [11:50:12] <_joe_> not errors [11:50:15] <_joe_> latencies [11:50:30] I will run helmfile on the affected releases [11:50:51] <_joe_> effie: wait, they seem to be doing better rn [11:50:52] logstash could take some time to reflect last minute errors be aware [11:51:02] <_joe_> uhm not really [11:51:03] _joe_: it is essentially the reverts [11:51:11] <_joe_> effie: ok [11:51:14] so it should not affect things much [11:51:29] on es2037 when I ran show processlist, unlike yesterday, I can't find jobrunner IPs [11:51:55] yesterday it was all jobrunner IPs right now, it's mw-ext or web [11:52:00] marostegui: you keep monitoring es if you can [11:52:06] jynus: I am [11:52:12] <_joe_> notthing is fixed at [11:52:15] <_joe_> *atm [11:52:26] <_joe_> effie: can you please start from mw-api-int? [11:52:35] <_joe_> in codfw even [11:52:44] 5XX to clients rising [11:52:55] it seems that the helmfile rolling restart never completed [11:53:03] I am looking into it [11:53:25] <_joe_> effie: which restart? [11:53:34] the rolling pod restart [11:53:41] for jobqueue? [11:53:42] so nothing is reverted yet? [11:53:43] <_joe_> of which namespace/release? [11:53:51] <_joe_> marostegui: apparently [11:53:54] now it's job runners [11:54:00] _joe_: mw-api-int [11:54:15] Amir1: es still the same [11:54:16] it takes some time for jobs to finish [11:54:27] (so no new one being queued) [11:54:45] I'm fairly certain this job has a pathological bug [11:54:47] Amir1: but I belive last time recovery was rather quick? [11:54:58] that's not my recollection [11:54:58] jynus: Not that quick, it took a few minutes [11:55:08] ok, I stay corrected [11:55:52] one single job making this many queries: https://trace.wikimedia.org/trace/49b139c793c347fe58fc9b414e52f9d6 [11:55:57] Running helmfile on mw-api-int, to pick up the rollback [11:56:07] I err on the side of actually fully disabling it if it doesn't recover [11:57:12] Let's try one thing at a time [11:57:18] effie: let us know when it's finished [11:57:20] so, to sumarize [11:57:34] a patch to reduce concurrecny was deployed? Amir ? 
[11:57:41] jynus: yes [11:57:45] ongoing rollback by effie [11:57:49] waiting to complete [11:58:05] and once we see that, we will try something else, unless someone disagrees [11:58:18] feel free to prepare but not deploy nothing else, however [11:58:41] the rollback was technically not a rollback given that helm had rolled back already [11:58:43] please speak up if I said somethin incorrect or want to change something [11:59:14] I would like to manually bump mw-api-int workers in codfw whenever suits [11:59:19] parsoid paged, acked both pages [11:59:21] I will run helmfile fo jobrunners and parsoid, [11:59:40] the second one was 5xx on wikifeeds_cluster [12:00:14] those are secondary failures cased from the primary issue, I think [12:00:21] hnowlan: prep the patch [12:00:33] go [12:00:40] hnowlan: and I can deploy [12:01:21] I'm just going to manually edit on the deploy server to get us headroom and deploy now [12:01:43] we found some really terrible bug [12:01:49] unless there's an objection [12:01:55] please coordinate on the deploy, hnowlan and effie [12:02:06] hnowlan: lets prep the patch [12:02:06] as long as you don't step on each other [12:02:10] https://trace.wikimedia.org/trace/55e2f0c6045326438c389c979a1244cd?uiFind=28e4c205b217b32d [12:02:18] this query doesn't look right [12:02:23] Amir1: give some context [12:02:26] Amir1: right [12:02:35] it should have more conditions [12:02:52] with the current set, it picks a lot of revisions to parse [12:03:00] if you are not debugging, deploying or monitoring, please help documenting the steps others are taking [12:03:03] Amir1: but that query was there yesterday too, right? [12:03:27] yeah, I think we have some "pathological" jobs [12:03:32] but not all [12:03:38] (becuase of the condition) [12:04:00] but I'll debug further, I might miss something obvious [12:04:23] pages resolving for wikifeeds and parsoid [12:04:24] lmk how I can help [12:04:45] volans: help me put steps takeon into the doc, please [12:04:48] parsoid as well [12:04:55] sure [12:04:57] hnowlan: I will merge and deoloy [12:05:03] volans: so others are up to date (based on IRC chat) [12:06:22] effie: if you approve I have the command ready to go [12:07:06] effie: approve? [12:07:07] mw-web-ro errors resolved as well [12:07:39] effie: or are you merging? [12:07:41] hnowlan: go [12:07:51] jynus: hugh and I are on it [12:07:51] please do hnowlan [12:08:05] ok, leave you on your own, please update when done [12:09:24] done [12:09:45] any change on errors/ improvement, etc [12:09:49] ? [12:10:08] jynus: nothing on the es front [12:10:14] jynus: pages resolved for wikifeeds, parsoid, mw-web-ro [12:10:21] es graphs still high [12:10:21] (obviously I don't expect seeing changes imediatelly) [12:10:26] claime: that's good [12:10:52] sorry, mw-web-ro still firing [12:11:07] any ideas, should we stop/kill/restart jobqueue in any way? [12:11:15] or any of the jobs? [12:11:51] I can see this is jobs, I ran a show processlit and put it there [12:11:54] then cat show_processlist_es2037 | awk '{print $3 }' | cut -d':' -f1 | xargs -I{} dig -x {} | grep -A1 "ANSWER SECTION:" > res_es2037 [12:12:02] then cat res_es2037 | sort | uniq -c [12:12:12] https://www.irccloud.com/pastebin/zaEvBSur/ [12:12:16] Amir1: I assume what you deployed first was what worked yesterday first, right? 
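The awk | dig -x | sort | uniq -c pipeline above can also be done with the Python standard library; a small sketch, assuming a SHOW PROCESSLIST dump whose third column is host:port as in the paste:

```python
import socket
from collections import Counter

def count_clients(processlist_path: str) -> Counter:
    """Reverse-resolve client IPs from a SHOW PROCESSLIST dump and count them."""
    counts: Counter = Counter()
    with open(processlist_path) as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 3 or ":" not in fields[2]:
                continue  # header or malformed line
            ip = fields[2].split(":")[0]
            try:
                name = socket.gethostbyaddr(ip)[0]
            except OSError:
                name = ip  # no PTR record, keep the raw IP
            counts[name] += 1
    return counts

for name, hits in count_clients("show_processlist_es2037").most_common(20):
    print(f"{hits:6d} {name}")
```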
[12:12:36] jynus: It worked yesterday, but not today [12:12:41] We can´t kill a targeted job, best we could do is reduce concurrency, then redeploy cp-jq, then roll-restart mw-jobrunner (which would kill all jobs) [12:12:43] any other ideas to poke at the the job queue? [12:12:56] it hasn't worked yet. I think I want to either bring it down and fully disable it for now [12:13:04] Amir1: let's do that [12:13:40] claime: I know, but it seems massaging it worked yesterday, was asking other ideas to massage it today to at least mitigate the ongoing issues [12:14:40] marostegui: if looking at graphs, could you also have a look at edit rate and http errors to see how that is going? [12:14:48] yep [12:14:53] so you can update us on impact [12:15:17] I actually don't know whether setting to enabled to false would actually disable the job [12:15:17] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127500 [12:15:25] logstash mw errors look realatively low? [12:15:47] Amir1: it should [12:15:54] jynus: edit rate is barely impacted [12:16:07] shall we try this? [12:16:11] marostegui: that's good, I use it normally as an indicator of how bad uncached requests are [12:16:19] <_joe_> can I suggest instead to move mw-jobrunner back to php 7.4 entirely? [12:16:30] jynus: 500 and 503 are still very affected [12:16:47] ok, so mw errors logstash may be unreliable now [12:16:48] it's not a super important job, it puts RC entries about categories that have been added or removed to pages [12:17:02] <_joe_> Amir1: yeah but bear with me [12:17:21] <_joe_> if restarting all on php 7.4 fixes the issue, we might have found where the problem lies [12:17:56] the errors are coming from 7.4 [12:18:11] I mean, from both [12:18:17] <_joe_> it's due to es7 being overloaded [12:18:21] <_joe_> which affects everything [12:18:39] I can roll them all to 7.4, it is an easy one anyway [12:18:41] which of the 2 options suggested will be faster? [12:18:47] page again, acked [12:18:58] I'd try to decide for you by doing the faster first [12:19:06] GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad [12:19:12] so we are not in analysis paralysis [12:19:16] still related right? [12:19:26] <_joe_> volans: no [12:20:00] sigh, checking [12:20:02] <_joe_> effie: let's try to move everything to 7.4 [12:20:05] we also have KubernetesDeploymentUnavailableReplicas [12:20:18] for mw-parsoid.codfw.main [12:20:24] <_joe_> sigh [12:20:54] I can look at the gateway [12:21:05] thx hnowlan [12:21:35] shouldn't be related [12:21:39] <_joe_> ok so [12:21:48] <_joe_> looks like parsoid in codfw is struggling [12:21:52] yes [12:22:04] volans: can you take over IC for me? [12:22:22] jynus: sure,what are the current assignments of people doing what? [12:22:28] re: parsoid, that could be caused solely by es being overloaded, right? [12:22:35] hnowlan: is looking at he gateway [12:22:35] <_joe_> claime: I think so yes [12:22:47] manuel is looking graphs for ES an overal impact [12:22:48] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/112750 hnowlan +1 ? [12:22:59] Amir is looking to disable a job [12:23:02] I'm just popping off ideas [12:23:25] <_joe_> effie: bad paste? 
[12:23:27] effi and joe are checking to restart some pods [12:23:30] understanding what's cause and what effect would help a lot to exclude focus our efforts on the effect-things [12:23:47] thx jynus [12:23:57] _joe_: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/1127498 yes [12:23:58] * volans becomes IC [12:24:01] <_joe_> volans: something is overloading es, I would like to know which IPs are flooding it [12:24:31] _joe_: I can give you the list of IPs if you want [12:24:41] <_joe_> marostegui: or hostnames if you get them [12:24:42] I produced that list already [12:24:48] <_joe_> Amir1: where is it? [12:24:49] even dig I them [12:24:53] cumin1002 [12:24:54] my home [12:24:58] <_joe_> effie: that's not the right patch either [12:24:58] Deploying mw-jobrunner [12:24:59] show_processlist_es2037 [12:25:10] <_joe_> anyways, thanks [12:25:10] result of dig is this [12:25:23] https://www.irccloud.com/pastebin/zaEvBSur/ [12:25:27] <_joe_> Amir1: use phaste next time :) [12:25:35] I will! [12:26:29] _joe_: yeah :/ [12:26:30] https://phabricator.wikimedia.org/P74219 [12:27:25] thx added to the doc [12:27:39] _joe_: a new list: https://phabricator.wikimedia.org/P74220 [12:29:48] <_joe_> are we seeing any improvements? [12:29:49] api-ext might be traffic patterns? [12:29:52] Jobrunners are all reverted to 7.4 [12:29:54] I'm restarting prometheus mysqld exporter [12:29:59] on es2040 [12:30:06] _joe_: no on the es [12:30:17] <_joe_> I see parsoid is back to "working as expected" [12:30:28] <_joe_> no errors and latencies down [12:30:42] _joe_: still serving lots of 500 and 503 [12:30:43] it has been slowly going down for twenty minutes so it's hard to say the wind is blowing over or the changes are helping [12:30:43] <_joe_> same for mw-api-int [12:30:58] <_joe_> marostegui: where are you looking? [12:31:07] _joe_: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1 [12:31:14] https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=now-3h&to=now&viewPanel=5 [12:31:21] <_joe_> ah yes that's slightly slow to catch up [12:31:32] (slowly subsiding starting 12:15 UTC) [12:31:37] _joe_: And regarding es: https://grafana.wikimedia.org/d/d251bef4-d946-4bea-a8a5-b02a3546762e/mariadb?orgId=1&refresh=1m&var-job=All&var-server=es2037&var-port=9104&from=now-6h&to=now&viewPanel=3 [12:31:48] volans: do I update status page to monitoring? 
[12:32:07] let's wait a sec [12:32:17] <_joe_> yeah let's wait 3-5 minutes [12:32:44] <_joe_> so my hypothesis is that some job - maybe that maybe another [12:32:53] <_joe_> when running on php 8.1 does cause this effect [12:33:14] <_joe_> yeah errors are gone from the backends in mediawiki [12:34:29] when we looked at the traces (sampled) that are jobs and query es2037 during the outage, only that showed up, maybe it's just because it's high traffic job that also query this db [12:34:43] es2037 still not showing recovery [12:34:46] <_joe_> yeha that's possible [12:35:11] https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=SqlBlobStore_blob&orgId=1 - the miss ratio tracks with the outages here, so either we're requesting blobs for revisions that aren't in the cache, or we mangle the cache key somehow [12:35:40] but I haven't seen any other job doing any query to the es hosts we checked [12:35:41] <_joe_> mszabo: I suspect the latter [12:36:44] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/EventBus/+/1127017 would be neat here since currently we can't really attribute traces to a specific job unless they've done us the favor of issuing an attributed DB query from run() [12:36:53] SqlBlobStore_blob is memcached? [12:36:57] 1M cache miss hit per minute, lovely [12:36:59] volans: yes [12:39:53] <_joe_> now I don't know where we take the data here https://grafana.wikimedia.org/d/a97c66ff-0e10-4d2a-b9e1-37b96b7b4d35/parser-cache-misses?orgId=1&viewPanel=1&from=now-3h&to=now seems that it's "miss-redirect" in Parsercache [12:40:00] <_joe_> whatever that might mean [12:40:14] _joe_: https://grafana.wikimedia.org/d/000000503/varnish-http-errors?orgId=1 recovered [12:40:32] marostegui: monitoring then now? [12:40:32] <_joe_> marostegui: yeah as I told you, it's just slightly lagged [12:40:38] jynus: still bad [12:40:47] oh [12:40:49] <_joe_> marostegui: not user-facing bad though [12:40:49] jynus: es ^ [12:41:00] <_joe_> so I agree with jynus, we're monitoring [12:41:07] _joe_: Yeah, but es still way out of its normal values [12:41:18] this is just for the status page for users [12:41:20] <_joe_> marostegui: hence "monitoring" from the prespective of users [12:41:26] I think we should keep the ticket UBN open [12:41:28] yep I agree [12:41:33] +1 for monitoring [12:41:33] <_joe_> yeah 100% [12:41:40] we're not actively doing anything at this point [12:41:52] <_joe_> volans: well we might soon :) [12:42:06] also if you have some work to do, talk to managers so they priorize this [12:42:32] (some other work scheduled, I mean) [12:42:41] are we ok to update the status page with monitoring? [12:42:57] I'm gonna tag out, sorry folks [12:43:58] GatewayBackendErrorsHigh: api-gateway: resolved [12:44:07] <_joe_> volans: yes we are [12:44:07] hnowlan: did you do anything specifically? [12:44:30] * volans doing [12:45:03] still degraded performance or can I put operational? [12:45:15] es2037 seems to be doing better, still wait out of its normal values, but seems to be doing much better now [12:45:55] as an actionable, who could we ask to have a look at logstash (observability?) [12:46:05] so at a glance the cache keys don't seem off, this is a sample from ~3mins ago: "global:SqlBlobStore-blob:frwiki:es%3ADB%3A//cluster25/19854695?flags=utf-8,gzip" [12:46:22] <_joe_> so looking at parservache hit rate, it's back to normal for wikitext according to the dashboard. 
The only cache that seems to be affected is "parsoid_pcache" [12:46:27] volans: no, the spike isn't abnormal for the service but it being sustained caused the page I think. [12:46:38] volans: parsoid is still serving some errors, I am looking into it [12:47:06] <_joe_> but ES is still high usage right? [12:47:12] _joe_: yes [12:47:13] <_joe_> Amir1: let's try to kill that job [12:47:23] 🗡️ [12:47:51] merging [12:48:04] which one are youkilling now? [12:48:05] <_joe_> we won't know if that helps though [12:48:13] categorymembership [12:48:14] <_joe_> I see cache hit ratios reccovering [12:48:17] thx [12:48:47] we still have high signal with the ES load, to some extent [12:49:03] it is not trending down [12:49:03] <_joe_> jynus: is it slowly recovering? [12:49:09] see my last comment [12:49:21] <_joe_> yeah we wrote at the same time [12:49:28] https://grafana.wikimedia.org/goto/kyWZ1d2NR?orgId=1 [12:49:34] _joe_: we have 1 datapoint per minute in the mariadb dashboard [12:49:37] so give time :) [12:49:52] <_joe_> volans: not really, no [12:49:57] <_joe_> Amir1: let's try your patch [12:50:02] but I think mw stack recovered 10 minutes ago [12:50:08] going forward [12:50:25] I'm not saying give it time to recover, but give it time after a change to see the effect ;) [12:50:27] eqiad ongoing, codfw next [12:50:41] volans: sorry, I missunderstood, sure! [12:51:24] I was referring to 5xx -> ES graph [12:52:33] deployed [12:54:43] thx [12:55:01] waiting for grafana :D [12:55:17] yeah :( [12:55:44] <_joe_> grafana struggles at times yes [12:55:53] <_joe_> maybe we can ask to look into it [12:56:04] <_joe_> (I think it's thanos rather than grafana, but still) [12:56:23] I'd love to have like 15s datapoints [12:56:25] prometheus latency I think is normal, it groups sometimes with 5m aggregation [12:56:39] so that's ok [12:57:52] but I wonder if logstash got overloaded, of the errors caused extra strain on the app servers for logs to take more time to appear on logstash [12:57:59] *or [12:58:28] I don't see effects yet on the es2037 graphs [12:59:03] yeah, still the same [12:59:10] show processlist is much smaller now [12:59:30] another possibility is that there is one script that's overwhelming everything [12:59:39] (one connection) [12:59:51] Amir1: would that connection use wikiuser or wikiadmin? [13:00:07] I'm not seeing any wikiadmin in show processlist [13:00:27] quick question as I haven’t been following this discussion – is this deployment-blocking? [13:00:29] if it's something that opens and closes the connectiosn very fast we might not see it right? [13:00:30] yeah, I didn't see any yesterday. But as you said script, I was wondering if that'd use wikiadmin [13:00:31] I have a script that is actually checking blob of every revision of all wikis up to 2007 but that should be wikiadmin [13:00:59] (and definitely not commons or wikidata) [13:01:04] (sorry, nevermind, deployment window is in an hour not now) [13:01:50] the script is actually over now [13:01:59] (for days I guess) [13:02:51] [14:00:55] <+icinga-wm> PROBLEM - SSH on gerrit2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:02:54] is that something used? [13:03:04] used? [13:03:12] like is that host in use? [13:04:15] dunno, has role gerrit in puppet, we have 1003,2002,2003 [13:04:36] git works for me so not the active one I'd say [13:04:47] I will create a task [13:05:10] still nothing on the various graphs [13:05:13] what are next steps? [13:05:17] ideas? 
[13:06:33] I have none volans [13:07:38] <_joe_> please restart that job if it didn't help [13:07:51] <_joe_> I fear the problem will self-solve in a couple hours [13:08:01] why? [13:08:23] The only thing I see standing out, is parsoid stil serving some errors [13:08:47] might be a set of pages that are particularly hard to parse or load from ES? [13:09:03] parsoid mw errors: https://logstash.wikimedia.org/goto/ec07723df325378da6b2b4c48d59c1d8 [13:09:31] dcausse: not on all sections at the same time [13:09:49] <_joe_> anyways, I have to step away from the incident, sorry [13:10:06] <_joe_> marostegui: I fear the problem has been a swath of cache invalidations [13:10:26] Looking at it, it is the circuit breaking which is doing its job I reckon [13:10:28] <_joe_> in any case, ping me if nothing improves in 1-2 hours [13:10:35] thanks joe [13:10:44] there's also a spike in gadgets-definitions cache misses in the same timeframe: https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=gadgets_definition&orgId=1 [13:10:54] parsing triggers a lookup for the gadgets list because of the math extension [13:11:18] interestingly though, this should be a low cardinality key [13:11:26] I tried this https://phabricator.wikimedia.org/P74221 [13:11:56] ignoring LoadMonitor [13:11:57] <_joe_> mszabo: the issue seems to be widespread https://grafana.wikimedia.org/d/lqE4lcGWz/wanobjectcache-key-group?var-kClass=backlinks&orgId=1&from=now-2d&to=now [13:12:00] Amir1: in greek please? [13:12:11] wikidata or commons seems to be doing a lot of queries there [13:12:21] thank you <3 [13:12:53] I don't have a baseline of normal right now to see if anything stands out [13:16:10] are the public facing errors low enough to set the status page to operational while still the incident is in monitoring? [13:16:28] At no time in the last 6 months, ES were so loaded as during those spikes [13:17:02] I was hoping to see a smaller version of that on other deploys [13:17:04] or something [13:17:11] I take a break [13:17:31] I will too [13:25:26] _joe_: there's a spike in memcached errors which seems to explain it - https://logstash.wikimedia.org/goto/2f7eb9e7a3437283b1e4ac423087de17 [13:27:22] if noone objects I'll set the status page to operational (inciident still open in monitoring) as the user facing errors AFAICT recovered [13:28:49] volans: deployers are asking if they can continue. I would defer that decision to serviceops. [13:29:12] I concur [13:30:07] mszabo: I don't know enough, but a mass memcache error leading to misses and producing es overload would fit [13:31:01] it looks quite likely since this total hits to memcached is stable and hasn't changed, if something new triggers this, it should add up at least [13:31:12] (the total should go up) [13:31:37] could the upgrade, or just a deploy alter the memcache config or behaviour to make memcache requests fail? [13:31:50] but this hasn't happened, probably means memcached decided to fail [13:32:13] yeah, same question, I don't know how to correlate it [13:36:07] https://grafana.wikimedia.org/d/000000316/memcache?orgId=1&from=now-6h&to=now&viewPanel=40 [13:36:13] can I get a +/-1 on the above intention for the status page please? 
[13:36:22] +1 [13:36:42] volans: +1 from me [13:37:28] thx [13:39:16] +1 [13:39:31] I need to prepare for the interview, will catch up later [13:40:47] The graph I linked says calls to memcached has reduced, that would explain a lot [13:40:56] it doesn't even try to read from the cache [13:40:58] mszabo: so I have been wondering for quite some time why the amount of errors from the mediawiki pov, is not reflecred on mcrouter's pov [13:41:41] I think I found why, it seems like mw was trying to connect to the default memecached address [13:42:02] memecache is deliberate [13:42:53] and not the mw-mcrouter address, which means that the pods created, something was wrong with mounting the file containing that info [13:43:21] I've put memcache on top of the doc, with an interrogation [13:43:28] once memcached is fixed, we should revert disabling the job [13:43:35] https://logstash.wikimedia.org/goto/a22269d0d2848fb7bddb067945249e15 [13:43:41] Amir1: it is fixed [13:45:28] deploying the revert then [13:45:29] in https://logstash.wikimedia.org/goto/a22269d0d2848fb7bddb067945249e15 , we see mw-web and mw-api-ext having their normal amount of errors [13:45:38] Amir1: let me finish please [13:45:55] sorry, stopped [13:45:59] on a more mw-php-side, not sure if it is ok to treat an error like a miss, which I would suppose it is what is happening here (?) [13:46:32] jynus: please let me finish [13:47:51] so there were many errors due to mediawiki trying to connect to 127.0.0.1 [13:50:11] For now I am looking at this pattern of parsoid failures https://logstash.wikimedia.org/goto/ff769b28684358cbd696cfc8a6e7296b, which seems to have stopped [13:51:29] I will dig deeper, but looking gat the memcached traffic dropping during the outage, and the amount of errors, we were flying without memcached [13:51:38] effie: are you suggesting that the 8.1 version has this address bug and the old one not? [13:52:23] so when deploying we start flying basically without memcached and all this happens? [13:52:52] first lets say that memcached and PHP 8.1 in mw-web and mw-api-ext are doing great, unless there is somethinhg I am not seeing, so please let me know if you see otherwise [13:53:40] yeah, it seems to be only these three cluster: jobrunner, api-int, parsoid [13:54:17] secondly, this is not a php8.1 and memcached issue, to by understanding so far, but take this with a grain of salt [13:54:56] I have to go to an interview, I will hold off reverting the job change for now since that could affect us [13:54:57] I'm trying to understand in your scenario 1) what triggers the change and 2) if we're back in a good state why ES doesn't recover, like it has to re-create all the misses back into the caches [13:55:04] the variable holding the memcached address is in a file tha we mount on the container [13:56:26] volans: let me check a couple of things and I will get back to you on that [13:57:10] k [13:58:25] Until when are we ok waiting to see if ES hosts recover? [13:58:46] We should probably try to set a deadline and start a plan of action if they've not recovered by that time [13:58:53] es2037 shows no trace of recovery [14:01:02] \o [14:01:10] marostegui: but has any user-facing impact? 
[14:01:58] <_joe_> mszabo: "item too big", UHHHH [14:01:58] or can hold it for a while like that, we're just with less room for more traffic [14:02:10] _joe_: yeah but it's a bit low volume [14:02:36] volans: It doesn't no, but there's clearly something that has radically changed and I don't think it can be ignored [14:02:39] I see memcached failures only on parsoid on codfw [14:02:49] so I will restart those pods if there are no objections [14:02:50] FYI backport+config window has started in #wikimedia-operations, let me know if we shouldn’t be deploying right now… [14:02:51] sry meeting [14:02:58] absolutely, I'm just trying to nderstand if few hours is acceptable or not [14:03:14] volans: I think so yeah, but it is easy to forget [14:03:17] Lucas_WMDE: We would like you to hold on for now [14:03:28] ack [14:03:38] _joe_: any objctions? [14:04:00] <_joe_> effie: no I agree [14:04:16] alright, volans I am restarting the pods on parsoid [14:04:22] k [14:04:26] (FYI nothing happened from my side, you stopped me well before the change had merged and scap started doing stuff) [14:04:43] effie: codfw only? [14:05:48] yes [14:08:01] I think that is the problem [14:08:03] https://usercontent.irccloud-cdn.com/file/1P8Er5c9/image.png [14:08:49] <_joe_> effie: ? [14:09:07] I see some recovery on ES cluster, both on the specific graphs and the aggregate ones [14:09:25] <_joe_> effie: have you done anything? [14:09:25] volans: correct [14:09:31] effie: there's an important fix scheduled for this deploy window (visual editor is broken), do we think we'll be able to deploy later today? [14:09:33] but we got GatewayBackendErrorsHigh: api-gateway: elevated 5xx errors from rate_limit_cluster [14:09:40] hnowlan: did you find anything earlier? [14:09:49] kamila_: hopefully yes [14:09:55] thanks volans <3 [14:09:57] but not "right now" [14:10:00] ack [14:10:00] <_joe_> kamila_: not for effie, but the answer is "we'll see" [14:10:35] es2037 has recovered [14:11:08] <_joe_> logstash keeps timing out [14:11:16] _joe_: my first guess is that, we started rolling our php7.4 images mounting the wrong vars file [14:11:16] <_joe_> it's a bit hard to debug stuff this way [14:11:26] anyone from o11y here? [14:11:30] <_joe_> effie: uhhh why? [14:12:14] _joe_: complete guesswork, either the original patches for the rollout have an error we have missed [14:12:43] <_joe_> effie: in any case, please let's try re-rolling out your changes one by one, and maybe in more steps [14:13:09] I will have another go with scott during the late window [14:13:38] I suspect there is also something not ok with merging the fuckton of yaml [14:15:07] I think that also means that the incident report for yesterday and today, will have as many pages as crime and punishment [14:15:11] <_joe_> effie: that does not look probable to me [14:15:26] _joe_: I said complete guesswork didnt I [14:17:05] marostegui: is ES ok now? [14:17:24] fully recovered [14:17:39] effie: yes [14:17:41] <_joe_> did it recover as a consequence of something we did? [14:17:47] <_joe_> or just recovered by itself? [14:17:49] yes restart of parsoid pods on codfw [14:18:07] _joe_: I have no idea [14:18:12] it was the parsoid restart [14:18:14] <_joe_> didn't we *already* do it as I recommended for every deployment? [14:20:51] this is on me, going up the SAL, parsoid was never restarted because we also added more pods on parsoid [14:21:38] :/ [14:27:36] <_joe_> can we let people deploy? 
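What "flying without memcached" means mechanically: if the file that carries the mcrouter address is mounted at a path the code does not read (the guess above is a wrong vars-file mount), a naive loader falls back to a default address nothing listens on, every cache call errors out and is treated as a miss, and every read lands on the databases. A simplified illustration, not MediaWiki's actual configuration code; the file path and the fallback value are assumptions.

```python
import os

MCROUTER_ADDRESS_FILE = "/etc/mediawiki/mcrouter_address"  # assumed path, not the real one

def memcached_address() -> str:
    """Return the cache address, silently falling back to localhost if the mount is missing."""
    # If the image expects the file at one path but the pod mounts it at
    # another, this branch is never taken...
    if os.path.exists(MCROUTER_ADDRESS_FILE):
        with open(MCROUTER_ADDRESS_FILE) as fh:
            return fh.read().strip()
    # ...so every client connects here, where nothing is listening: each lookup
    # becomes an error handled as a miss, and the load shifts to the ES hosts.
    return "127.0.0.1:11211"
```

A stricter loader that refuses to start (or at least emits a metric) when the expected file is absent would have surfaced this at deploy time instead of as database overload.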
[14:27:44] dotting ts [14:27:50] and crossing is [14:28:20] <_joe_> yeah we're on the clock with letting them deploy though [14:28:30] <_joe_> volans: aren't you IC? [14:28:59] _joe_: yes and we deferred to serviceops earlier above to decide for the deploy ok/ko [14:29:08] from what I can see I think we're ok to let the deploy go [14:29:10] <_joe_> ok so let's go. [14:29:16] and I would also resolve the incident [14:29:34] y [14:29:39] yes [14:30:07] <_joe_> yes please [14:30:23] done [14:30:45] thanks y'all [14:31:14] effie: can I leave it to you to write some lines at the top of the document to summarize the last discoveries? [14:31:24] volans: I will write the whole thing [14:31:37] <3 [14:31:56] I will come back to and just put a really short version, but I will write the report and all next week [14:32:39] I think 4-5 lines at the top are enough for now to clarify the source of the issue yes [14:32:52] no need to write a book :D [14:37:54] <_joe_> did we restart the job amir disabled? [14:38:28] no amir didn't restart it because he had to go into an interview was planning to restart as soon as he's back [14:38:55] and we still have: api-gateway: elevated 5xx errors from rate_limit_cluster in eqiad that didn't recover [14:39:06] hnowlan: any finding on that from before? [14:40:17] volans: it deserves a post morten doesnt it [14:41:34] I said "for now" [14:41:38] ;) [14:48:40] volans: nothing yet, service is very quiet. I'll try to get something out of the logs [14:48:50] it's still functioning so it's at worst noise [14:49:00] paging noise :) [14:49:11] I was trying to follow wikitech advice but I can't only see some notice logs [14:49:30] if I filter out the notices I get nothing [14:49:32] oh, yeah, shouldn't page at all :| [14:50:05] sadly with envoy the options are between base access log or debug firehose [14:50:26] weird that it's coinciding with these issues though, can't see any issues with redis right now although I do see a drop in connections when the initial issues hit [14:51:31] we're actually seeing timeouts hitting the rate limit service itself [14:51:35] the dashboars says 5xx, can't we filter by 5xx on logstash? [14:51:44] I'm failing to see the field there [14:51:48] yes [14:51:51] it's in the page message [14:51:53] rate_limit_cluster [14:54:17] I'm going to roll_restart the cluster [14:54:34] ack [14:54:34] just to stop the pages hopefully [14:54:43] <3 [14:57:37] inflatador: when you get a chance, for your eyes https://gerrit.wikimedia.org/r/c/operations/puppet/+/1126486 [14:59:06] godog: FYI there were some questions for o11y in the backlog during the incident [14:59:23] volans: hah! thank you, reading now [14:59:42] I somewhat suspect these aren't timeouts to the rate limit service at all, they're timeouts caused when trying to connect to a service backend that uses rate limiting [15:00:22] and we don't get the error at the service level bur the rate limiter? 
[15:01:27] there are errors at the service level also, they trend almost exactly like the rate limiter which is what's making me suspicious [15:02:01] https://grafana.wikimedia.org/goto/RIUYUdhNg?orgId=1 [15:02:27] you think the lw_inference_reference_need_cluster [15:02:31] yeah [15:02:34] I've brought it up with ml [15:02:57] if I pick 2 days they have fairly different graphs [15:03:57] (page just resolved fwiw) [15:04:00] it's a little weird that the alert is for rate_limit_cluster but the errors are higher for lw_inference_reference_need_cluster [15:04:08] I'll silence it if it comes back [15:04:25] hnowlan: thealert has [x3] or [x2] that means more than one firing but aggregated [15:04:34] and AM picks opne of the names [15:04:47] yes it's confusing [15:04:56] the wikifeeds alert from earlier was legit btw [15:10:31] <_joe_> wikifeeds was a side-effect of the mw outage [15:10:54] yeah [15:11:46] Luca has bumped resources on the affected lw cluster, rate limit errors are down (which... I don't quite get but I'll take) [15:12:08] need to get better metrics out of that thing, they've added prometheus support [15:15:15] FWIW re: grafana/thanos overload, related task is T385693 and there's now enough data in the missing recording rule that we can begin swapping them in dashboards [15:15:15] T385693: thanos-query overload due to heavy queries - https://phabricator.wikimedia.org/T385693 [15:15:26] hnowlan: I don't have a lot of context but the ml-serve clusters do have some per-istio-sidecar rate-limit in place, it should be relatively high but not sure if it played a role (I haven't touched it in a while) [15:15:45] godog ACK, just added +1 [15:15:48] re: logstash timeout, WFM now though I'd imagine possibly some heavy queries there too? not 100% sure [15:15:51] inflatador: thank you! [15:16:16] np [15:17:49] would y'all find it useful to expose memcached errors encountered by mediawiki as a metric? [15:18:41] elukey: in this case we're seeing the errors from the ratelimit service inside of the api-gateway, which is a little baffling as it should just be querying redis via nutcracker [15:18:57] ahhh okok sorry then I'll shut up [15:19:09] nono, useful info! [15:21:19] both gateway service errors trending in a good direction, phew [15:21:36] ticketed work to get better insight/a more recent version of the rate limit service [15:40:42] back from the meeting. Sorry, kamila_ [15:41:12] hnowlan: paged again FYI, this time just wikifeeds_cluster [15:41:22] can I help? [15:41:30] and we have a big spike like earlier today [15:42:58] np jynus [15:44:11] status: e.ffie++ figured out the es overload problem [15:44:40] kamila_: what was it? [15:44:47] looking volans - probably a real problem [15:45:00] https://grafana.wikimedia.org/goto/9aJMudhNR?orgId=1 [15:45:05] So I can update the doc [15:45:57] /etc/php8.1 vs /etc/php7.4 in the path in the image; e.ffie said she'd update the doc next week [15:46:09] ok, good that it was found [15:46:33] thanks, just trying to catch up with the latest developemnts, you can go back to your normal work [15:46:37] taking back oncall [15:47:09] looks like there was a spike in wikifeeds errors on the service also [15:48:01] and a big spike in requests [15:48:35] I've put the latest update on top and removed my comment [15:49:41] if the error rate follows the request rate than there is nothing "broken" more than usual, just more traffic hence more errors [15:49:56] are we alerting on absolute values or rate of errros? 
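On the absolute-versus-ratio question: one quick way to see how a threshold behaves is to pull both numbers from Prometheus/Thanos and compare them. A rough sketch against the standard HTTP query API, with a placeholder endpoint; it reuses the envoy upstream-request metric for these gateways, so the exact label values may need adjusting.

```python
import requests

PROM = "http://thanos-query.example.org"  # placeholder endpoint

def instant(query: str) -> float:
    """Run an instant query and return the first value, or 0.0 if the result is empty."""
    resp = requests.get(f"{PROM}/api/v1/query", params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

SELECTOR = 'kubernetes_namespace=~"(api|rest)-gateway"'
errors = instant(f'sum(rate(envoy_cluster_upstream_rq{{{SELECTOR},envoy_response_code=~"5[0-9]+"}}[5m]))')
total = instant(f'sum(rate(envoy_cluster_upstream_rq{{{SELECTOR}}}[5m]))')
if total:
    print(f"5xx/s: {errors:.2f}  error ratio: {errors / total:.2%}")
else:
    print("no traffic seen")
```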
[15:50:31] <_joe_> mszabo: we used to have that metric and to alert on it [15:51:30] hnowlan: yeah the traffic and error rate follow the same pattern, so I guess we alert on absolute 5xx/s and not in % of the requests hence the page. [15:52:05] those 5xx is wikifeeds? [15:52:12] yes [15:52:22] thanks, I think I finally landed [15:53:07] https://grafana.wikimedia.org/goto/Tdbbud2Ng?orgId=1 upstream "row" (close the other) first 2 graphs, select wikifeeds_cluster in both [15:54:14] the alert uses `sum(irate(envoy_cluster_upstream_rq{kubernetes_namespace=~"(api|rest)-gateway", envoy_response_code=~"5[0-9]+"}[5m]))`, but the threshold is 5 which is a little low I guess [15:56:34] yeah, looking at the graphs, the error spikes should be enough to "notify the app owner" but not enough to p* us, it is just 5 errors per second [15:56:50] IMHO [15:58:19] yeah [16:18:21] FYI, I believe we've figured out what the issue was during the earlier 8.1 migration attempt, and are planning to try again at some point in the next 1h40m [16:18:39] cwhite: urandom: FYI ^ [16:18:48] e.ffie and I will keep you posted [16:18:57] or the pages will :D [16:19:02] *pager [16:19:03] lol [16:19:14] hopefully not [16:19:18] :) [16:19:41] :D [16:34:19] cwhite: logshash has been quite slow today, but even after the incident, are you aware of anything going on ? [16:41:01] Puppet is broken since three hours on the deployment servers, does that ring a bell to anyone? [16:41:03] Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Operator '[]' is not applicable to an Undef Value. (file: /srv/puppet_code/environments/production/modules/profile/functions/kubernetes/deployment_server/elasticsearch_external_services_config.pp, line: 32, column: 21) on node deploy2002.codfw.wmnet [16:41:48] moritzm sounds like it's my fault, let me check [16:44:42] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1125234 removed the last host with role(elasticsearch::cloudelastic)...that may have affected the external services list. brouberol does that seem plausible? [16:45:24] I think it's ok. I don't think that external service is being used anywhere [16:45:32] let me check [16:47:11] a quick search in deployment-charts yielded no match for ES external services [16:51:35] I'm just wondering if the puppet code matched on cloudelastic (which still has some hieradata related to elasticsearch), but then returned an empty list for cluster members since there are no elastic hosts anymore. Maybe something like that is confounding Puppet [16:55:55] brouberol would you mind if we remove elasticsearch_external_services_config.pp, since no one is using it? We can replace it with an opensearch config once our migration is done [16:59:45] sure [17:00:20] cool, let me get a patch started [17:06:14] inflatador: please sort it out asap, we are about to deploy [17:07:46] it is blocking us [17:08:12] effie I get that. I whipped up a patch, LMK if it looks good https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127560 [17:09:25] looks like it needs more work. revising... [17:11:11] inflatador: can we revert the original patch ? [17:11:50] effie: we're still trying to consume the queue from the incident earlier today. millions of copies of `Memcached error: SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY` [17:13:45] looks like eqiad finished consuming the backlog a few minutes ago. 
searches will still be a bit slower due to the larger-than-normal indexes [17:13:57] alright thank you very much for the update [17:14:30] looking at the rate limit page :( [17:16:05] effie I'd prefer to roll forward, as this code needs to be removed regardless or it'll be a time bomb. PCC is happy now, I can merge if that works for y'all [17:16:31] inflatador: the PCC diff shows *a lot* being removed [17:16:57] hnowlan: thanks! lmk what you find out in case it fires after-hours [17:16:57] as scott said, the diff is overwhelming [17:17:07] swfrench-wmf yeah, it looks like it's removing every single IP for every single Elastic hosts...which is ~120 servers [17:17:22] It looks OK to me [17:18:08] just to confirm, this is because no k8s service actually uses these? is that correct? [17:18:15] keep in mind that no one is actually using this [17:18:50] I think the problem arose because we migrated 100% of cloudelastic to the opensearch role earlier today, and the query used by puppet started returning empty for cloudelastic [17:19:20] we **can** rollback if y'all prefer, but this is just going to happen again as we continue our migration [17:19:29] inflatador: it's the "no one is actually using this" part that I'm trying to confirm :) [17:20:17] swfrench-wmf Good point. I think a rollback might be more prudent [17:20:22] Let me get that started [17:20:22] inflatador: we would be very grateful if you would please revert, which would allow us to move forward with our scheduled migration [17:20:43] and we promise, we will be delighted if you'd pick this up afterwards :) [17:21:58] effie ACK, reverted via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1127565 [17:22:28] ah, I think this is what brouberol was saying above - i.e., no use of these external services endpoints lists by any existing k8s service [17:22:59] swfrench-wmf MW itself doesn't use this, does it? [17:24:36] it sounds like no, but that's the only service or services I'd really worry about [17:24:54] AFAICT, no [17:25:12] same here, I'd expect it to use the LVS endpoints [17:25:46] mostly as an artifact of the external_services not yet being adopted for network policies in MW [17:25:50] :) [17:26:01] LOL [17:26:33] also, elastic is one of the last (maybe the last) not to use Envoy for TLS termination. We'll probably start work on that once we're on Opensearch [17:28:35] inflatador: you have puppet-merged that revert, correcT? [17:28:49] inflatador: I will run it ok [17:28:54] ? [17:29:04] inflatador: ? [17:29:48] I merged it [17:30:38] swfrench-wmf effie sorry, forgot to puppet-merge ;( [17:32:08] inflatador: it is still producing an error I am afraid [17:32:26] I am running puppet again, but still [17:34:23] effie is it the same error? [17:35:01] yes, may needs to run puppet on other hosts? [17:35:52] inflatador: I'm wondering if, given the way the PQL in the external services "builder" logic works, implies that cloudelastic1012 needs to have a puppet run [17:35:56] effie I can try, but it will definitely err when I run against cloudelastic1012 [17:36:09] I am running puppet there now [17:36:18] I just started it [17:36:41] it will try and install elasticsearch on top of opensearch, which should be...interesting ;P [17:37:29] don't foresee any problems besides a puppet failure, but I'm banning it just in case [17:43:00] effie how's it going? I'd expect Puppet to fail on cloudelastic1012, but maybe just running it will create the structure needed for Puppet to succeed on the deployment server? 
[17:43:11] inflatador: running puppet on 1012 moved the deployments puppet to move forward [17:43:41] effie excellent, hit me up when it's done and we'll work on getting rid of that code permanently [17:44:41] inflatador: beats me [17:44:43] https://usercontent.irccloud-cdn.com/file/n6Cl9OzX/image.png [17:45:49] WOW! that ran without any errors? Was not expecting that [17:46:21] Still gonna have to reimage, as it will be a disaster whenever a service tries to restart, but LOL [18:26:50] inflatador: deployment is done, so I think you're good to move ahead with retiring the external services support for elastic search, and then the role switch / reimage [18:27:41] swfrench-wmf thanks for the update, will move fwd and keep y'all posted [18:27:52] ack, thanks! [18:27:58] status update on the PHP 8.1 front, this is now done - no issues encountered this time around [18:28:20] I'll post a summary to the incident doc shortly [18:30:06] merged/running puppet on deploy2002 now [18:40:12] deploy2002 puppet run says `Error: Could not send report: Error 500 on SERVER: Server Error: Could not autoload puppet/reports/logstash: Cannot invoke "jnr.netdb.Service.getName()" because "service" is null` . I'll run again, but if anyone has ideas LMK [18:40:59] this is "expected" https://phabricator.wikimedia.org/T388629 [18:41:10] independent of any issues you might be seeing, that is [18:41:41] Ah, thanks sukhe . In that case, everything seems good [19:56:47] For those of y'all who did the appserver-to-wikikube reimages, how often did y'all run into reimage failures? I ask because we're about to reimage ~110 hosts, and out of the 8 we've done so far, 4 have had issues that seemingly went away when I updated the firmware. Just wondering if it's worth the effort to proactively update firmware on all the hosts [19:58:25] I didn't do any of the wikikube reimages per se, but I've run into my fair share that failed until I was on an updated firmware [19:59:14] you're at 50%, that sounds about right :) [20:01:25] dear SREs, has the root cause for today's incident fully addressed? I want to re-enable the job I disabled today since this will have impact on editors [20:01:30] I had to downgrade firmware for 80 of our hosts a couple years back, so I might dust off this playbook and see if I can get it to work again https://gitlab.wikimedia.org/repos/search-platform/sre/stage-firmware-update/-/tree/main [20:28:41] Amir1: yes do so please, the root cause was a deployment that basically had api-int, parsoid, and jonrunnings running without memcached [20:29:22] yeah, I wanted to make sure that is fixed for good before adding load [20:29:23] the memcached errors alerts didnt alert us today either [20:30:39] I promised volan.s that I would right something on the doc, but by the time scott and I were done with the 8.1 rollout, it was 8pm here [20:31:46] Amir1: it has been fixed since 14.20 UTC or so [20:32:14] I am off [20:32:33] enjoy your evening [20:50:43] inflatador: I don't recall having firmware version issues, though I had others [20:52:27] kamila_ ACK, I've had repeated problems with the host hanging at the PXE boot screen, did you have any problems like that? FWiW I've been reimaging R450s so far [20:53:17] only with 10G NICs, not on any of the "normal" appservers [20:56:45] Amir1: scott added a tl;dr :) [21:28:40] All our hosts have 10G NICs :S
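On the firmware question at the end: before deciding whether to proactively update ~110 hosts, it can help to first inventory what each machine is actually running, via the management controllers. A rough sketch against the standard Redfish firmware-inventory endpoint, assuming Redfish-capable BMCs (e.g. iDRAC) and placeholder credentials; the linked playbook or the usual fleet tooling would be the proper home for something like this.

```python
import requests

def firmware_inventory(bmc: str, user: str, password: str) -> dict[str, str]:
    """Return {component name: version} as reported by a Redfish BMC."""
    session = requests.Session()
    session.auth = (user, password)
    session.verify = False  # mgmt-network BMCs often use self-signed certs
    base = f"https://{bmc}/redfish/v1/UpdateService/FirmwareInventory"
    inventory: dict[str, str] = {}
    for member in session.get(base, timeout=30).json().get("Members", []):
        item = session.get(f"https://{bmc}{member['@odata.id']}", timeout=30).json()
        inventory[item.get("Name", "unknown")] = item.get("Version", "unknown")
    return inventory

# Example: compare the BIOS entry across the fleet and only schedule updates
# for hosts that differ from a version known to reimage cleanly.
```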