[06:42:59] <_joe_> Krinkle: I'm not sure what you are trying to measure, but remember prometheus is pull not push
[06:43:43] <_joe_> so what ends up in the timeseries will be marked at the time of the pull, for all labels
[06:45:20] <_joe_> I'm not sure I understand what your problem is, because reading what you wrote above, I'd say you're looking at the problem from the wrong angle
[10:25:40] Hi! I am suddenly getting 400 errors for all REST API calls in my spamcheck tool. E.g. for https://en.wikipedia.org/api/rest_v1/page/html/Veljko_Ra%C5%BEnatovi%C4%87/1212334420?redirect=false
[10:25:49] Is there an outage?
[10:26:02] https://phabricator.wikimedia.org/T359509
[10:34:56] There seems to be an increase in 400s: https://grafana.wikimedia.org/d/t_x3DEu4k/parsoid-health?orgId=1&refresh=15m&from=now-3h&to=now
[10:36:03] also: https://grafana.wikimedia.org/goto/ES666h0Ik?orgId=1
[10:37:07] Tagging the content-transform team on this.
[10:37:52] https://grafana.wikimedia.org/goto/wEoZRTAIz?orgId=1
[10:38:04] ihurbain: nemo-yiannis: you around?
[10:38:08] issue started around 9:12 UTC
[10:38:14] hey claime
[10:38:57] jynus: that's around when the train ran
[10:39:10] anything shipped in the train this morning wrt parsoid?
[10:39:10] yeah
[10:39:36] we've had a low level of 400s from yesterday actually
[10:39:43] and it just jumped with the train
[10:39:47] jnuche: ^^
[10:41:03] let's also ping eoghan and XioNoX so they are aware
[10:42:00] I'm around if a train rollback is needed
[10:43:21] jynus: thanks!
[10:43:50] I think it is jnuche, we're serving only 400s from parsoid right now
[10:44:07] rolling back
[10:46:40] do we have a ticket for this or should i file one ?
[10:47:04] looks like the ticket is https://phabricator.wikimedia.org/T359509
[10:47:05] nvm: https://phabricator.wikimedia.org/T359509
[10:47:08] yeah
[10:54:54] <_joe_> the timing of the increases of 400s from parsoid coincides with the train rolling out to group 1 and group 2
[10:55:03] <_joe_> so yes, a rollback should fix the issue
[10:58:04] rollback still running, image build took some time
[10:59:03] as an additional thing, I'm reviewing monitoring, as I saw no related alarms
[10:59:13] for the service owners
[11:00:09] At the SRE level we don't alert on 4xx
[11:01:21] Should probably alert when non-200s are over 200s in volume
[11:01:36] <_joe_> it's tricky
[11:01:42] it is
[11:01:44] <_joe_> but that is probably something we could attempt
[11:02:37] <_joe_> we can probably alert when code >= 400 / code <= 400 ~ 2
[11:02:58] fair
[11:05:38] I added some comments after investigating the issue on the ticket
[11:10:38] yeah, the increase in 400s doesn't worry me as much as the fact that there seem to have been very few 200s (?, I am not sure about this)
[11:11:08] jynus: It's completely upside down
[11:11:24] 11:09:51 Finished Running helmfile -e codfw --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 04m 26s)
[11:11:24] 11:09:57 Finished Running helmfile -e eqiad --selector name=main apply in /srv/deployment-charts/helmfile.d/services/mw-parsoid (duration: 04m 31s)
[11:11:31] 400s should start to go down now
[11:11:41] Yeah 200s are climbing back up
[11:13:21] <_joe_> uhm
[11:13:28] @CountCount can you try now?
[11:13:44] rollback complete
[11:15:40] Or maybe not necessarily make it an alert, but a signal after a deploy (?) I am throwing ideas to try to find the right solution
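(Editor's note: as a rough illustration of the 4xx-vs-2xx ratio alert floated above, here is a minimal sketch that queries the Prometheus HTTP API and compares error traffic to successful traffic. The Prometheus URL and the `http_requests_total` metric/label names are placeholders, not the metrics actually backing these dashboards.)

```python
# Minimal sketch of the "alert when 4xx outweighs 2xx" idea discussed above.
# Assumes a Prometheus server reachable at PROM_URL; the metric name and
# labels below are illustrative placeholders, not WMF's real metric names.
import requests

PROM_URL = "http://prometheus.example.org:9090"  # placeholder

RATIO_QUERY = (
    'sum(rate(http_requests_total{code=~"4.."}[5m]))'
    " / "
    'sum(rate(http_requests_total{code=~"2.."}[5m]))'
)

def error_ratio() -> float:
    """Return the 4xx/2xx request-rate ratio over the last 5 minutes."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": RATIO_QUERY})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    ratio = error_ratio()
    # Roughly the "code >= 400 / code < 400 ~ 2" threshold mentioned in channel.
    if ratio >= 2.0:
        print(f"WARN: 4xx traffic is {ratio:.1f}x the 2xx traffic")
    else:
        print(f"OK: 4xx/2xx ratio is {ratio:.2f}")
```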
[11:15:44] <_joe_> we're having a flurry of requests from the appserver clusters to restbase
[11:15:48] <_joe_> jynus: we're not done
[11:15:53] :-(
[11:16:58] <_joe_> https://grafana.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-origin=appserver&var-origin_instance=All&var-destination=restbase&viewPanel=4
[11:18:33] <_joe_> uhm, and only on bare metal
[11:18:45] that seems to me like a recurring spike: https://grafana.wikimedia.org/goto/c4ZhioAIk?orgId=1
[11:19:02] could be changeprop ?
[11:19:18] <_joe_> nemo-yiannis: in eqiad? nope
[11:19:37] <_joe_> the caller here is the "appserver" cluster, thus the website
[11:19:56] <_joe_> but yes, it's a recurring event
[11:20:00] this should fix the root cause: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009237
[11:20:10] and it used to happen on codfw: https://grafana.wikimedia.org/goto/0s80mTASz?orgId=1
[11:20:45] cc duesen
[11:21:57] @jynus Works again for me. Thanks!
[11:23:43] IMHO the UBN and issue can be considered gone then, followups pending (patch and alerting fixes)
[11:24:17] <_joe_> yes, it doesn't look like that flurry of requests was to parsoid but to mathoid btw
[11:24:21] <_joe_> just time correlation
[11:24:50] T359509 is still blocking the train until a fix can be backported
[11:24:50] T359509: REST API calls suddenly all returning 400 - https://phabricator.wikimedia.org/T359509
[11:26:01] maybe a potential 400s/lack-of-200s-based alert can become a warning so it is noticed by deployers but cannot spam? idk
[11:26:26] I think deployers should mainly weigh in on that
[11:29:19] What is the status with the train? I wanted to restart GitLab in one hour at 12:30 UTC. Is that possible?
[11:30:05] jelto: it's currently blocked at T359509
[11:30:05] T359509: REST API calls suddenly all returning 400 - https://phabricator.wikimedia.org/T359509
[11:34:02] ack, let me know if that conflicts with a GitLab restart at 12:30 UTC
[11:34:32] will keep you posted 👍
[11:35:29] I'm disabling Puppet in codfw and the edges for approx 10m for a puppetserver reboot
[11:46:38] puppet is re-enabled
[12:02:25] inflatador: sre.puppet.sync-netbox-hiera is showing a diff for wdqs1025, expected?
[12:03:49] there is a cookbook stopped (as in ctrl+z'ed) for it :/
[12:04:14] taavi: it's a host being reimaged so go ahead, not a blocker
[12:04:22] but please ping j.clark in dcops about it
[12:04:23] taavi: It's probably a bit early for inflata.dor but I suspect that it's fine. That server is related to work on this ticket: T358727
[12:04:24] T358727: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727
[12:05:22] thanks. so that's why it said 'releasing expired lock for bking@cumin2002'
[12:05:26] i merged it
[12:06:33] I know that host had some problem so probably multiple reimages, about the lock for the hiera I can check it later
[12:06:36] not sure what was done there
[12:06:56] sorry, have to go afk for a bit now
[12:25:41] jelto: I'm still waiting on a backport fix, so feel free to go ahead with the GitLab maintenance in a bit as scheduled
[12:26:01] great thanks, I'll do that and let you know when I'm done
[12:35:38] jnuche: GitLab maintenance done
[12:35:52] ack, thanks
[13:01:19] There are still some URLs for which I get 400s, e.g. https://pl.wikiquote.org/api/rest_v1/page/html/Baldur's_Gate/565992?redirect=false
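(Editor's note: a hedged sketch of how the endpoints reported above could be spot-checked by hand; this is purely illustrative and not a tool anyone in the channel was running.)

```python
# Quick spot-check of the REST endpoints mentioned above: print the HTTP
# status code each URL returns. Illustrative only; not used in the incident.
import requests

URLS = [
    "https://en.wikipedia.org/api/rest_v1/page/html/Veljko_Ra%C5%BEnatovi%C4%87/1212334420?redirect=false",
    "https://pl.wikiquote.org/api/rest_v1/page/html/Baldur's_Gate/565992?redirect=false",
]

for url in URLS:
    try:
        resp = requests.get(url, timeout=10)
        print(resp.status_code, url)
    except requests.RequestException as exc:
        print("ERR", url, exc)
```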
[13:05:34] CountCount: Polish wikiquote is in group1, which still has the unpatched code
[13:05:41] that should be solved as soon as we can backport a fix
[13:06:09] jnuche: ok
[13:06:24] speaking of which, nemo-yiannis I've put together a backport patch: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009500
[13:06:37] do you know who else we can ping to give it some traction?
[13:11:10] jnuche: i am not very familiar with backporting patches. Isn't this enough? https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009237
[13:11:25] maybe duesen ^ who is the original author
[13:12:04] nemo-yiannis: nope, that patch targets master
[13:12:21] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009500 targets the current train, which is what we need to unblock
[13:13:24] I basically copied https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009237 over, and that's probably ok, but it still needs confirmation
[13:14:22] duesen is already pinged, no worries, I'll wait a bit longer and then send something through the wider-audience channels, thx :)
[13:15:44] (s/what we need to unblock/what we need in order to unblock/)
[13:16:30] I +1 but as i mentioned i am not super familiar
[13:19:40] np, waiting for someone who is more familiar just in case
[13:19:41] thanks 👍
[14:01:44] _joe_: I'm aware it's pull instead of push, but for cgi applications / statsd-exporter the difference is externalised.
[14:02:45] <_joe_> not from the point of view of the collector
[14:02:56] <_joe_> which is where you perform the sum()
[14:03:32] <_joe_> so AIUI, if statsd-exporter behaves correctly, once you send a gauge for a certain metric/set of labels, it should be reported as that value until you observe a new one
[14:13:55] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009500 did not fix the train blocker
[14:14:00] still seeing 400s here https://grafana.wikimedia.org/goto/KZMbFoASz?orgId=1
[14:14:09] and with requests like https://pl.wikiquote.org/api/rest_v1/page/html/Baldur's_Gate/565992?redirect=false
[14:14:17] gonna flag this on slack and the mailing list
[16:02:35] backport of https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1009544 seems to have done the trick
[16:02:50] this now returns no data though: https://grafana.wikimedia.org/goto/KZMbFoASz?orgId=1
[16:03:05] Yeah, i only tested a few manual requests, grafana doesn't work for me either
[16:03:17] claime: ^ is that suspicious or some unrelated issue?
[16:03:38] jnuche: I think it's https://phabricator.wikimedia.org/T343529
[16:03:41] godog you around ?
[16:04:08] I'm still getting metrics from non-k8s prometheus
[16:04:51] I'll bounce it, see if it fixes it
[16:05:02] okie dokie, thx
[16:05:08] this looks healthy enough to me: https://grafana.wikimedia.org/d/t_x3DEu4k/parsoid-health?orgId=1&refresh=15m&from=now-3h&to=now
[16:05:13] gonna continue with the train
[16:07:28] thanks jnuche
[16:07:41] np
[16:21:53] welp, it's not coming back up and I'm out of ideas
[17:01:15] sobanski: are you involved in coordination of the SRE summit? Or if not, is there anyone here who is?
[17:01:58] andrewbogott: yes, I am
[17:02:56] Great! So up until a few days ago we'd hoped that arturo would be there as a wmcs delegate, but it turns out he can't attend. So I'm suddenly very interested in remote participation options :) Is there a place that folks on my team can sign up to get looped in?
[17:08:55] I'll send the doc that has all the details your way
[17:09:47] thank you!
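(Editor's note: to make the statsd-exporter behaviour _joe_ described at 14:03 concrete, here is a minimal sketch of pushing a gauge over the plain statsd line protocol; the exporter then keeps exposing the last observed value on every Prometheus pull. The host, port, and metric name are assumptions for illustration, and any metric-to-label mapping happens in the exporter's mapping config, not in this snippet.)

```python
# Send a single gauge to a statsd_exporter over UDP using the plain statsd
# line protocol ("name:value|g"). Once received, the exporter keeps exposing
# the last value on every Prometheus scrape until a new value arrives.
# Host, port (9125 is statsd_exporter's usual UDP default) and the metric
# name are assumptions for illustration only.
import socket

STATSD_HOST = "127.0.0.1"
STATSD_PORT = 9125

def send_gauge(name: str, value: float) -> None:
    """Emit one gauge sample in statsd line-protocol format."""
    payload = f"{name}:{value}|g".encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (STATSD_HOST, STATSD_PORT))

if __name__ == "__main__":
    send_gauge("spamcheck.queue_lag_seconds", 3.2)  # hypothetical metric name
```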
[17:27:56] jnuche: thanks for sticking this one out!
[17:28:13] I'm about to end my day now. I hope we fixed it for good...
[17:40:48] who/what team thinks about swift these days?
[17:41:21] data-persistence (and me)
[17:41:50] if it's not urgent, I'm about to vanish for some food, though :)
[18:09:03] It's not urgent, I was just going to point you at the ceph doc which you've already seen
[18:17:16] :)
[22:09:39] !incidents
[22:09:40] 4513 (UNACKED) db2124 (paged)/MariaDB Replica SQL: s6 (paged)
[22:09:41] !ack 4513
[22:09:42] 4513 (ACKED) db2124 (paged)/MariaDB Replica SQL: s6 (paged)
[22:17:05] !ack 4514
[22:17:06] 4514 (ACKED) db2124 (paged)/MariaDB Replica Lag: s6 (paged)
[22:22:37] !resolve 4513
[22:22:38] 4513 (ACKED) db2124 (paged)/MariaDB Replica SQL: s6 (paged)
[22:23:15] can't tell if that worked, let me click all the squares with a bus
[22:23:38] resolved both in VO