[02:02:46] * andrewbogott merged it
[03:30:13] _joe_: i'm speculating, but if it's anything like the other (potentially unrelated) networking issues, I guess a repeated 250ms timeout would do the trick.
[03:30:56] not 250ms, that's what envoy-shellbox used. I don't know the timeout here offhand, but afaik we don't know the upper bound of what that timeout truly is, possibly packet loss / infinite.
[03:32:01] (alexandros was looking into that at some point)
[09:13:08] hi all, back from vacation, just going through the backlog; please ping if there is something to prioritise
[09:17:41] welcome back!
[09:17:42] welcome back jbond! codfw is primary (including deploy and mwconfig hosts among other things) ;)
[09:18:19] ack and thanks
[09:19:56] speaking of which, now that deploy1002 isn't active, can Service Ops make sure it gets moved (cc akosiaris but iirc you're on vacation soon) - https://phabricator.wikimedia.org/T308339
[09:20:43] we already missed the shot 5 months ago :)
[09:20:51] <_joe_> XioNoX: it's definitely moved now
[09:21:28] <_joe_> XioNoX: it's not clear to me what I can do to make sure it gets moved
[09:21:47] <_joe_> XioNoX: deploy2002 will stay primary for 6 months though
[09:22:02] <_joe_> so we're not in a rush
[09:22:26] yeah, but seeing what happened last time I'd rather it gets done sooner rather than later :)
[09:23:46] _joe_: multiple options depending on the host, either sync up with dcops, or power it off (or get it into a state where it can be powered off by dcops) and mention it in the task
[09:25:48] btullis, in case you missed https://phabricator.wikimedia.org/T308339, would it be possible to briefly power down an-tool1010 so dcops can move it to a different rack in the same row (cc brouberol)?
[09:26:51] jclark-ctr: when would be a good time to tackle the last 3 hosts of https://phabricator.wikimedia.org/T308339?
[09:29:38] AFAICS this is where we run Superset. btullis: are we good to shut it down for a bit?
[09:39:29] XioNoX: Thanks, I had forgotten about that ticket. I'd prefer to let our users know in advance, but could do this afternoon or early next week. Whatever works for jclark-ctr as long as we have a bit of warning.
[09:40:29] <_joe_> brouberol: it's both superset and turnilo?
[09:40:44] <_joe_> I would be ok with those being down for a bit
[09:41:03] btullis: cool, yeah no real rush, but as we're getting traction with the switchover better to do it sooner rather than later (with appropriate notice)
[09:41:25] It's the product analytics team I was thinking of, mainly.
[09:41:38] and we're talking about a 5min downtime (jclark-ctr to confirm)
[09:43:31] OK, cool. Will coordinate on the ticket. turnilo is on an-tool1007 so won't be affected.
[09:44:51] thx
[09:48:55] XioNoX: thanks, I've pinged dcops on the task
[09:49:05] I had forgotten about that
[09:49:11] eh :)
[09:49:13] thanks
[10:24:30] @XioNoX: @btullis early next week I should be free. I am trying to get caught up with installs right now
[10:25:44] jclark-ctr: nice, can you set a time/date that suits you in the task so everybody is on the same page?
[10:34:04] welcome back
[11:45:51] hi cwhite! could you please help us (Growth team) with https://phabricator.wikimedia.org/T344428#9173651? What seems to be needed at this point is someone to check the system logs, whether there is any mention of exhausting file-related limits, and possibly/ideally also to advise about other potential causes/things to check. Could you help us with that, or route it to an appropriate SRE? Thanks!
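A minimal sketch of the kind of check being asked for in the last message above, assuming a standard Linux host with systemd; the commands are generic and not specific to the WMF setup:

  # search recent system logs for signs of exhausted file-related limits
  sudo journalctl --since "2 days ago" | grep -iE 'too many open files|file-max|nofile'
  # system-wide file handle usage (allocated, free, max) and the kernel limit
  cat /proc/sys/fs/file-nr /proc/sys/fs/file-max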
[12:04:17] btullis: I think I fixed the deployment server issue, do you know what I could safely try deploying to make sure?
[12:08:31] kamila_: Great, thanks. What did you do, out of interest? I've been working on deploying this https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/957938 to an-tool1005 (which is staging) as per https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Superset/Administration#Deploy_to_staging
[12:09:32] btullis: I ran `sudo bash -c 'find /srv/deployment -name DEPLOY_HEAD | xargs sed -i "s/git_server: deploy1002.eqiad.wmnet/git_server: deploy2002.codfw.wmnet/"'` (and will go and feel ashamed once I verify it) :D
[12:09:34] thanks!
[12:09:48] I could create a patchset 13 for you, if you like? You could then try deploying that from deploy2002 while deploy1002 is set back to the master branch.
[12:10:28] sure, thanks
[12:10:59] (also https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Superset/Administration#Deploy_to_staging should probably be changed to refer to deploy2002, ok to do that?)
[12:11:36] (or is it sufficiently obvious?)
[12:12:24] kamila_: or `deployment.eqiad.wmnet`, which resolves to eqiad/codfw as appropriate.
[12:12:37] yeah, that's better, thanks urbanecm
[12:12:45] OK, give me a few minutes for a new patchset.
[12:12:50] thank you!
[12:12:51] (having that in `.eqiad.wmnet` if it's not always in eqiad seems confusing :/)
[12:12:57] agreed, but...
[12:14:15] granted, deployment.codfw.wmnet works equally well
[12:14:57] is that true? I do not remember doing that :D
[12:15:24] I think deployment.codfw.wmnet doesn't exist
[12:16:04] `host deployment.codfw.wmnet ns1.wikimedia.org` claims it does
[12:16:37] yes it does https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wmnet#233
[12:17:07] ah, right, I didn't change that because it was already set to codfw
[12:17:09] you're right
[12:17:36] I guess my question is why we don't just have a single record under .discovery.wmnet
[12:17:46] probably historical reasons
[12:18:33] which means somebody should re-think it, I'm not sure who though
[12:18:55] or should I just go ahead and add it? it's not like adding a new thing will break anything...
[12:19:33] fwiw on wikitech to mention the cumin hosts I use a template that is then included in all pages referencing them https://wikitech.wikimedia.org/wiki/Template:CuminHosts
[12:20:42] ah, that's good to know, thanks volans
[12:21:07] that said we do have some exceptions in the discovery ORIGIN in dns that point directly to a host although I think we were supposed to avoid them, but that ship has sailed :D
[12:22:05] right
[12:24:26] kamila_: There is now a patchset 13 on https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/957938 - You can do the deploy to an-tool1005 as per those instructions above.
[12:24:35] thank you btullis!
[12:24:57] I have reset deploy1002:/srv/deployment/analytics/superset/deploy back to the master branch.
[12:32:10] https://www.irccloud.com/pastebin/b2AO4s8s/
[12:32:20] btullis: looks good to me ^
[12:34:05] and it is the right commit
[12:45:29] There's an email to the sre list about where a particular email address goes. AFAICT our mxs will just send it to google - so would ITS know how a particular email is dealt with in our google account?
[12:48:49] Emperor: not sure for @wikipedia emails
[12:49:35] Emperor: there's an exim command that can be used to check where our mx hosts will route a specific email, https://wikitech.wikimedia.org/wiki/Exim#test_address_routing
[12:51:21] taavi: yes, I know, which is how I know mx1001 will send it to google (router = gsuite_account, transport = remote_smtp)
[12:51:49] XioNoX: exim -bt tells me it's routed to the same localpart at wikimedia.org and sent on to google, but it's what happens then I don't know
[12:52:12] taavi: sorry, that came over tetchier than I intended
[12:52:57] worth following up with that info then, or try to send an email to this address asking who gets it :)
[12:53:19] I'll reply and CC ITS (who I think likely will know) :)
[12:53:45] oh, wait, no, ITS already sent them our way.
[12:53:45] +1
[12:54:01] kamila_: Many thanks.
[12:54:11] I think I'm going to CC them anyway, I think they will likely know about the gsuite accounts
[12:55:25] btullis: yw, sorry for breaking it :-D I'll make sure it's handled next time (not sure how, but somehow)
[13:18:25] kamila_: no problem at all :-)
[13:32:49] <_joe_> kamila_: I think just having a script on the deployment hosts that we can launch as part of the switchover would be enough
[13:34:53] Okay, thanks _joe_, will do
[14:39:50] !incidents
[14:39:50] 4072 (UNACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:40:05] <_joe_> herron: uhm
[14:40:26] <_joe_> let me take a quick look, in the meantime you can check the parsoid slow parse dashboard in logstash
[14:40:34] thanks _joe_
[14:40:40] !ack 4072
[14:40:40] 4072 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:41:12] <_joe_> and unsurprisingly, it's a flurry of zhwiki requests again
[14:41:29] <_joe_> akosiaris: when did you disable the requestctl rule?
[14:41:41] hmm, lemme check
[14:42:13] 11:24 UTC
[14:42:18] jelto: here ^
[14:42:21] !incidents
[14:42:22] 4072 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:42:22] ehmm sorry, that was local time, so 08:24 UTC
[14:42:22] 4073 (UNACKED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:42:38] https://phabricator.wikimedia.org/T346657#9185948
[14:42:44] _joe_: re-enable ?
[14:42:46] !ack 4073
[14:42:47] 4073 (ACKED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:42:59] <_joe_> akosiaris: I don't think it's the reason this time, let me check though
[14:43:23] <_joe_> yeah nevermind, re-enable :/
[14:43:29] or you could re-enable it in log only mode and see if it matches
[14:43:33] ah nevermind :D
[14:43:42] Here if I can help with anything.
[14:43:59] <_joe_> volans: it's a very small number of requests AIUI
[14:44:03] done
[14:44:22] <_joe_> akosiaris: unless this time it's really mobileapps and not wikifeeds
[14:45:43] thanks for enabling the rule, let's see if that helps
[14:45:43] <_joe_> yeah latency going down
[14:45:57] <_joe_> nemo-yiannis: ^^ :/
[14:47:19] !incidents
[14:47:20] 4072 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:47:20] 4073 (RESOLVED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:49:32] !incidents
[14:49:32] 4072 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:49:33] 4073 (RESOLVED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:49:38] both resolved again
[14:51:00] yeah indeed it's the same behaviour with language conversion on parsoid :/
[14:51:48] <_joe_> nemo-yiannis: the worst part is - this was 5 rps
[14:51:58] <_joe_> to wikifeeds I mean
[14:52:16] <_joe_> it's the fanout effect I'm worried about.
[14:53:52] What we did was to strip all `accept-language` headers that start with `zh-` to `zh` so restbase won't call pagebundle/to/pagebundle, but it looks like this wasn't enough
[14:54:08] Do you have any idea how many requests caused the same issue last time?
[14:54:30] <_joe_> it was about 3x what we got now
[14:54:35] ok
[14:54:59] <_joe_> requests to wikifeeds I mean
[14:55:43] maybe instead of replacing accept language from `zh-*` to `zh` we should force `zh` for all requests
[14:55:57] or the problem is not on pagebundle/to/pagebundle
[14:56:15] <_joe_> actually that's what I see flooding parsoid in those moments
[14:56:21] <_joe_> so I do think that's the problem
[14:56:32] <_joe_> we had a flurry of such requests during the incident
[14:56:39] <_joe_> let me count how many on the parsoid hosts
[14:59:48] <_joe_> we got about 1.2k requests per minute for http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle
[14:59:51] <_joe_> per server
[15:00:11] <_joe_> so that's 20 rps per server, which makes it 400 rps
[15:00:39] what causes the fanout explosion?
[15:00:49] <_joe_> unless I'm missing something obvious, the biggest issue is there seems to be a large multiplication factor between wikifeeds and all the layers to parsoid
[15:00:59] <_joe_> bblack: I suspect it's the original url in wikifeeds
[15:01:11] <_joe_> it's a bundle of featured articles about a specific date
[15:01:14] ok
[15:01:26] <_joe_> so if there's say 50 articles, it probably fans out 50 requests
[15:01:42] in the past (way back) the big multiplication we saw with parsoid/rb, it was because it was retrying failures
[15:01:48] <_joe_> the road from the edge is
[15:02:05] that was way back then, when we decided that nothing in our stack should retry a failure, except the outermost edge.
[15:02:17] <_joe_> ats -> rest gateway -> wikifeeds -> restbase -> mobileapps -> restbase -> parsoid
[15:02:18] (so you don't cascade retry counts at every layer)
[15:03:16] <_joe_> I think the multiplication happens within wikifeeds
[15:13:03] <_joe_> ok this is interesting - we have a large baseline of requests for pagebundle/to/pagebundle, around 11 rps per server
[15:13:34] <_joe_> it went to about 20 rps per server during the incident
[15:13:47] <_joe_> and most of these requests take over 20 seconds to respond to
[15:16:49] _joe_: looks like we have telemetry in EtcdConfig already, but wmf-config isn't passing a $logger. Can't really, given it's pre-config.
[15:17:16] we can't set up monolog or anything given that config isn't known yet at that point, and none of the services loadable either
[15:18:16] given this class isn't likely to be used in another context, I think that means it's okay to keep it in core and just make it use something more barebones instead, e.g. trigger_error or error_log etc and let native PHP handle it instead of through a dedicated logstash channel, which can't work indeed
[15:23:12] <_joe_> Krinkle: sorry, in the middle of another rabbithole :)
[15:23:52] <_joe_> nemo-yiannis: so to recap, about 95% of POST requests to parsoid, and I suspect about 90% of the overall load, comes from this single source
[15:24:57] <_joe_> and even with the ban on wikifeeds, there's plenty of requests of that type coming from mobileapps
[16:16:26] got it, thanks _joe_ i am looking at it as we speak
[17:16:41] _joe_: The root cause is indeed the summary endpoint. For each wikifeeds request we send ~10s of reqs to parsoid for `/page/html` and ~10s of reqs to `/page/summary` (which then calls parsoid). The workaround I deployed today partially patched the zhwiki accept-language for `/page/html` requests, but we keep sending the requests to `/page/summary` with the locales that cause the issue (zh-).
[17:17:03] I suggest we wait for the proper fix at the parsoid level on the next train: https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/958593
[20:03:32] <_joe_> nemo-yiannis: ack, makes sense
[20:03:40] <_joe_> also, eek
[20:36:51] does anyone know if our typical Dell chassis come with hot-swappable storage?
[20:54:48] yes, AFAIK all our servers have hot-swappable disks
[20:55:18] at least I'm not aware of exceptions
[21:16:49] ACK, thanks v-olans...confirmed in dcops too
[21:30:37] urbanecm: submitted too soon, updated my comment :)
[21:31:17] cwhite: thanks! seems i'm replying to comments too early :D
[21:31:45] can you help me with interpreting what it means please?
[21:35:34] That's probably a better question for serviceops. Afaik the slowlog captures traces of long-running requests it detects. The exact details I'm not well-versed on, but the stack can show what the process is waiting on.
[21:39:06] okay, thanks cwhite. do you know if there is something that might explain why the process is not able to read an existing file it has permissions for?
[21:39:29] (if there's nothing obvious, i can ask serviceops, but i can't see the logs myself it seems)
[21:42:51] nothing obvious, I'm afraid :(
[21:43:29] here are all the logs from that host around the time of the Cdb\Exception though: https://logstash.wikimedia.org/goto/8fe070310612f8b8640d644a33dbd35f
[21:44:25] thank you
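Following up on the open question above (a process unable to read an existing file it has permissions for): one hypothesis raised earlier in the day on the same ticket was an exhausted file-related limit. A minimal sketch of how that could be checked for a single process, assuming a Linux host; the PID shown is a placeholder:

  # count the file descriptors the process currently holds open
  sudo ls /proc/12345/fd | wc -l
  # compare against the process's "Max open files" limit
  sudo grep 'Max open files' /proc/12345/limits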