[02:02:46] * andrewbogott merged it
[03:30:13] _joe_: i'm speculating, but if it's anything like the other (potentially unrelated) networking issues, I guess a repeated 250ms timeout would do the trick.
[03:30:56] not 250ms, that's what envoy-shellbox used. I don't know the timeout here offhand, but afaik we don't know the upper bound of what that timeout truly is, possibly packet loss / infinite.
[03:32:01] (alexandros was looking into that at some point)
[09:13:08] hi all, back from vacation, just going through the backlog; please ping if there is something to prioritise
[09:17:41] welcome back!
[09:17:42] welcome back jbond! codfw is primary (including deploy and mwconfig hosts among other things) ;)
[09:18:19] ack and thanks
[09:19:56] speaking of which, now that deploy1002 isn't active, can Service Ops make sure it gets moved (cc akosiaris but iirc you're on vacation soon) - https://phabricator.wikimedia.org/T308339
[09:20:43] we already missed the shot 5 months ago :)
[09:20:51] <_joe_> XioNoX: it's definitely moved now
[09:21:28] <_joe_> XioNoX: it's not clear to me what I can do to make sure it gets moved
[09:21:47] <_joe_> XioNoX: deploy2002 will stay primary for 6 months though
[09:22:02] <_joe_> so we're not in a rush
[09:22:26] yeah, but seeing what happened last time I'd rather it gets done sooner rather than later :)
[09:23:46] _joe_: multiple options depending on the host, either sync up with dcops, or power it off (or get it into a state where it can be powered off by dcops) and mention it in the task
[09:25:48] btullis, in case you missed https://phabricator.wikimedia.org/T308339, would it be possible to briefly power down an-tool1010 so dcops can move it to a different rack in the same row (cc brouberol)?
[09:26:51] jclark-ctr: when would be a good time to tackle the last 3 hosts of https://phabricator.wikimedia.org/T308339?
[09:29:38] AFAICS this is where we run Superset. btullis: are we good to shut it down for a bit?
[09:39:29] XioNoX: Thanks, I had forgotten about that ticket. I'd prefer to let our users know in advance, but could do this afternoon or early next week. Whatever works for jclark-ctr as long as we have a bit of warning.
[09:40:29] <_joe_> brouberol: it's both superset and turnilo?
[09:40:44] <_joe_> I would be ok with those being down for a bit
[09:41:03] btullis: cool, yeah no real rush, but as we're getting traction with the switchover better to do it sooner rather than later (with appropriate notice)
[09:41:25] It's the product analytics team I was thinking of, mainly.
[09:41:38] and we're talking about a 5min downtime (jclark-ctr to confirm)
[09:43:31] OK, cool. Will coordinate on the ticket. turnilo is on an-tool1007 so won't be affected.
[09:44:51] thx
[09:48:55] XioNoX: thanks, I've pinged dcops on the task
[09:49:05] I had forgotten about that
[09:49:11] eh :)
[09:49:13] thanks
[10:24:30] @XioNoX: @btullis early next week I should be free. I am trying to get caught up with installs right now
[10:25:44] jclark-ctr: nice, can you set a time/date that suits you in the task so everybody is on the same page?
[10:34:04] welcome back
[11:45:51] hi cwhite! could you please help us (Growth team) with https://phabricator.wikimedia.org/T344428#9173651? What seems to be needed at this point is someone to check the system logs, whether there is any mention of exhausting file-related limits, and possibly/ideally also to advise about other potential causes/things to check. Could you help us with that, or route it to an appropriate SRE? Thanks!
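A minimal sketch of the kind of check being asked for in the last message above, assuming a standard Linux host with systemd; the commands are generic and not specific to the WMF setup:

  # search recent system logs for signs of exhausted file-related limits
  sudo journalctl --since "2 days ago" | grep -iE 'too many open files|file-max|nofile'
  # system-wide file handle usage (allocated, free, max) and the kernel limit
  cat /proc/sys/fs/file-nr /proc/sys/fs/file-max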
[12:04:17] btullis: I think I fixed the deployment server issue, do you know what I could safely try deploying to make sure?
[12:08:31] kamila_: Great, thanks. What did you do, out of interest? I've been working on deploying this https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/957938 to an-tool1005 (which is staging) as per https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Superset/Administration#Deploy_to_staging
[12:09:32] btullis: I ran `sudo bash -c 'find /srv/deployment -name DEPLOY_HEAD | xargs sed -i "s/git_server: deploy1002.eqiad.wmnet/git_server: deploy2002.codfw.wmnet/"'` (and will go and feel ashamed once I verify it) :D
[12:09:34] thanks!
[12:09:48] I could create a patchset 13 for you, if you like? You could then try deploying that from deploy2002 while deploy1002 is set back to the master branch.
[12:10:28] sure, thanks
[12:10:59] (also https://wikitech.wikimedia.org/wiki/Data_Engineering/Systems/Superset/Administration#Deploy_to_staging should probably be changed to refer to deploy2002, ok to do that?)
[12:11:36] (or is it sufficiently obvious?)
[12:12:24] kamila_: or `deployment.eqiad.wmnet`, which resolves to eqiad/codfw as appropriate.
[12:12:37] yeah, that's better, thanks urbanecm
[12:12:45] OK, give me a few minutes for a new patchset.
[12:12:50] thank you!
[12:12:51] (having that in `.eqiad.wmnet` if it's not always in eqiad seems confusing :/)
[12:12:57] agreed, but...
[12:14:15] granted, deployment.codfw.wmnet works equally well
[12:14:57] is that true? I do not remember doing that :D
[12:15:24] I think deployment.codfw.wmnet doesn't exist
[12:16:04] `host deployment.codfw.wmnet ns1.wikimedia.org` claims it does
[12:16:37] yes it does https://gerrit.wikimedia.org/r/plugins/gitiles/operations/dns/+/refs/heads/master/templates/wmnet#233
[12:17:07] ah, right, I didn't change that because it was already set to codfw
[12:17:09] you're right
[12:17:36] I guess my question is why we don't just have a single record under .discovery.wmnet
[12:17:46] probably historical reasons
[12:18:33] which means somebody should re-think it, I'm not sure who though
[12:18:55] or should I just go ahead and add it? it's not like adding a new thing will break anything...
[12:19:33] fwiw on wikitech to mention the cumin hosts I use a template that is then included in all pages referencing them https://wikitech.wikimedia.org/wiki/Template:CuminHosts
[12:20:42] ah, that's good to know, thanks volans
[12:21:07] that said we do have some exceptions in the discovery ORIGIN in dns that point directly to a host although I think we were supposed to avoid them, but that ship has sailed :D
[12:22:05] right
[12:24:26] kamila_: There is now a patchset 13 on https://gerrit.wikimedia.org/r/c/analytics/superset/deploy/+/957938 - You can do the deploy to an-tool1005 as per those instructions above.
[12:24:35] thank you btullis!
[12:24:57] I have reset deploy1002:/srv/deployment/analytics/superset/deploy back to the master branch.
[12:32:10] https://www.irccloud.com/pastebin/b2AO4s8s/
[12:32:20] btullis: looks good to me ^
[12:34:05] and it is the right commit
[12:45:29] There's an email to the sre list about where a particular email address goes. AFAICT our mxs will just send it to google - so would ITS know how a particular email is dealt with in our google account?
[12:48:49] Emperor: not sure for @wikipedia emails
[12:49:35] Emperor: there's an exim command that can be used to check where our mx hosts will route a specific email, https://wikitech.wikimedia.org/wiki/Exim#test_address_routing
[12:51:21] taavi: yes, I know, which is how I know mx1001 will send it to google (router = gsuite_account, transport = remote_smtp)
[12:51:49] XioNoX: exim -bt tells me it's routed to the same localpart at wikimedia.org and sent on to google, but it's what happens then I don't know
[12:52:12] taavi: sorry, that came over tetchier than I intended
[12:52:57] worth following up with that info then, or try to send an email to this address asking who gets it :)
[12:53:19] I'll reply and CC ITS (who I think likely will know) :)
[12:53:45] oh, wait, no, ITS already sent them our way.
[12:53:45] +1
[12:54:01] kamila_: Many thanks.
[12:54:11] I think I'm going to CC them anyway, I think they will likely know about the gsuite accounts
[12:55:25] btullis: yw, sorry for breaking it :-D I'll make sure it's handled next time (not sure how, but somehow)
[13:18:25] kamila_: no problem at all :-)
[13:32:49] <_joe_> kamila_: I think just having a script on the deployment hosts that we can launch as part of the switchover would be enough
[13:34:53] Okay, thanks _joe_, will do
[14:39:50] !incidents
[14:39:50] 4072 (UNACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:40:05] <_joe_> herron: uhm
[14:40:26] <_joe_> let me take a quick look, in the meantime you can check the parsoid slow parse dashboard in logstash
[14:40:34] thanks _joe_
[14:40:40] !ack 4072
[14:40:40] 4072 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:41:12] <_joe_> and unsurprisingly, it's a flurry of zhwiki requests again
[14:41:29] <_joe_> akosiaris: when did you disable the requestctl rule?
[14:41:41] hmm, lemme check
[14:42:13] 11:24 UTC
[14:42:18] jelto: here ^
[14:42:21] !incidents
[14:42:22] 4072 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:42:22] ehmm sorry, that was local time, so 08:24 UTC
[14:42:22] 4073 (UNACKED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:42:38] https://phabricator.wikimedia.org/T346657#9185948
[14:42:44] _joe_: re-enable ?
[14:42:46] !ack 4073
[14:42:47] 4073 (ACKED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:42:59] <_joe_> akosiaris: I don't think it's the reason this time, let me check though
[14:43:23] <_joe_> yeah nevermind, re-enable :/
[14:43:29] or you could re-enable it in log only mode and see if it matches
[14:43:33] ah nevermind :D
[14:43:42] Here if I can help with anything.
[14:43:59] <_joe_> volans: it's a very small number of requests AIUI
[14:44:03] done
[14:44:22] <_joe_> akosiaris: unless this time it's really mobileapps and not wikifeeds
[14:45:43] thanks for enabling the rule, let's see if that helps
[14:45:43] <_joe_> yeah latency going down
[14:45:57] <_joe_> nemo-yiannis: ^^ :/
[14:47:19] !incidents
[14:47:20] 4072 (ACKED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:47:20] 4073 (RESOLVED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:49:32] !incidents
[14:49:32] 4072 (RESOLVED) PHPFPMTooBusy parsoid sre (php7.4-fpm.service codfw)
[14:49:33] 4073 (RESOLVED) ProbeDown sre (10.2.1.28 ip4 parsoid-php:443 probes/service http_parsoid-php_ip4 codfw)
[14:49:38] both resolved again
[14:51:00] yeah indeed it's the same behaviour with language conversion on parsoid :/
[14:51:48] <_joe_> nemo-yiannis: the worst part is - this was 5 rps
[14:51:58] <_joe_> to wikifeeds I mean
[14:52:16] <_joe_> it's the fanout effect I'm worried about.
[14:53:52] What we did was to strip all `accept-language` headers that start with `zh-` to `zh` so restbase won't call pagebundle/to/pagebundle, but it looks like this wasn't enough
[14:54:08] Do you have any idea how many requests caused the same issue last time?
[14:54:30] <_joe_> it was about 3x what we got now
[14:54:35] ok
[14:54:59] <_joe_> requests to wikifeeds I mean
[14:55:43] maybe instead of replacing accept language from `zh-*` to `zh` we should force `zh` for all requests
[14:55:57] or the problem is not on pagebundle/to/pagebundle
[14:56:15] <_joe_> actually that's what I see flooding parsoid in those moments
[14:56:21] <_joe_> so I do think that's the problem
[14:56:32] <_joe_> we had a flurry of such requests during the incident
[14:56:39] <_joe_> let me count how many on the parsoid hosts
[14:59:48] <_joe_> we got about 1.2k requests per minute for http://zh.wikipedia.org/w/rest.php/zh.wikipedia.org/v3/transform/pagebundle/to/pagebundle
[14:59:51] <_joe_> per server
[15:00:11] <_joe_> so that's 20 rps per server, which makes it 400 rps
[15:00:39] what causes the fanout explosion?
[15:00:49] <_joe_> unless I'm missing something obvious, the biggest issue is there seems to be a large multiplication factor between wikifeeds and all the layers to parsoid
[15:00:59] <_joe_> bblack: I suspect it's the original url in wikifeeds
[15:01:11] <_joe_> it's a bundle of featured articles about a specific date
[15:01:14] ok
[15:01:26] <_joe_> so if there's say 50 articles, it probably fans out 50 requests
[15:01:42] in the past (way back) the big multiplication we saw with parsoid/rb, it was because it was retrying failures
[15:01:48] <_joe_> the road from the edge is
[15:02:05] that was way back then, when we decided that nothing in our stack should retry a failure, except the outermost edge.
[15:02:17] <_joe_> ats -> rest gateway -> wikifeeds -> restbase -> mobileapps -> restbase -> parsoid
[15:02:18] (so you don't cascade retry counts at every layer)
[15:03:16] <_joe_> I think the multiplication happens within wikifeeds
[15:13:03] <_joe_> ok this is interesting - we have a large baseline of requests for pagebundle/to/pagebundle, around 11 rps per server
[15:13:34] <_joe_> it went to about 20 rps per server during the incident
[15:13:47] <_joe_> and most of these requests take over 20 seconds to respond to
[15:16:49] _joe_: looks like we have telemetry in EtcdConfig already, but wmf-config isn't passing a $logger. Can't really, given it's pre-config.
[15:17:16] we can't set up monolog or anything given that config isn't known yet at that point, and none of the services loadable either
[15:18:16] given this class isn't likely to be used in another context, I think that means it's okay to keep it in core and just make it use something more barebones instead, e.g. trigger_error or error_log etc and let native PHP handle it instead of through a dedicated logstash channel, which can't work indeed
[15:23:12] <_joe_> Krinkle: sorry, in the middle of another rabbithole :)
[15:23:52] <_joe_> nemo-yiannis: so to recap, about 95% of POST requests to parsoid, and I suspect about 90% of the overall load, comes from this single source
[15:24:57] <_joe_> and even with the ban on wikifeeds, there's plenty of requests of that type coming from mobileapps
[16:16:26] got it, thanks _joe_ i am looking at it as we speak
[17:16:41] _joe_: The root cause is indeed the summary endpoint. For each wikifeeds request we send ~10s of reqs to parsoid for `/page/html` and ~10s of reqs to `/page/summary` (which then calls parsoid). The workaround I deployed today partially patched the zhwiki accept-language for `/page/html` requests, but we keep sending the requests to `/page/summary` with the locales that cause the issue (zh-).
[17:17:03] I suggest we wait for the proper fix at the parsoid level on the next train: https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/958593
[20:03:32] <_joe_> nemo-yiannis: ack, makes sense
[20:03:40] <_joe_> also, eek
[20:36:51] does anyone know if our typical Dell chassis come with hot-swappable storage?
[20:54:48] yes, AFAIK all our servers have hot-swappable disks
[20:55:18] at least I'm not aware of exceptions
[21:16:49] ACK, thanks v-olans...confirmed in dcops too
[21:30:37] urbanecm: submitted too soon, updated my comment :)
[21:31:17] cwhite: thanks! seems i'm replying to comments too early :D
[21:31:45] can you help me with interpreting what it means please?
[21:35:34] That's probably a better question for serviceops. Afaik the slowlog captures traces of long-running requests it detects. The exact details I'm not well-versed on, but the stack can show what the process is waiting on.
[21:39:06] okay, thanks cwhite. do you know if there is something that might explain why the process is not able to read an existing file it has permissions for?
[21:39:29] (if there's nothing obvious, i can ask serviceops, but i can't see the logs myself it seems)
[21:42:51] nothing obvious, I'm afraid :(
[21:43:29] here are all the logs from that host around the time of the Cdb\Exception though: https://logstash.wikimedia.org/goto/8fe070310612f8b8640d644a33dbd35f
[21:44:25] thank you
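Following up on the open question above (a process unable to read an existing file it has permissions for): one hypothesis raised earlier in the day on the same ticket was an exhausted file-related limit. A minimal sketch of how that could be checked for a single process, assuming a Linux host; the PID shown is a placeholder:

  # count the file descriptors the process currently holds open
  sudo ls /proc/12345/fd | wc -l
  # compare against the process's "Max open files" limit
  sudo grep 'Max open files' /proc/12345/limits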