[05:37:55] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (Joe) >>! In T329366#8910189, @Clement_Goubert wrote: > Following this deployment and backlog times growing, @Ladsgroup added a specific lane for `parsoidCachePr...
[05:43:54] serviceops, All-and-every-Wikisource, Thumbor: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Joe) >>! In T337649#8909345, @Ladsgroup wrote: > Hmm, maybe because of envoy or some other networking magic in k8s, the connection...
[06:32:48] hi folks
[06:33:19] there is a session store pod not running on a dedicated node in codfw
[06:34:06] should we just delete the pod and check where the new one lands, or is there a different procedure?
[07:13:07] <_joe_> elukey: that seems correct to me
[07:13:15] <_joe_> sorry I didn't notice your messages here
[07:13:25] o/
[07:13:31] ack I can take care of it
[07:14:25] done :)
[07:17:27] <_joe_> <3
[07:57:35] serviceops, All-and-every-Wikisource, Thumbor: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Joe) While the patches @Ladsgroup and I created would kind-of work to a level, the real issue here is structural: * Poolcounter l...
[08:15:04] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (Clement_Goubert) We ended up bumping to 60 https://gerrit.wikimedia.org/r/928120 because the backlog started growing again.
[08:23:08] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (Clement_Goubert) For future reference, what would be the consequence of these jobs being held for more than 5 minutes?
[08:23:39] elukey: can I have your take on this? https://phabricator.wikimedia.org/T338357
[08:27:40] claime: o/ I think we'd need more details about it, like how long the waits actually are (are they unreasonable or not?) and whether specific jobs account for the wait times, etc.
[08:28:15] a while ago we spread the load more evenly across the kafka brokers and increased partitions for some big topics
[08:28:55] but the performance improvement was a side effect, it wasn't why we did the rebalance
[08:29:21] we do have some brokers doing less than others, see
[08:29:22] https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=48
[08:30:32] so we could think about a rebalance, but I'd prefer more details before scheduling the work; so far it seems we are trying something with Kafka without much proof that it is the culprit
[08:30:37] I'll write something in the task
[08:30:47] <3
[08:31:45] done :)
[08:46:10] serviceops, RESTbase Sunsetting, Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (Joe) >>! In T329366#8912637, @Clement_Goubert wrote: > For future reference, what would be the consequence of these jobs being held for more than 5 minutes? It...
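A minimal sketch of the delete-and-verify step discussed in the 06:33–07:14 exchange above. The namespace, pod name, and output checks are placeholders/assumptions, not the real values from that incident; the idea is simply that deleting the misplaced pod lets the ReplicaSet recreate it and the scheduler place the replacement on a dedicated node.

# Check which node each session store pod is running on (namespace name assumed)
kubectl -n sessionstore get pods -o wide

# Delete the pod that landed on a general-purpose node; the controller recreates it
kubectl -n sessionstore delete pod sessionstore-production-abc123   # hypothetical pod name

# Verify the replacement pod was scheduled onto one of the dedicated nodes
kubectl -n sessionstore get pods -o wide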
[14:12:38] mszabo: So I've kept at it with the envoy draining and all that jazz
[14:13:12] And in the end, just sleeping in a preStop hook on all the containers in the pods, for a bit longer than our longest request spikes, works just fine
[14:14:08] No draining, just sleep for 7 seconds in envoy, 8 in all other containers
[14:16:19] <_joe_> we just decided we're ok with losing anything lasting more than the p99 of POSTs
[15:03:28] serviceops, Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (elukey)
[15:03:53] hello folks, I created --^ to get some brainbounce about the current recommendation-api
[15:03:58] lemme know your thoughts :)
[15:22:34] serviceops, Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (akosiaris) Adding some more information: The service was maintained by @bmansurov. It was deployed on the scb cluster. I am the one that moved it to Wikikub...
[15:26:03] serviceops, Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (akosiaris) Oh, I forgot to add that we have https://meta.wikimedia.org/wiki/Recommendation_API explaining what it is. Finally, the referers in turnilo impl...
[15:27:07] serviceops, Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (akosiaris) My personal take, btw, is that it is unowned. I'd say Code Stewardship request, and maybe it's enough of a lost cause that we undeploy it?
[15:46:01] serviceops, Machine-Learning-Team: Replace the current recommendation-api service with a newer version - https://phabricator.wikimedia.org/T338471 (elukey) Thanks for the info! >>! In T338471#8914520, @akosiaris wrote: > Oh I forgot to add that we have https://meta.wikimedia.org/wiki/Recommendation_API...
[15:59:42] Is anyone around to push out a trivial blackbox timeout change for us? https://gerrit.wikimedia.org/r/c/operations/puppet/+/928594
[17:17:45] yes, I will do it
[17:17:48] per "beta"
[17:18:48] James_F: merged on prod master, do you need puppet runs?
[17:19:03] or just wait <= 30 m
[17:19:08] mutante: Happy to wait.
[17:19:27] alright, done
[17:19:38] Thanks!
[17:22:19] yw
[18:18:02] serviceops, Wikimedia-Developer-Portal, Kubernetes: Deployment of developer-portal into 'staging' k8s cluster failing due to insufficient cpu and node taints - https://phabricator.wikimedia.org/T338493 (bd808)
[18:18:41] I moved on and updated the codfw and eqiad deployments of developer-portal, but I have no idea how to fix ^
[21:24:16] serviceops, Shellbox, wikitech.wikimedia.org: Shellbox is broken on wikitech-static - https://phabricator.wikimedia.org/T338520 (RLazarus) p: Triage→High
[22:03:22] serviceops, Shellbox, wikitech.wikimedia.org: Shellbox is broken on wikitech-static - https://phabricator.wikimedia.org/T338520 (RLazarus) Looks like disk fullness: ` root@wikitech-static:~# df -h Filesystem Size Used Avail Use% Mounted on udev 979M 0 979M 0% /dev tmpfs...
[23:34:34] serviceops, Shellbox, wikitech.wikimedia.org: Shellbox is broken on wikitech-static - https://phabricator.wikimedia.org/T338520 (RLazarus) @Andrew Is this something you can take a look at?
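The 14:12–14:16 exchange above describes the graceful-shutdown approach that ended up working: no envoy drain sequencing, just a preStop sleep on every container, sized a bit above the longest observed request spikes. Below is a generic Kubernetes pod-spec sketch of that idea; the container names and spec layout are illustrative assumptions, not the actual deployment-charts change, and only the sleep values (7s for envoy, 8s for the other containers) come from the log.

spec:
  containers:
    - name: envoy              # TLS/ingress sidecar: stops taking work first
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "7"]   # assumes a sleep binary exists in the image
    - name: app                # application container outlives envoy by one second
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "8"]

Because Kubernetes removes the terminating pod from Service endpoints roughly while the preStop hooks run, and only sends SIGTERM after the hook returns, in-flight requests shorter than the sleep can complete; anything longer than that (roughly the p99 of POSTs, per _joe_) is accepted as lost.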
[23:48:15] serviceops, Shellbox, wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (RLazarus)