[01:21:37] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Kappakayala) Hi @jeena , looking at the comments looks like this is not related to mw-on-k8s migration as I see @Clement_Goubert reverted to bare metal...
[02:32:08] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jeena) Yes, I think that is the correct assessment, so we still need to figure out how to solve this issue.
[02:45:04] serviceops, MediaWiki-General, MediaWiki-Platform-Team, Traffic, and 4 others: MW returns uncacheable responses for en.wikipedia.org when specific XFF values are sent - https://phabricator.wikimedia.org/T350861 (sbassett)
[04:39:26] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) Do you have a trace of how these calls occur? In theory I think this shouldn't be happening because thumbnailing requests should be directed to Th...
[04:55:37] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) What's supposed to happen is that the ThumbnailRender job makes an HTTP request to `http://ms-fe.svc.codfw.wmnet/wikipedia/commons/thumb/4/4f/Ambr...
[05:27:04] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) Also there are ~100 errors per minute. ThumbnailRender tries to create four thumbnails per upload. There are usually 5-10 uploads per minute on Co...
[07:10:40] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) If {T351400} is the cause, then I am unsure if this is an unbreak now, as that code has been running since January 5 (see https://grafana.wiki...
[07:14:30] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Joe) No that is NOT the cause. The problem is also happening on jobrunners, I don't think that script actually spawns jobs. I think the root cause is t...
[07:24:54] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) It was run with a `--use-jobqueue` parameter, that's pretty indicative of spawning jobs.
[07:25:22] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Joe) >>! In T355243#9467881, @kostajh wrote: > If {T351400} is the cause, then I am unsure if this is an unbreak now, as that code has been running sin...
[07:28:40] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9467899, @Joe wrote: >>>! In T355243#9467881, @kostajh wrote: >> If {T351400} is the cause, then I am unsure if this is an unbr...
[07:30:18] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) @Joe do you want us to stop the script for now, and switch to not using the job queue?
[07:32:25] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Joe) >>! In T355243#9467924, @kostajh wrote: > @Joe do you want us to stop the script for now, and switch to not using the job queue? I mean, right no...
[07:39:21] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9467936, @Joe wrote: >>>! In T355243#9467924, @kostajh wrote: >> @Joe do you want us to stop the script for now, and switch to...
[07:46:22] serviceops, MW-on-K8s, TimedMediaHandler, Video extension: [DRAFT] Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (Joe) p:Triage→High
[08:03:25] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) The PhotoDNA API docs say //"Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp, or tiff) could be provided ... response ti...
[08:07:01] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9468007, @Tgr wrote: > The PhotoDNA API docs say //"Alternatively, a publicly accessible URL of an image (gif, jpeg, png, bmp,...
[08:14:19] serviceops, MW-on-K8s, TimedMediaHandler, Video: [DRAFT] Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (Peachey88)
[08:39:51] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) Thank you everyone for jumping on this. It's not clear to me at this point if this is train-related after all. Should this ticket still be con...
[08:46:48] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9468080, @jnuche wrote: > Thank you everyone for jumping on this. > > It's not clear to me at this point if this is train-rela...
[08:50:39] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh) >>! In T355243#9468085, @kostajh wrote: >>>! In T355243#9468080, @jnuche wrote: >> Thank you everyone for jumping on this. >> >> It's not cle...
[09:13:46] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz)
[09:15:28] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) I've stopped the script running now and have removed {T354432} as a parent task.
[09:20:45] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (jnuche) @kostajh @Dreamy_Jazz thank you, I can see the error rate going down. I'm going to proceed with the train.
[09:37:37] serviceops, MW-on-K8s, TimedMediaHandler, Video: [DRAFT] Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (Joe)
[09:37:48] serviceops, MW-on-K8s, TimedMediaHandler, Video: Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (Joe)
[09:43:08] serviceops, MW-on-K8s, TimedMediaHandler, Video: Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (Joe) Adding @brion as the resident expert / maintainer of TimedMediaHandler. I'd like to get your opinion on how hard it would be to port WebVideoTranscodeJob to use shell...
[10:38:42] serviceops, MW-on-K8s, TimedMediaHandler, Video: Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (TheDJ) Related: T105951, T155114, T292322
[11:18:06] serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (Clement_Goubert) Using the `orchestrator.namespace: ipoid` filter instead of `kubernetes.label.app: ipoid` works but only the logs for `ipoid-production-daily-up...
[11:29:16] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[11:38:53] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[11:48:54] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) >>! In T355243#9468007, @Tgr wrote: > The PhotoDNA API docs say //"Alternatively, a publicly accessible URL of an image (gif, jpeg, png, b...
[11:54:16] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz)
[12:00:23] serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (Clement_Goubert) The main app logs aren't yet in ECS format, cf https://gitlab.wikimedia.org/repos/mediawiki/services/ipoid/-/merge_requests/213 A split index da...
[12:11:01] serviceops, iPoid-Service, Trust and Safety Product Sprint: ipoid logs not visible in Logstash - https://phabricator.wikimedia.org/T355247 (kostajh) Open→Resolved a:kostajh Thanks. In the meantime, I think we can mark this as resolved. Thanks to you and @akosiaris for reworking the ipoid...
[12:14:10] serviceops, Content-Transform-Team-WIP, Page Content Service, RESTBase Sunsetting, Patch-For-Review: Update mobileapps k8s deployment chart for Cassandra credentials - https://phabricator.wikimedia.org/T350507 (Jgiannelos) @Eevans Since things are moving forward, can devs have cqlsh access (re...
[12:46:28] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (Clement_Goubert) Can you hold for hosts in codfw rows A and B for {T354869}? It's not a problem that hosts from these rows have already been changed over, we will just hav...
[12:47:07] serviceops, MW-on-K8s: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074 (Clement_Goubert) Can you hold for hosts in codfw rows A and B for {T354869}? It's not a problem that hosts from these rows have already been changed over, we will just have to dra...
[13:53:13] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz)
[14:27:10] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:47:00] serviceops, Infrastructure-Foundations, Puppet-Core, SRE, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (MoritzMuehlenhoff)
[14:51:47] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz)
[14:55:41] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) p:Unbreak!→Triage After I backported the patch in {T355309} and restarted the script with the job queue method, I no longer see th...
[15:09:02] serviceops, DC-Ops: Possible firmware issues reimaging mw2282 - https://phabricator.wikimedia.org/T355333 (hnowlan)
[15:11:04] serviceops, MW-on-K8s, TimedMediaHandler, Video: Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (brion) Couple quick notes: * Reducing thread count is IMO a very bad idea, as most of the time there will be few jobs and they may be high resolution videos. You want to u...
[15:18:42] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1461.eqiad.wmnet with OS bullseye
[15:18:47] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1469.eqiad.wmnet with OS bullseye
[15:18:55] serviceops, MW-on-K8s, SRE, Patch-For-Review: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by hnowlan@cumin1002 for host mw1439.eqiad.wmnet with OS bullseye
[15:53:17] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) `lang=bash cgoubert@kubestage2002:~$ sudo calicoctl node status Calico process is running. IPv4 BGP status +---...
[15:53:28] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1469.eqiad.wmnet with OS bullseye completed: - mw1469 (**PASS**) - Downtimed on Icinga/Alertma...
[15:56:46] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1439.eqiad.wmnet with OS bullseye completed: - mw1439 (**PASS**) - Downtimed on Icinga/Alertma...
[15:59:44] serviceops, MW-on-K8s, SRE: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by hnowlan@cumin1002 for host mw1461.eqiad.wmnet with OS bullseye completed: - mw1461 (**PASS**) - Downtimed on Icinga/Alertma...
[16:11:02] sobanski: the trouble I had was with porting the python build process from buster to bullseye ( https://phabricator.wikimedia.org/T342346 )
[16:11:34] looks like I have abandoned all the patches last week as part of some cleanup of my gerrit dashboard [ https://gerrit.wikimedia.org/r/q/bug:T342346 ]
[16:11:56] the series of patches can be restored and polished up with someone familiar with the python-build Docker images and the Makefile
[16:17:47] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) No-op on these nodes, proceeding with the rest.
[16:19:58] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) >>! In T352883#9469622, @Clement_Goubert wrote: > `lang=bash > IPv6 BGP status > +-------------------+----------...
[16:41:32] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, netops: Test IP-renumbering on kubestage2002.codfw.wmnet - https://phabricator.wikimedia.org/T352883 (Clement_Goubert) No-op on the rest of the infra.
[16:43:02] serviceops, Infrastructure-Foundations, Prod-Kubernetes, SRE, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (Clement_Goubert) Summary of deployment from {T352883}: - No-op on all nodes except kubestag...
[17:28:41] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (kostajh)
[17:58:24] serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, and 2 others: Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (EBernhardson) The extension is now documented and written, but still needs to finish code review...
[18:05:48] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) >>! In T355243#9468091, @kostajh wrote: > what's the correct way to stop a script that another user has run? Someone with root can kill it (send...
[18:41:48] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) >>! In T355243#9469332, @Dreamy_Jazz wrote: > it seems you cannot call `File::transform` with the `RENDER_NOW` flag while using a job. I don't th...
[18:46:11] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) >>! In T355243#9470344, @Tgr wrote: > The root issue is that RENDER_NOW breaks Thumbor integration. The same probably happens if you make a reques...
[18:53:31] serviceops, MW-on-K8s, TimedMediaHandler, Video: Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (brion) Another complication on thread count -- the VP9 encoder can only make use of so many threads effectively, based on the size of the frame (controls number of macrobl...
[18:59:50] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Dreamy_Jazz) >>! In T355243#9470344, @Tgr wrote: >>>! In T355243#9469332, @Dreamy_Jazz wrote: >> it seems you cannot call `File::transform` with the `R...
[19:01:43] serviceops, MW-on-K8s, TimedMediaHandler, Video: Port videoscaling to kubernetes - https://phabricator.wikimedia.org/T355292 (brion) A harder, but possibly desirable possibility I mentioned on IRC: we could encode each ~10-second input chunk separately, then stitch them back together on completio...
[19:14:58] serviceops, MW-on-K8s, SRE, Patch-For-Review: ThumbnailRender job calls $wgImageMagickConvertCommand - https://phabricator.wikimedia.org/T355243 (Tgr) >>! In T355243#9470364, @Tgr wrote: > Although when I try this, there are a bunch of `Thumbor-*` headers on the response so it doesn't seem like i...
[19:19:44] serviceops, API Platform, CirrusSearch, MediaWiki-Configuration, and 2 others: Provide a method for internal services to run api requests for private wikis - https://phabricator.wikimedia.org/T345185 (Tgr) A token is the same level of protection we have for OAuth 2 and that's used all over the pl...