[00:55:31] 10serviceops, 10Shellbox, 10SyntaxHighlight, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), and 2 others: Pages with Pygments or Timeline intermittently fail to render (Shellbox server returned status code 503) - https://phabricator.wikimedia.org/T292663 (10Krinkle) 05Open→03Resolved
[01:11:15] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) I will see if I can get an ssh connection. Worst case we can resize the instance and increase our monthly bill by a few bucks. Thanks for not...
[01:18:33] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10RLazarus) a:03Andrew
[01:34:54] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) A ton of files in /srv/mediawiki/images/wikitech/archive, but deleteArchivedFiles.php --delete says there's nothing to delete. It's tempting to...
[05:50:59] 10serviceops, 10All-and-every-Wikisource, 10Thumbor: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10KTT-Commons) As a normal user who knows nothing about the subliminal coding, I'd like to report that the problem seems to be allevi...
[07:35:55] 10serviceops, 10SRE, 10ops-eqiad: mw1492 is down - https://phabricator.wikimedia.org/T338566 (10MoritzMuehlenhoff)
[09:37:10] 10serviceops, 10SRE, 10ops-eqiad: mw1492 is down - https://phabricator.wikimedia.org/T338566 (10elukey) ` elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "mw1492.mgmt.eqiad.wmnet" -U root -E mc reset cold Unable to read password from environment Password: Error: Unable to establish IPMI v2 / RMCP+ session `...
[09:43:59] hello folks
[09:44:07] anything against me applying https://phabricator.wikimedia.org/T338357#8917297 ?
[09:44:12] to both kafka main clusters
[09:44:26] in the past we applied the same changes and it worked nicely
[09:44:37] kafka handles it transparently
[09:48:15] elukey: possible issues?
[09:48:54] (and by that I mean, is there a way this could make things worse for eventgate speed / the cluster as a whole?)
[09:49:44] claime: in theory no, the producers (eventgate) will start issuing new messages/events to the new partitions transparently after a bit, and the load should spread among multiple brokers (same thing for consumers)
[09:50:11] elukey: then I have no objection
[09:51:42] very well, proceeding :)
[09:56:20] {{done}}
[09:58:35] <_joe_> prod changes on a friday
[09:58:42] <_joe_> we're getting naughty here
[10:02:05] yes this one is borderline, my bad
[10:03:58] <_joe_> elukey: I'm just teasing the two of you :D
[10:06:17] aaand I believe that changeprop didn't like it very much, so I'll roll-restart the pods
[10:06:21] sigh
[10:06:23] claime: ok to proceed?
[10:06:51] I always forget that changeprop hates me
[10:07:04] since it uses an ancient kafka client
[10:07:11] elukey: yeah go ahead
[10:12:48] checking metrics, something is moving
[10:12:54] * elukey sighs at kafka and changeprop
[10:13:40] jobrunners working hard
[10:13:48] but that's expected
[10:15:52] ah but wait, transclusions are probably on changeprop-jobqueue, right?
[10:15:57] lemme check
[10:16:35] <_joe_> no
[10:16:37] <_joe_> they're not
[10:16:41] nope
[10:16:52] ok then I roll-restarted the correct one, still the message flow is low
[10:16:55] weird
[10:16:59] <_joe_> claime: elukey's change shouldn't impact the jobrunners
[10:17:40] https://grafana.wikimedia.org/goto/ZngXxx_Vz?orgId=1
[10:17:45] Then something else happened?
[10:18:18] <_joe_> https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=54&from=now-12h&to=now very worrisome trend
[10:18:24] <_joe_> and we know what changed
[10:18:44] Well yeah
[10:18:56] We may need to throw more hardware at the problem
[10:19:02] <_joe_> with transcludes lagging behind, more jobs from parsoidPreWarm actually do the work
[10:19:24] <_joe_> yes, keep an eye on the idle workers, both in the parsoid cluster and this one, and maybe move more servers
[10:19:31] <_joe_> we knew we'd need it eventually
[10:19:53] I may need some brainbounce here, see
[10:19:54] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&viewPanel=17
[10:20:05] (select transcludes-resource-change)
[10:20:26] but the dedupe rate went down
[10:20:30] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&viewPanel=24
[10:20:36] by more or less the same amount
[10:20:46] 10serviceops, 10All-and-every-Wikisource, 10Thumbor: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Snaevar) >>! In T337649#8917000, @KTT-Commons wrote: > As a normal user who knows nothing about the subliminal coding, I'd like to...
[10:21:01] <_joe_> elukey: what I see is that the big flood of transcludes finished just before you did the restart?
[10:21:43] _joe_ not sure, the traffic to the topic is lower now afaics https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.change-prop.transcludes.resource-change&from=now-3h&to=now
[10:21:54] but changeprop is still working fine, nothing horrible in the logs
[10:21:58] same for eventgate
[10:24:24] <_joe_> elukey: I don't know what to tell you tbh
[10:25:11] <_joe_> I would give it an hour, see if things get up to speed again
[10:25:28] 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10Clement_Goubert) We need to move more servers from the parsoid cluster to the jobrunners: Parsoid saturation last 24h: {F37099007} Jobrunners saturation last 24...
[10:25:35] +1 I am thinking the same, I am very worried about the state of the kafka client in changeprop
[10:25:39] we should really upgrade it
[10:25:49] <_joe_> if you zoom out you can see this is nothing particularly strange
[10:25:51] <_joe_> https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.change-prop.transcludes.resource-change&from=now-30d&to=now&viewPanel=6
[10:26:02] <_joe_> elukey: indeed. Someone needs to take care of changeprop
[10:26:05] * elukey nods
[10:26:35] <_joe_> so what happens here I think is
[10:27:25] <_joe_> if a big root job (i.e. a template with a lot of usage) has changed, you will get a progressive explosion of it into individual events on this queue
[10:29:05] (in the meantime traffic is following a better trend)
[10:33:56] <_joe_> "better"
[10:34:16] 10serviceops: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (10Clement_Goubert) Proposal to update `prometheus-nutcracker-exporter` to python3-bullseye, and remove `python3-build-stretch`, `python3-devel` and `python3` ima...
[10:34:25] <_joe_> considering I think all those jobs should be removed :P
[10:47:09] ack, seems all good now, will check later :)
[15:12:20] 10serviceops, 10Machine-Learning-Team: Can't delete images from docker registry (from build2001 using docker-registryctl) - https://phabricator.wikimedia.org/T338623 (10klausman)
[15:21:21] 10serviceops, 10Machine-Learning-Team: Can't delete images from docker registry (from build2001 using docker-registryctl) - https://phabricator.wikimedia.org/T338623 (10klausman) 05Open→03Invalid This was caused by me using the wrong host. What I _should_ have used: `docker-registryctl delete-tags docke...
[15:31:48] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) ` root@wikitech-static:/srv/mediawiki/images/wikitech/archive# df -h Filesystem Size Used Avail Use% Mounted on udev 979M...
[15:32:31] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) 05Open→03Resolved > > No let's see if that was actually the problem... It was!
[16:17:17] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10RLazarus) Thank you!
[17:08:56] 10serviceops, 10Wikimedia-Developer-Portal, 10Kubernetes: Deployment of developer-portal into 'staging' k8s cluster failing due to insufficient cpu and node taints - https://phabricator.wikimedia.org/T338493 (10bd808) ` $ kubectl get pods NAME READY STATUS RESTART...
[17:11:43] * bd808 doesn't have any idea how he can fix T338493
[19:09:18] 10serviceops, 10SRE, 10envoy: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10BCornwall)
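For reference on the partition change discussed above (T338357), a sketch of how a topic's partition count is typically grown and then verified with the stock Kafka CLI tools. Only the topic name comes from the Grafana links in the log; the broker address, target partition count, and consumer group name are assumptions for illustration, not the actual values from the task.

```sh
# Sketch only: BROKER, the partition count, and the consumer group name are
# placeholders; the topic name is the one discussed in the log above.
TOPIC=eqiad.change-prop.transcludes.resource-change
BROKER=kafka-main1001.eqiad.wmnet:9092   # assumed main-eqiad broker address

# Current partition layout for the topic
kafka-topics.sh --bootstrap-server "$BROKER" --describe --topic "$TOPIC"

# Grow the partition count (Kafka only allows increasing it, never shrinking)
kafka-topics.sh --bootstrap-server "$BROKER" --alter --topic "$TOPIC" --partitions 12

# Check whether consumers (e.g. changeprop) picked up the new partitions
# and whether their lag is draining; the group name is a placeholder
kafka-consumer-groups.sh --bootstrap-server "$BROKER" --describe --group changeprop
```

Producers refresh topic metadata periodically and start writing to the new partitions on their own, which matches the "kafka handles it transparently" expectation above; consumers only pick up the extra partitions after a group rebalance, which is presumably why the changeprop pods, with their older Kafka client, needed a roll-restart.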