[00:55:31] 10serviceops, 10Shellbox, 10SyntaxHighlight, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), and 2 others: Pages with Pygments or Timeline intermittently fail to render (Shellbox server returned status code 503) - https://phabricator.wikimedia.org/T292663 (10Krinkle) 05Open→03Resolved
[01:11:15] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) I will see if I can get an ssh connection. Worst case we can resize the instance and increase our monthly bill by a few bucks. Thanks for not...
[01:18:33] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10RLazarus) a:03Andrew
[01:34:54] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) A ton of files in /srv/mediawiki/images/wikitech/archive, but deleteArchivedFiles.php --delete says there's nothing to delete. It's tempting to...
[05:50:59] 10serviceops, 10All-and-every-Wikisource, 10Thumbor: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10KTT-Commons) As a normal user who knows nothing about the subliminal coding, I'd like to report that the problem seems to be allevi...
[07:35:55] 10serviceops, 10SRE, 10ops-eqiad: mw1492 is down - https://phabricator.wikimedia.org/T338566 (10MoritzMuehlenhoff)
[09:37:10] 10serviceops, 10SRE, 10ops-eqiad: mw1492 is down - https://phabricator.wikimedia.org/T338566 (10elukey) ` elukey@cumin1001:~$ sudo ipmitool -I lanplus -H "mw1492.mgmt.eqiad.wmnet" -U root -E mc reset cold Unable to read password from environment Password: Error: Unable to establish IPMI v2 / RMCP+ session `...
[09:43:59] hello folks
[09:44:07] anything against me applying https://phabricator.wikimedia.org/T338357#8917297 ?
[09:44:12] to both kafka main clusters
[09:44:26] in the past we applied the same changes and it worked nicely
[09:44:37] kafka handles it transparently
[09:48:15] elukey: possible issues?
[09:48:54] (and by that I mean, is there a way this could make things worse for eventgate speed / the cluster as a whole?)
[09:49:44] claime: in theory no, the producers (eventgate) will start issuing new messages/events to the new partitions transparently after a bit, and the load should spread among multiple brokers (same thing for consumers)
[09:50:11] elukey: then I have no objection
[09:51:42] very well, proceeding :)
[09:56:20] {{done}}
[09:58:35] <_joe_> prod changes on a friday
[09:58:42] <_joe_> we're getting naughty here
[10:02:05] yes this one is borderline, my bad
[10:03:58] <_joe_> elukey: I'm just teasing the two of you :D
[10:06:17] aaand I believe that changeprop didn't like it very much, so I'll roll-restart the pods
[10:06:21] sigh
[10:06:23] claime: ok to proceed?
[10:06:51] I always forget that changeprop hates me
[10:07:04] since it uses an ancient kafka client
[10:07:11] elukey: yeah go ahead
[10:12:48] checking metrics, something is moving
[10:12:54] * elukey sighs at kafka and changeprop
[10:13:40] jobrunners working hard
[10:13:48] but that's expected
[10:15:52] ah but wait, transclusions are probably on changeprop-jobqueue, right?
[10:15:57] lemme check
[10:16:35] <_joe_> no
[10:16:37] <_joe_> they're not
[10:16:41] nope
[10:16:52] ok then I roll-restarted the correct one, still the message flow is low
[10:16:55] weird
[10:16:59] <_joe_> claime: elukey's change shouldn't impact the jobrunners
[10:17:40] https://grafana.wikimedia.org/goto/ZngXxx_Vz?orgId=1
[10:17:45] Then something else happened?
[10:18:18] <_joe_> https://grafana.wikimedia.org/d/wqj6s-unk/jobrunners?orgId=1&viewPanel=54&from=now-12h&to=now very worrisome trend
[10:18:24] <_joe_> and we know what changed
[10:18:44] Well yeah
[10:18:56] We may need to throw more hardware at the problem
[10:19:02] <_joe_> with transcludes lagging behind, more jobs from parsoidPreWarm actually do the work
[10:19:24] <_joe_> yes, keep an eye on the idle workers, both in the parsoid cluster and this one, and maybe move more servers
[10:19:31] <_joe_> we knew we'd need it eventually
[10:19:53] I may need some brainbounce here, see
[10:19:54] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&viewPanel=17
[10:20:05] (select transcludes-resource-change)
[10:20:26] but the dedupe rate went down
[10:20:30] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&viewPanel=24
[10:20:36] by more or less the same amount
[10:20:46] 10serviceops, 10All-and-every-Wikisource, 10Thumbor: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Snaevar) >>! In T337649#8917000, @KTT-Commons wrote: > As a normal user who knows nothing about the subliminal coding, I'd like to...
[10:21:01] <_joe_> elukey: what I see is that the big flood of transcludes finished just before you did the restart?
[10:21:43] _joe_ not sure, the traffic to the topic is lower now afaics https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.change-prop.transcludes.resource-change&from=now-3h&to=now
[10:21:54] but changeprop is still working fine, nothing horrible in the logs
[10:21:58] same for eventgate
[10:24:24] <_joe_> elukey: I don't know what to tell you tbh
[10:25:11] <_joe_> I would give it an hour, see if things get up to speed again
[10:25:28] 10serviceops, 10RESTbase Sunsetting, 10Parsoid (Tracking): Enable WarmParsoidParserCache on all wikis - https://phabricator.wikimedia.org/T329366 (10Clement_Goubert) We need to move more servers from the parsoid cluster to the jobrunners: Parsoid saturation last 24h: {F37099007} Jobrunners saturation last 24...
[10:25:35] +1 I am thinking the same, I am very worried about the state of the kafka client in changeprop
[10:25:39] we should really upgrade it
[10:25:49] <_joe_> if you zoom out you can see this is nothing particularly strange
[10:25:51] <_joe_> https://grafana.wikimedia.org/d/000000234/kafka-by-topic?orgId=1&refresh=5m&var-datasource=eqiad%20prometheus%2Fops&var-kafka_cluster=main-eqiad&var-kafka_broker=All&var-topic=eqiad.change-prop.transcludes.resource-change&from=now-30d&to=now&viewPanel=6
[10:26:02] <_joe_> elukey: indeed. Someone needs to take care of changeprop
[10:26:05] * elukey nods
[10:26:35] <_joe_> so what happens here I think is
[10:27:25] <_joe_> if a big root job (i.e. a template with a lot of usage) has changed, you will get a progressive explosion of it into individual events on this queue
[10:29:05] (in the meantime traffic is following a better trend)
[10:33:56] <_joe_> "better"
[10:34:16] 10serviceops: operations/docker-images/production-images contains references to non-existent image python3 - https://phabricator.wikimedia.org/T336682 (10Clement_Goubert) Proposal to update `prometheus-nutcracker-exporter` to python3-bullseye, and remove `python3-build-stretch`, `python3-devel` and `python3` ima...
[10:34:25] <_joe_> considering I think all those jobs should be removed :P
[10:47:09] ack, seems all good now, will check later :)
[15:12:20] 10serviceops, 10Machine-Learning-Team: Can't delete images from docker registry (from build2001 using docker-registryctl) - https://phabricator.wikimedia.org/T338623 (10klausman)
[15:21:21] 10serviceops, 10Machine-Learning-Team: Can't delete images from docker registry (from build2001 using docker-registryctl) - https://phabricator.wikimedia.org/T338623 (10klausman) 05Open→03Invalid This was caused by me using the wrong host. What I _should_ have used: `docker-registryctl delete-tags docke...
[15:31:48] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) ` root@wikitech-static:/srv/mediawiki/images/wikitech/archive# df -h Filesystem Size Used Avail Use% Mounted on udev 979M...
[15:32:31] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10Andrew) 05Open→03Resolved > > No let's see if that was actually the problem... It was!
[16:17:17] 10serviceops, 10Shellbox, 10wikitech.wikimedia.org: Shellbox is broken on wikitech-static due to disk fullness - https://phabricator.wikimedia.org/T338520 (10RLazarus) Thank you!
[17:08:56] 10serviceops, 10Wikimedia-Developer-Portal, 10Kubernetes: Deployment of developer-portal into 'staging' k8s cluster failing due to insufficient cpu and node taints - https://phabricator.wikimedia.org/T338493 (10bd808) ` $ kubectl get pods NAME READY STATUS RESTART...
[17:11:43] * bd808 doesn't have any idea how he can fix T338493
[19:09:18] 10serviceops, 10SRE, 10envoy: Remove tls_minimum_protocol_version from envoy config - https://phabricator.wikimedia.org/T337453 (10BCornwall)
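For reference on the partition change discussed above (T338357), a sketch of how a topic's partition count is typically grown and then verified with the stock Kafka CLI tools. Only the topic name comes from the Grafana links in the log; the broker address, target partition count, and consumer group name are assumptions for illustration, not the actual values from the task.

```sh
# Sketch only: BROKER, the partition count, and the consumer group name are
# placeholders; the topic name is the one discussed in the log above.
TOPIC=eqiad.change-prop.transcludes.resource-change
BROKER=kafka-main1001.eqiad.wmnet:9092   # assumed main-eqiad broker address

# Current partition layout for the topic
kafka-topics.sh --bootstrap-server "$BROKER" --describe --topic "$TOPIC"

# Grow the partition count (Kafka only allows increasing it, never shrinking)
kafka-topics.sh --bootstrap-server "$BROKER" --alter --topic "$TOPIC" --partitions 12

# Check whether consumers (e.g. changeprop) picked up the new partitions
# and whether their lag is draining; the group name is a placeholder
kafka-consumer-groups.sh --bootstrap-server "$BROKER" --describe --group changeprop
```

Producers refresh topic metadata periodically and start writing to the new partitions on their own, which matches the "kafka handles it transparently" expectation above; consumers only pick up the extra partitions after a group rebalance, which is presumably why the changeprop pods, with their older Kafka client, needed a roll-restart.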