[00:08:48] serviceops, Dumps-Generation, MW-on-K8s, Release-Engineering-Team: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650 (VirginiaPoundstone) Thanks for the flag @Milimetric. @Joe thanks for thinking this through. I have three follow-u...
[01:27:58] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (Bodhisattwa) There has been considerable improvement for most...
[07:55:12] serviceops, Dumps-Generation, MW-on-K8s, Release-Engineering-Team: Migrate current-generation dumps to run from our containerized images - https://phabricator.wikimedia.org/T352650 (Joe) >>! In T352650#9388845, @VirginiaPoundstone wrote: > @Joe thanks for thinking this through. I have three follo...
[08:41:57] serviceops, Traffic: Handling inbound IPIP traffic on low traffic LVS k8s based realservers - https://phabricator.wikimedia.org/T352956 (Vgutierrez)
[08:51:51] serviceops, MW-on-K8s, Discovery-Search (Current work): mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (dcausse) @hnowlan do you think we could move this job back to the old job runners as Erik suggests while this issue is getting...
[09:20:35] serviceops, MW-on-K8s, Discovery-Search (Current work): mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (JMeybohm)
[09:20:44] serviceops, MW-on-K8s, SRE, Traffic, Release-Engineering-Team (Seen): Move MediaWiki jobs to mw-on-k8s - https://phabricator.wikimedia.org/T349796 (JMeybohm)
[09:21:22] hnowlan: o/
[09:21:40] when you have a moment do you mind to recheck https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/971113 before I merge?
[09:21:58] after this I should be done with changeprop, I promise :D
[09:32:09] serviceops, MW-on-K8s, Discovery-Search (Current work), Patch-For-Review: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (akosiaris) @dcausse, since no `cirrusCheckerJob` exists, I assume we are talking about `cirrusSearchChec...
[09:35:40] dcausse: I 'll revert the job back to baremetal now to stop the bleeding, but how we 'll we know it worked?
[09:36:32] akosiaris: thanks! I'm looking at the graphs, can't tell if it worked yet but will tell you soon
[09:36:48] oh, I haven't deployed yet
[09:36:53] oh ok
[09:37:06] that's why then :)
[09:37:06] I need like another 2 minutes, waiting for CI
[09:37:12] sure np!
[09:39:16] basically the series cloudelactic.fixed should be almost flat in https://grafana-rw.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=1701938337210&to=1701941937210&viewPanel=35
[09:41:32] serviceops, MW-on-K8s, Discovery-Search (Current work), Patch-For-Review: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (dcausse) @akosiaris thanks for the quick revert, the impact should be visible when looking at the cloude...
[09:43:25] dcausse: deploy done
[09:43:38] thanks!
will monitor the graph and let you know
[09:43:57] cool, thanks
[09:48:02] serviceops, MW-on-K8s, Discovery-Search (Current work), Patch-For-Review: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (dcausse) I can confirm that the last deploy worked, the fixed rate for cloudelastic is back to 0, thanks!
[09:48:07] worked, thanks!
[09:57:18] <_joe_> jayme: is there a place on logstash or grafana where I can answer the question "how often does thumbor get OOM killed"?
[09:57:26] yw
[09:58:17] _joe_: yes'ish
[09:59:27] <_joe_> I am suspecting that that's the source of the poolcounter issues we have on thumbor
[10:00:38] https://logstash.wikimedia.org/goto/9aed389c5258f43b6f0b481b5cbbb40c maybe...but that data does not seem correct
[10:00:50] also it does not filter by actual reason for killing
[10:01:04] kube-state-metrics should know better I guess
[10:02:45] yeah, there is kube_pod_container_status_terminated_reason{reason="OOMKilled"}
[10:02:56] <_joe_> if the client gets OOM killed, the poolcounter connection is not released properly and server-side it seems "up" until we hit the tcp timeout. Now the problem is that poolcounter counts active tcp connections
[10:04:23] <_joe_> the difference with bare metal is that on bare metal what got the SIGKILL was just the shellout
[10:04:46] <_joe_> that didn't affect the controlling python process
[10:08:01] serviceops, MW-on-K8s, Discovery-Search (Current work), Patch-For-Review: mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (akosiaris) Diving into the certificate thing now. The very first theory, that `ca-certificates` isn't in...
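Editorial sketch for the OOM-kill question above (09:57:18 to 10:02:45): a minimal Python example of filtering the `kube_pod_container_status_terminated_reason` series mentioned by jayme for `reason="OOMKilled"`. It assumes the samples are available as Prometheus text exposition lines and that the series carries `namespace`, `pod` and `container` labels; the function name `oom_killed_containers` and the pod names in the usage lines are hypothetical. The series is a point-in-time gauge, so this lists containers currently reported as OOM-killed and does not by itself answer "how often".

```python
import re

# One sample line of the Prometheus text exposition format:
#   metric_name{label="value",...} <number>
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Series name quoted in the channel at 10:02:45.
METRIC = "kube_pod_container_status_terminated_reason"


def oom_killed_containers(exposition_text):
    """Return sorted (namespace, pod, container) tuples reported as OOMKilled."""
    hits = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != METRIC:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        # The gauge is 1 for the reason that currently applies, 0 otherwise.
        if labels.get("reason") == "OOMKilled" and float(match.group("value")) == 1.0:
            hits.add((
                labels.get("namespace", ""),
                labels.get("pod", ""),
                labels.get("container", ""),
            ))
    return sorted(hits)


# Hypothetical sample data, for illustration only.
sample = '''
# TYPE kube_pod_container_status_terminated_reason gauge
kube_pod_container_status_terminated_reason{namespace="thumbor",pod="thumbor-example-1",container="thumbor",reason="OOMKilled"} 1
kube_pod_container_status_terminated_reason{namespace="thumbor",pod="thumbor-example-2",container="thumbor",reason="OOMKilled"} 0
kube_pod_container_status_terminated_reason{namespace="thumbor",pod="thumbor-example-3",container="thumbor",reason="Error"} 1
'''
print(oom_killed_containers(sample))
```

Counting kills over time would need the same filter applied to a time range in the metrics backend rather than to a single scrape.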
[10:08:18] oh, I am looking at the wrong image, I need the envoy one
[10:10:00] I am not even looking at the correct pod, sigh
[10:11:01] <_joe_> ahah
[10:12:55] well, 20200601~deb10u2
[10:13:00] that looks old
[10:13:38] _joe_: ohhhh that makes a lot of sense :|
[10:14:48] jayme: IIRC we can't get envoy in bullseye because libc version incompatibilities with the built binary, right ?
[10:16:36] akosiaris: it's that we cant get envoy > 1.23 to run on buster
[10:22:03] I see from T337649 just now a complaint that a 1.2G pdf isn't getting thumbnailed properly...
[10:22:38] a 1.2G pdf ?
[10:22:46] it's 430 pages
[10:22:59] https://commons.wikimedia.org/wiki/File:%E0%A6%B8%E0%A6%AE%E0%A6%BE%E0%A6%9A%E0%A6%BE%E0%A6%B0_%E0%A6%A6%E0%A6%B0%E0%A7%8D%E0%A6%AA%E0%A6%A3_-_%E0%A6%96%E0%A6%A3%E0%A7%8D%E0%A6%A1_%E0%A7%A7%E0%A7%AE_(%E0%A7%A7%E0%A7%AE%E0%A7%A9%E0%A7%AC).pdf
[10:23:07] all scanned as TIFFs at the maximum resolution ?
[10:23:22] pass, I'm still waiting for it to fit down my wet string so I can look at it
[10:25:41] ah, yes, the metadata says its constructed by gscan2pdf and each page is 866.88 x 1226.2 pts or similar
[10:31:42] serviceops, MW-on-K8s, Discovery-Search (Current work): mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (MoritzMuehlenhoff) >>! In T352906#9389619, @akosiaris wrote: > Edit: this is just plain wrong, I was looking at the wrong pod+...
[11:10:52] <_joe_> Emperor: I would argue that the problem is the user experience is bad.
We should know better than trying to show thumbnails of such large files on-wiki
[11:12:06] Mmm, that PDF is really 430 images glued together, which doesn't seem ideal in a number of regards
[11:13:38] serviceops, MinT, Language-Team (Language-2023-October-December), Patch-For-Review: Provide python3-build-bookworm docker image - https://phabricator.wikimedia.org/T352733 (Clement_Goubert) Summary of the discussion on the CR: - Since bookworm, installing `pip` packages system-wide triggers an er...
[11:37:59] <_joe_> Emperor: yes but People Will Upload Weird Things
[11:38:07] <_joe_> we need to account for that
[11:43:16] thumbor having an awareness of things there are zero hope of it ever serving would save us a lot of headaches/tickets
[11:48:29] "make a thumbnail of page X of this PDF" wouldn't be hard were it not for the time taken just to transfer the whole lot to the thumbnailing system
[12:02:54] <_joe_> hnowlan: I'm talking mediawiki first of all
[12:03:14] serviceops, All-and-every-Wikisource, Thumbor, MW-1.41-notes (1.41.0-wmf.13; 2023-06-13), Patch-For-Review: Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (hnowlan) >>! In T337649#9388968, @Bodhisattwa wrote: > There...
[12:03:32] <_joe_> but yes, it would be great too... but that would need us to add an admission layer to thumbor that would talk to the mw api to retrieve the metadata I guess
[12:03:34] _joe_: sure, why not both
[12:03:52] <_joe_> or we could also just add some admission parameter
[12:04:02] <_joe_> like file size depending on type
[12:04:13] <_joe_> without needing to get the metadata
[12:04:31] <_joe_> hnowlan: how many rps does thumbor do?
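Editorial sketch for the admission idea floated above (12:03:52 to 12:04:13): a minimal Python example of an admission check keyed on file type and size, with no metadata lookup. Everything here is an assumption made for illustration: the names `MAX_BYTES_BY_TYPE`, `DEFAULT_MAX_BYTES` and `admit`, and all of the limit values. None of it is taken from Thumbor or MediaWiki.

```python
# Hypothetical per-type size ceilings in bytes; the numbers are illustrative only.
MAX_BYTES_BY_TYPE = {
    "pdf": 500 * 1024 * 1024,
    "djvu": 500 * 1024 * 1024,
    "tiff": 200 * 1024 * 1024,
}

# Hypothetical fallback for types without an explicit entry.
DEFAULT_MAX_BYTES = 100 * 1024 * 1024


def admit(file_type, size_bytes):
    """Return True if a file of this type and size should be sent for thumbnailing."""
    limit = MAX_BYTES_BY_TYPE.get(file_type.lower(), DEFAULT_MAX_BYTES)
    return size_bytes <= limit


# The 1.2G, 430-page PDF discussed earlier would be rejected under these example limits.
print(admit("pdf", int(1.2 * 1024 ** 3)))
print(admit("PDF", 10 * 1024 * 1024))
```

A check like this only needs the type and size already known at request time, which is the trade-off against the metadata-based admission layer mentioned at 12:03:32.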
[12:06:53] _joe_: about 20 rps per dc
[12:08:03] wait, no
[12:08:57] about 60
[12:15:53] serviceops, MW-on-K8s, Discovery-Search (Current work): mediawiki k8s jobrunner fails connecting to cloudelastic with a TLS error - https://phabricator.wikimedia.org/T352906 (akosiaris) >>! In T352906#9389671, @MoritzMuehlenhoff wrote: >>>! In T352906#9389619, @akosiaris wrote: >> Edit: this is just...
[14:22:02] Hey! I have this patch for mobileapps staging to disable a feature (that we don't use either way) because it causes timeouts. Can somebody take a look? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/981330
[14:24:15] nemo-yiannis: +1ed
[14:24:38] thanks!
[14:49:02] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2004.codfw.wmnet with OS bullseye
[14:49:08] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye
[14:49:14] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye
[15:03:07] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm)
[15:03:43] serviceops, Content-Transform-Team-WIP, RESTBase, RESTBase Sunsetting, and 6 others: PCS caching and pregeneration when restbase is decommissioned - https://phabricator.wikimedia.org/T319365 (Jgiannelos)
[15:08:41] serviceops,
Content-Transform-Team-WIP, Parsoid, RESTBase, and 3 others: Requests originating from zhwiki wikifeeds caused parsoid outage - https://phabricator.wikimedia.org/T346657 (Jgiannelos) Open→Resolved a:Jgiannelos
[15:33:52] serviceops, Content-Transform-Team-WIP, Maintenance-Worktype: Improve log readability of kubernetes applications when logging in debug level - https://phabricator.wikimedia.org/T347717 (Jgiannelos) Open→Resolved
[15:36:45] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2005.codfw.wmnet with OS bullseye executed with errors: -...
[15:37:10] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host sessionstore2006.codfw.wmnet with OS bullseye executed with errors: -...
[16:08:50] serviceops, AbuseFilter, PHP 7.4 support: Regular expression "х[ÿý]и" match "х и" in Abusefilter - https://phabricator.wikimedia.org/T340068 (jijiki) Are there any actionables here for #serviceops ?
[17:17:49] hello! FYI I plan to deploy https://gerrit.wikimedia.org/r/c/mediawiki/services/change-propagation/+/974267/ https://phabricator.wikimedia.org/T351247 - change propagation should discard canary events on Monday Dec 11. Please let me know if there are any objections.
[17:34:02] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Jhancock.wm) having an issue with all the new sessionstore servers that I think stems from the HBA355i Fnt card. When the install gets to partitioning the drives, I...
[20:16:32] serviceops, Data-Engineering, Data-Platform-SRE, SRE, and 3 others: Upgrade Kafka to from 1.x to later version - https://phabricator.wikimedia.org/T300102 (Ottomata)
[20:37:26] serviceops: Multiple images fail to build from sources - https://phabricator.wikimedia.org/T350366 (Ottomata) Hm, is it possible this was a temporary issue with downloading https://archive.apache.org/dist/flink/flink-kubernetes-operator-1.4.0/flink-kubernetes-operator-1.4.0-src.tgz.sha512 ? It downloads fin...
[21:42:42] serviceops, DC-Ops, SRE, ops-codfw: Q2:rack/setup/install 3 sessionstore hosts (codfw) - https://phabricator.wikimedia.org/T349876 (Eevans) >>! In T349876#9391164, @Jhancock.wm wrote: > having an issue with all the new sessionstore servers that I think stems from the HBA355i Fnt card. > > When t...
[21:46:42] serviceops, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install 3 sessionstore hosts (eqiad) - https://phabricator.wikimedia.org/T349875 (Jclark-ctr)
[21:51:16] serviceops, MW-on-K8s: Handle sidecar containers in one-off Kubernetes jobs - https://phabricator.wikimedia.org/T348284 (RLazarus) @JMeybohm Can you help with an RBAC issue? I'm adapting the manifest from [[https://github.com/otherguy/k8s-controller-sidecars/blob/main/manifest.yml | upstream]], my curre...
[21:58:14] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Jclark-ctr)
[22:05:42] serviceops, DC-Ops, SRE, ops-eqiad, Patch-For-Review: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Jclark-ctr)
[22:15:44] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye
[22:16:36] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye
[22:16:39] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye
[22:16:41] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye
[23:52:27] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1060.eqiad.wmnet with OS bullseye completed: - kubernetes1060 (**WARN**)...
[23:52:31] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1059.eqiad.wmnet with OS bullseye completed: - kubernetes1059 (**PASS**)...
[23:52:33] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1062.eqiad.wmnet with OS bullseye completed: - kubernetes1062 (**WARN**)...
[23:52:37] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1061.eqiad.wmnet with OS bullseye completed: - kubernetes1061 (**WARN**)...
[23:53:59] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Jclark-ctr) a:VRiley-WMF→Jclark-ctr
[23:54:04] serviceops, DC-Ops, SRE, ops-eqiad: Q2:rack/setup/install 4 parsoid hosts - https://phabricator.wikimedia.org/T349874 (Jclark-ctr) Open→Resolved