[04:53:27] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (wiki_willy)
[04:54:03] serviceops, DC-Ops, SRE, ops-eqiad: Q1:(Need By: TBD) rack/setup/install kubernetes10[18-21] - https://phabricator.wikimedia.org/T290202 (wiki_willy) Updated task description based on @JMeybohm's comment >>! In T290202#7327682, @JMeybohm wrote: > The latest kubernetes node there is is kubernetes...
[06:29:59] good morning folks, I can finally say that we have an ores service deployed via helm3 that works :)
[06:30:02] \o/
[06:33:08] <_joe_> oh wow
[06:33:16] <_joe_> that's a great milestone
[06:45:31] <_joe_> elukey: does it work too?
[07:09:29] _joe_ yep yep it returns a score for a given edit
[07:09:41] (atm I just deployed one model for enwiki, we'll need to add more)
[07:14:28] we are using api-ro to get edits btw, in case it is not the right one lemme know
[07:14:52] <_joe_> definitely is if you're just reading
[07:14:54] <_joe_> but
[07:15:03] our endpoint (inference.discovery.wmnet) will be behind api-gateway at some point
[07:15:03] <_joe_> why aren't you using envoy as a middleware?
[07:23:01] _joe_ yep we did, there should also be a couple of nodes for a staging cluster
[07:24:59] there will also be a codfw cluster in active-active
[07:25:08] so for the moment something like 8 worker nodes
[07:25:58] I think that in Q2 we'll get another 8 worker nodes in total
[07:26:05] so 8w nodes per cluster
[09:21:33] serviceops, Lift-Wing, Kubernetes, Machine-Learning-Team (Active Tasks), Patch-For-Review: Discussion: dedicated directory in the deployment-chart repository for ML services - https://phabricator.wikimedia.org/T286791 (elukey) Open→Resolved a:elukey This has been implemented, and...
[09:36:49] serviceops, Observability-Metrics, User-jijiki: Measure segfaults in mediawiki and parsoid servers - https://phabricator.wikimedia.org/T246470 (elukey) I can confirm, thanks a lot for the quick fix!
[12:18:20] serviceops, Release-Engineering-Team, GitLab (Infrastructure), Patch-For-Review, User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (Jelto) I prepared the version bump for gitlab-ce and gitlab-runner to 14.3 in https://gerrit.wikimedia.org/r/725303. @bre...
[12:46:15] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 3 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (dcausse) p:Triage→High From an MW app server I can't connect to any o...
[12:47:46] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 4 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (jcrespo)
[13:48:19] serviceops, Scap, Patch-For-Review, Release-Engineering-Team (Doing): scap's canary check gives confusing logstash link - https://phabricator.wikimedia.org/T291870 (hashar)
[13:48:35] serviceops, GitLab (Infrastructure), Patch-For-Review, Release-Engineering-Team (Next), User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (hashar)
[14:32:59] serviceops, CirrusSearch, Discovery-Search, Traffic, and 3 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (jcrespo) Errors seem to have receded a lot since 14:05: {F34664037}
[14:33:41] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 4 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (jcrespo)
[15:10:08] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 5 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (BBlack) Recapping from an IRC conversation: this was a fallout of the great L...
[15:27:09] serviceops, CirrusSearch, Discovery-Search, Infrastructure-Foundations, and 6 others: Half a million of CirrusSearch jobqueue execution errors per hour since 2021-09-30 16:02 - https://phabricator.wikimedia.org/T292291 (jcrespo) For more longer term, I also would like to wonder if there something...
[15:43:59] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.1 - https://phabricator.wikimedia.org/T291095 (dancy) >>! In T291095#7390898, @jijiki wrote: > @dancy it would be lovely if we can speed this up, right now we have `deploy1002` and `maps*` on version 3.17.1, and the rest on...
[15:57:30] serviceops, GitLab (Infrastructure), Patch-For-Review, Release-Engineering-Team (Next), User-brennen: GitLab minor release: 14.3.1 - https://phabricator.wikimedia.org/T292256 (brennen) > @brennen did you plan to upgrade the runners as well? Otherwise I can remove them from the change. Yeah,...
[16:26:28] #wikimedia-operations
[16:26:34] oops, nevermind
[16:44:48] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.2 - https://phabricator.wikimedia.org/T291095 (dancy) Stalled→Open
[16:45:03] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.2 - https://phabricator.wikimedia.org/T291095 (dancy) @jijiki 4.0.2 is tagged and ready for a retry.
[16:46:40] I had a "doh! That was dumb." epiphany about Toolhub and url-downloader last night. I was setting http_proxy as an envvar, but not https_proxy. I thought incorrectly that https://.... requests would fall back to the http_proxy but they do not.
[16:48:25] After fixing that, my crawler now errors differently. Progress! But the error now appears to be url-downloader refusing to proxy to a Cloud VPS public IP. I need to stare at the proxy config to confirm, but this may be as "easy" to fix as removing a banned destination block from the proxy config.
[17:15:35] serviceops, Data-Persistence-Backup, GitLab (Infrastructure), Patch-For-Review, User-brennen: Backups for GitLab - https://phabricator.wikimedia.org/T274463 (Dzahn) We used the installation of the restore script as a practical example in a Gerrit/code review session. We went through puppet c...
[17:22:15] bd808: recently had an issue with planet fetching mixed URLs (most https but a few http) and http_proxy vs https_proxy. I ended up setting only HTTPS_PROXY as env var, not even trying to set both anymore. and I set that to https://url-downloader.%{::site}.wikimedia.org:8080 and that worked. of course that isn't the cloud VPS part though
[17:24:55] I had been going back and forth a bit between the 4 variants as in https://stackoverflow.com/questions/58559109/difference-between-http-proxy-and-https-proxy
[17:27:02] mutante: I found https://about.gitlab.com/blog/2021/01/27/we-need-to-talk-no-proxy/ last week and learned a lot about that pseudo standard as well. :)
[17:29:37] heh, yea, it seems a bit undefined:) just saying that worked for me in wmf
[17:30:23] as opposed to the "export https_proxy="$http_proxy"" we also see a lot in bashrc files etc
[17:31:26] lol @ lowercase vs uppercase
[18:09:51] <_joe_> bd808: my general recommendation on the topic is https://github.com/wikimedia/operations-docker-images-docker-pkg/blob/b32ad662bbf60fac69e0bd5d1724169cc6219d01/docker_pkg/image.py#L154-L163
[18:10:47] that's pretty close to what I'm doing now but without the uppercase variants.
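(Editor's sketch, not part of the log.) The proxy thread above comes down to two points: https:// requests did not fall back to http_proxy, and tools disagree on lowercase vs uppercase names. One way to sidestep both, as a rough illustration only, is to set all four spellings together. The helper name proxy_env and the proxy URL below are hypothetical, and this is not claimed to match the code behind _joe_'s link.

```python
import os
import urllib.request


def proxy_env(proxy_url, no_proxy=""):
    """Build a dict that sets every common spelling of the proxy variables.

    proxy_env is a hypothetical helper for illustration, not an existing API.
    """
    env = {}
    for name in ("http_proxy", "https_proxy"):
        # Some tools read only the lowercase name, others only the uppercase one.
        env[name] = proxy_url
        env[name.upper()] = proxy_url
    if no_proxy:
        env["no_proxy"] = no_proxy
        env["NO_PROXY"] = no_proxy
    return env


if __name__ == "__main__":
    # Placeholder proxy URL (assumption); substitute the real proxy host and port.
    os.environ.update(proxy_env("http://proxy.example.org:8080"))
    # getproxies() reports which proxy urllib would pick for each URL scheme.
    print(urllib.request.getproxies())
```

Printing urllib.request.getproxies() from inside the failing environment is a cheap way to check whether the variables actually reached the process, which may help separate "variable not set" from "proxy refuses the destination".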
[18:11:28] It works in a python3 REPL from deploy1002.eqiad.wmnet, but is still failing from inside the toolhub namespace in the eqiad k8s cluster.
[19:20:37] serviceops, MediaWiki-General, SRE, observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (colewhite)
[21:40:23] serviceops, MW-on-K8s, Shellbox: Support large files in Shellbox - https://phabricator.wikimedia.org/T292322 (Legoktm)
[21:41:42] the puppetmaster crons "sync-volatile" and "sync-ca" which copied data from active to passive master are now also replaced by timers. new units: sync-puppet-volatile and sync-puppet-ca
[22:22:11] serviceops, MediaWiki-General, SRE, observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Krinkle)
[23:19:49] serviceops, Anti-Harassment, IP Info, SRE, Patch-For-Review: Update MaxMind GeoIP2 license key and product IDs for application servers - https://phabricator.wikimedia.org/T288844 (Dzahn) After quite some necessary puppet changes (see above) we are now at a state where we could successfully do...