[08:27:48] serviceops, Wikidata-Query-Service: Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (dcausse)
[08:32:08] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: The mediawiki-webserver image should only log in json format - https://phabricator.wikimedia.org/T285384 (Joe) Open→Resolved
[09:03:41] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[09:04:14] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[09:10:20] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Michael) Is there an update on this? Anything we (WMDE) can do to help this move forward?
[09:11:02] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup) I'm on it
[09:16:32] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[09:24:02] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[09:41:30] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1285.eqiad.wmnet` - m...
[10:13:27] jayme, _joe_ - for the iptables issue I think that we could simply create a component/iptables for buster, import packages from backports and use apt_from_component in puppet
[10:13:39] (no moving targets, easier install process, etc..)
[10:13:42] ok for you?
[10:17:41] <_joe_> elukey: sure, it's not even really a debate I think :)
[10:18:00] perfect, going to file code changes :)
[10:39:11] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1269.eqiad.wmnet` - m...
[11:01:43] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[11:48:38] elukey: I'm afraid I'm lacking context, but it sounds okay :p
[12:25:24] jayme: ah sorry!
So do you recall the performance regression on ml-serve-ctrl nodes? Basically it was due to a bug (https://github.com/kubernetes/kubernetes/issues/82361): the same iptables rules are added over and over, which ends up degrading network performance
[12:25:56] the fix is to upgrade iptables, for us it means using the buster backports version
[12:26:20] (it affects only buster nodes)
[12:48:28] ah, okay. Great!
[12:48:44] How did you figure that out in the end?
[12:50:47] all tracked from https://phabricator.wikimedia.org/T287238#7236456 - I found only a weird ping latency pattern, and Cathal did the real network investigation magic :)
[12:53:25] * jayme reading
[12:53:27] thanks
[13:11:12] jayme: https://gerrit.wikimedia.org/r/c/operations/puppet/+/708258 is pending, when you have a moment can you let me know if it is the right place or not?
[13:11:26] after that I'll roll it out on ml-serve nodes
[13:14:23] I did some spelunking in netfilter git and it might be as simple as backporting http://git.iptables.org/iptables/commit/?id=f7fa88020f3bc4ec646ce2a48731a1f5fa2aa0a9 on top of what's in buster
[13:14:24] (I may have to tweak the iptables package list with its apt-cache deps, but it is a minor nit, working on it)
[13:14:29] can we easily repro this?
[13:14:52] we can yes
[13:15:17] it takes a bit but without the right iptables the number of rules for ipv4 grows over time
[13:15:23] *grow
[13:15:50] moritzm: using 1.8.5 should be fine though, any concerns??
[13:17:55] no, 1.8.5 is certainly fine for now, just proceed with the ML nodes.
but it would also be good to actually get to the bottom of this and make sure that it's fixed in buster (since the patch seems sane enough for a point release), after all there's also going to be an update of the main k8s workers to buster at some point
[13:18:21] ah yes righ
[13:18:24] *right
[13:18:50] there was a debian bug report mentioned in the gh issue, but nobody backported a fix to buster
[13:18:53] afaics
[13:20:18] I'll have a quick look later
[13:21:27] moritzm: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=939924
[13:26:11] hiya, trying to recall how to explicitly choose a release to deploy with helmfil
[13:26:12] e
[13:26:21] i guess i need to pass --name=canary to helm
[13:26:26] can I do that through helmfile?
[13:27:53] uhhh after 10 minutes of reading --help and googling, i post my question, and now i see the --selector
[13:27:55] nm
[13:27:58] --selector name=canary
[14:35:45] ottomata: I think you were looking for this https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Deploying_with_helmfile :p
[14:35:56] sorry for being late :)
[14:37:14] that's it thanks!
[14:55:01] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (dcausse) p:Triage→High
[15:01:21] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Joe) I think there are two options, depending on the level of security we want to achieve and the urg...
[15:01:32] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (JMeybohm) a:JMeybohm Looking into this. Problem is that we currently do not allow Pods...
[15:34:39] serviceops, observability, GitLab (Initialization), Patch-For-Review: Define monitoring for gitlab - https://phabricator.wikimedia.org/T275170 (Jelto) The scrape configuration for GitLab is in place and Prometheus collects metrics. I imported and adapted the [upstream GitLab Grafana dashboads](...
[15:49:36] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Ladsgroup) >>! In T285104#7239822, @Joe wrote: > I think there are two options, depending on the leve...
[15:54:34] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe) a:Joe Status update: the mwdebug installation is now reachable from external users via the Wikimedia Debug browser extensio...
[16:10:50] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[16:12:46] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[16:14:29] serviceops, MW-on-K8s, SRE, Patch-For-Review, User-jijiki: Create a mwdebug deployment for mediawiki on kubernetes - https://phabricator.wikimedia.org/T283056 (Joe)
[16:15:08] serviceops, SRE, Traffic, Patch-For-Review, User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (Joe) Open→Resolved a:Joe
[16:15:54] serviceops, SRE, Traffic, Patch-For-Review, User-jijiki: Access mwdebug kubernetes deployment via the 'X-Wikimedia-Debug' header - https://phabricator.wikimedia.org/T286491 (Joe)
[16:26:55] serviceops, MW-on-K8s, SRE,
User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (Joe)
[16:27:20] serviceops, MW-on-K8s, SRE, User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (Joe) p:Triage→Medium a:Joe→dancy
[17:16:11] serviceops, Wikidata, Wikidata-Query-Service, wdwb-tech, Discovery-Search (Current work): Flink jobmanager and taskmanager cannot talk to the k8s api server - https://phabricator.wikimedia.org/T287443 (dcausse) I'm now getting: ` {"@timestamp":"2021-07-27T16:59:20,553","log.level":"ERROR","m...
[18:09:59] serviceops, SRE, Traffic, Datacenter-Switchover: During DC switch, helm-charts failed verification because it doesn't have a service IP - https://phabricator.wikimedia.org/T285707 (Legoktm) p:Triage→High
[18:10:24] serviceops, MW-on-K8s, SRE: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (Krinkle)
[18:14:46] serviceops, MW-on-K8s, SRE: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (Legoktm) p:Triage→Medium
[20:58:42] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm) >>! In T285104#7240005, @Michael wrote: >>>! In T285104#7239822, @Joe wrote: >> * How string...
[21:01:30] serviceops, SRE, Services, Wikibase-Quality-Constraints, and 3 others: Deploy Shellbox instance (shellbox-constraints) for Wikidata constraint regexes - https://phabricator.wikimedia.org/T285104 (Legoktm) >>! In T285104#7239822, @Joe wrote: > I think there are two options, depending on the level...
[21:17:26] serviceops, Wikimedia-Logstash, observability, GitLab (Initialization), and 2 others: Logging for GitLab - https://phabricator.wikimedia.org/T274462 (colewhite) Open→Resolved a:colewhite Gitlab logs are now in Logstash. \o/
[21:53:04] serviceops, MW-on-K8s, SRE: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (Legoktm) Currently scap generates the GitInfo "cache" files, see https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/tools/scap/+/refs/heads/master/scap/tasks.py#151 and...
[21:55:49] serviceops, MW-on-K8s, SRE: GitInfo is missing from mwdebug-kubernetes deployment - https://phabricator.wikimedia.org/T287512 (dancy) a:dancy I can take this.
[22:15:44] serviceops, MW-on-K8s, SRE: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (jeena)
[22:16:26] serviceops, MW-on-K8s, SRE, Patch-For-Review, Release-Engineering-Team (Doing): Check out www-portals repo in the mediawiki-webserver and in the mediawiki-multiversion images - https://phabricator.wikimedia.org/T285325 (jeena) Open→Resolved a:jeena We had updated the jenkins confi...
[23:30:43] serviceops, Wikimedia-Logstash, observability, GitLab (Initialization), and 2 others: Logging for GitLab - https://phabricator.wikimedia.org/T274462 (brennen) Thanks for all of the assistance!