[00:28:59] Legoktm: I just replaced it with the pair of non-meta-packages (python and pkg-config)
[00:47:47] serviceops, MW-on-K8s, SRE: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) >>! In T288848#7292923, @TK-999 wrote: > For the record, to resolve the same issue during our effort to upgrade Fandom's MW-on-k8s deployment, we ended up creating an...
[00:53:57] James_F: hmm, python-pkgconfig is a Python library that wraps pkg-config, so my guess is the proper dependency all along was python, pkg-config, not that library which just happened to pull in the correct dependencies
[00:55:28] Yeah.
[00:55:52] Not sure exactly what things were using it; it’s copy-pasted into a bunch of places.
[00:56:24] Switching to the pair worked for the three repos for which I wrote patches.
[00:58:53] cool
[01:24:54] serviceops, Discovery, Wikimedia-Site-requests, Technical-Debt: Consider splitting search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (Legoktm)
[06:55:24] serviceops, Discovery, Discovery-Search, Wikimedia-Site-requests, Technical-Debt: Consider splitting search.wikimedia.org out of ops/mediawiki-config into separate service - https://phabricator.wikimedia.org/T289224 (Gehel)
[07:19:27] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm) a:JMeybohm
[08:42:35] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm)
[08:43:04] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm)
[08:47:20] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm)
[09:08:05] serviceops, Prod-Kubernetes, Kubernetes, Patch-For-Review: Enable the Priority admission plugin - https://phabricator.wikimedia.org/T289131 (JMeybohm) After enabling the admission plugin in staging-codfw I and deleting Pods that do define a priorityClass, the priority is added correctly: ` # kub...
[09:31:52] serviceops, observability, GitLab (Initialization), Patch-For-Review: Define monitoring for gitlab - https://phabricator.wikimedia.org/T275170 (Jelto) I would like to either finish this task or add additional requirements. Currently we are collecting metrics of all GitLab components on `gitlab100...
[10:57:32] jayme: per your suggestion, I'll try to replicate the issue with Flink on staging
[10:57:34] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713830
[11:51:47] serviceops, SRE: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (fgiunchedi) Two cents re: metrics/alerting, we have the prometheus pushgateway available which seems like a good fit (more info: https://wikitech.wikimedia.org/wiki/Prometheus#Ephemeral_jobs_(Pushgateway))
[13:49:55] ok, I reproduced the issue on staging - cannot connect on port 6123 to current (which I verified as well) job manager on port
[13:49:57] 6123
[13:57:16] ah, only for the moment at least, after that there's a flurry of other issues, maybe related to one pod that doesn't seem to work there
[14:05:46] serviceops, Patch-For-Review, User-jijiki: Productionise mc10[37-54].eqiad.wmnet - https://phabricator.wikimedia.org/T278225 (jijiki)
[14:36:49] zpapierski: did you open a task yesterday?
[14:49:38] nope, I did not, I have a running task for that I mentioned yesterday, I planned to add data there today after reproducing on staging, but that proved to be difficult
[14:50:24] I just repeated that on codfw, this time also recording the state of pods (which I was missing yesterday), and I'm going to add comments based on that
[14:58:56] zpapierski: I uploaded another patch to switch logging back to INFO
[14:59:08] I am waiting for all clusters to sync
[14:59:13] I was planning to do the same thing
[14:59:14] thanks
[15:00:21] sync on staging is broken I think - one pod, specific to that env is in constant state of error
[15:01:16] I am waiting for eqiad to sync, but it takes forever, I suspect it will time out
[15:02:32] huh, didn't have any issues with that
[15:08:03] ok let's wait
[15:27:27] zpapierski: we can't have 3 taskmanagers on the staging cluster
[15:27:54] so I will have to revert that back to 1
[15:29:32] ah, that probably explains my issues with reproducing what we have in prod
[15:29:39] why can't we?
[15:31:02] because its resources are limited
[15:31:15] I am merging https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713891
[15:31:15] I see
[16:11:54] zpapierski: everything is back to how it was, I think you should consider using a value for logging level
[16:12:01] serviceops, MW-on-K8s, SRE, Patch-For-Review: Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Legoktm) a:Legoktm >>! In T288848#7293721, @Legoktm wrote: > One other consideration is whether we need to specifically route index.php and api.php re...
[16:12:13] so you won't have to bump the chart version
[16:12:26] I can change values without that?
[16:12:30] didn't know that
[16:13:15] is there anything I could do with log level that wouldn't require a CR process?
[16:18:18] no :p
[16:18:38] let's have a look tomorrow
[17:42:13] serviceops, SRE, Performance-Team (Radar), User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (jijiki)
[17:43:21] serviceops, SRE, Performance-Team (Radar), User-jijiki: Shrink redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (jijiki) As expected: {F34606863}
[18:27:37] serviceops, SRE, Patch-For-Review: Replace rdb1005, rdb1006 with rdb1011, rdb1012 - https://phabricator.wikimedia.org/T281217 (Krinkle)
[18:33:04] serviceops, Release Pipeline: Production buster-nodejs10-devel image has npm 5.x, which is not actually compatible with node 10.x - https://phabricator.wikimedia.org/T284112 (Jdforrester-WMF) Open→Resolved a:Jdforrester-WMF So this is effectively Resolved via {T284346}, which is a reasonable...
[18:58:17] serviceops, SRE, Performance-Team (Radar), User-jijiki: Reduce number of shards in redis_sessions cluster - https://phabricator.wikimedia.org/T280582 (jijiki)
[20:01:58] anyone around to take a look at https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713934 ? should be a no-op, switches the hardcoded `thorium.eqiad.wmnet` to use the equivalent `analytics-web` CNAME
[20:02:47] this will be a no-op and then when we cut over from thorium for phabricator.wikimedia.org/T285355 (probably next monday) we'll need to change where the analytics-web CNAME is pointing
[20:24:57] ryankemper: seems fine, have you checked that the new host is listed in the egress config?
[20:26:15] legoktm: where's the egress config live?
[20:26:59] same file, later down
[20:27:00] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/deployment-charts/+/refs/heads/master/helmfile.d/services/linkrecommendation/values.yaml#88
[20:28:33] I would suggest listing all the possible analytics-web hosts there so that way you can change the CNAME without needing to update the egress each time
[20:28:51] like how all the various dbproxies are listed
[20:30:04] legoktm: yeah my understanding is the analytics web host is always going to be a single backend (well, always for the foreseeable future)
[20:30:24] so yeah i'll want to add an entry for `an-web1001` at this point, and then after a successful cutover when we bring thorium out of service we can remove the egress then
[20:30:35] :thumbsup:
[20:32:38] I codesearched and don't see any other references to thorium in that repo
[20:34:23] legoktm: thanks. `an-web1001` egress changes up now: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/713934
[20:37:49] ryankemper: lgtm, +1'd. I edited the commit message a bit too
[20:38:02] ty
[23:36:17] serviceops, SRE: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) Thanks for the pointer! I think if we wanted to track metrics from each run, like request latency or number of passing assertions or something, pushgateway would be the tool for the job -- but I think we don't...