[05:16:52] serviceops, MW-on-K8s, Performance-Team, SRE, Traffic: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (jijiki)
[06:17:23] serviceops, MW-on-K8s, SRE, MW-1.37-notes (1.37.0-wmf.20; 2021-08-23): Make HTTP calls work within mediawiki on kubernetes - https://phabricator.wikimedia.org/T288848 (Joe) >>! In T288848#7357474, @Legoktm wrote: > We could teach MediaWiki how to use a transparent proxy instead, I'll poke at that...
[07:06:08] good morning folks, any feedback about the desired service state in
[07:06:11] https://gerrit.wikimedia.org/r/c/operations/puppet/+/721244/1/modules/profile/files/configmaster/disc_desired_state.py ?
[07:06:22] (I am going through the icinga warnings/critical)
[07:08:35] elukey: sorry, that was supposed to be on my todo list post-switchover
[07:09:12] legoktm: nono please it is fine, all you folks did an amazing job, this is more a little cleanup, can happen anytime
[07:09:26] also it is a little late for you, please don't check that code review now :D
[07:09:53] too late :p
[07:13:13] left 2 comments, otherwise looked correct to me
[07:14:29] legoktm: I see, weird that eventgate-main is not pooled in codfw
[07:14:45] it should be in theory
[07:14:48] it's because of https://phabricator.wikimedia.org/T285710
[07:15:29] the wdqs updater currently depends on the topics coming from a topic like "codfw.mediawiki.revision-create"
[07:16:07] and if it's pooled in both DCs the topics will come from both codfw and eqiad, which throws it off...AIUI
[07:16:55] ack, but in theory the "desired state" that we want is pooled for eventgate-main, even if there is this blocker (so a warning will stay there, hopefully fixed when the wdqs updater is fixed as well)
[07:17:30] same for swift, IIRC it was not pooled in both DCs, is there anything ongoing?
[07:17:35] (I was a bit puzzled about it)
[07:18:32] yeah, swift is getting new hardware and currently rebalancing: https://phabricator.wikimedia.org/T288458
[07:18:48] for eventgate I can leave codfw=false with the task's link as comment
[07:18:56] same thing for swift, ack :)
[07:20:09] sounds good to me
[07:23:29] so swift + swift-ro are codfw only, swift-rw is eqiad only
[07:24:50] ack legoktm I'll keep working with EU folks, please enjoy your time off :)
[07:26:55] jayme: good morning :D I am checking dns discovery for 'helm-charts' and 'docker-registry', they are both served by codfw only. Is it their desired state?
[07:37:45] elukey: yes for docker-registry unfortunately, but helm-charts should be active/active
[07:40:03] jayme: is there a comment that I can add to docker-registry? (so that people know its state etc.. even a task)
[07:44:21] elukey: the reason for that is replication lag of swift
[07:45:03] ah perfect, will add a comment - do we want to pool eqiad for helm-charts?
[07:45:24] yeah, absolutely.
[07:45:51] super
[07:46:00] doing it now
[07:46:04] thanks
[07:48:25] done!
[08:05:41] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (MoritzMuehlenhoff) scandium has been upgraded. If tests are fine, I'd upload to apt.wikimedia.org
[09:00:59] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (akosiaris) >>! In T290445#7355990, @akosiaris wrote: > And this doesn't add up. In grafana https://grafana.wikimedia.org/d/lxZAdAdMk/wikifeeds?viewPanel=15&orgId=1&f...
[09:04:25] _joe_: quick question regarding envoy logging. Are we logging all failed requests or are we sampling?
[09:10:32] we should be logging everything that is >=500 status code from what I see in the config, which makes the discrepancy I am seeing weird. Maybe it's operator (that's me) error though
[09:10:48] <_joe_> all requests which resulted in a status code 500 or more
[09:11:00] <_joe_> what are you seeing?
[09:12:11] <_joe_> oh wait
[09:12:41] I see wikifeeds having logged 30M+ times things like "upstream connect error or disconnect/reset before headers. reset reason: connection failure" or "upstream connect error or disconnect/reset before headers. reset reason: overflow" and at the same time the pod envoy having logged just 1.3M events
[09:12:59] <_joe_> logged where?
[09:13:03] logstash
[09:13:14] <_joe_> so wait
[09:13:26] <_joe_> that message (upstream connect error) is from envoy
[09:13:34] <_joe_> which envoys are we talking about
[09:13:53] <_joe_> usually you see that error on the downstream envoy, and not on the upstream one
[09:14:08] <_joe_> if the problem is the usual persistent connection closure without acknowledgement
[09:15:07] it's wikifeeds itself (the nodejs app) that is logging that.
[09:15:28] <_joe_> that is a message that comes from envoy
[09:15:31] so I am assuming the pod envoy is sending it (and not passing along the error from the upstream one)
[09:15:45] upstream one being the API envoy in this case.
[09:15:45] <_joe_> not necessarily
[09:16:07] <_joe_> it might be the api sending that if the issue was an unresponsive backend
[09:16:15] but even if it was the case, that does not explain the discrepancy of 30+M entries
[09:16:17] <_joe_> but yes, typically it should be the local envoy
[09:16:20] <_joe_> yes
[10:08:50] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe) The set of patches above should allow us to get wmerrors working; we can work on moving php7-fatal-error.php to mediawiki-config separately.
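(editor's note on the 09:04-09:16 envoy thread) The comparison being made there, 30M+ app-side messages versus 1.3M envoy-side events for responses with status >= 500, can be sketched roughly as below. This is only an illustration: the record fields "source" and "status" are hypothetical names chosen for the example, not the actual logstash schema, and the sample data is made up.

```python
from collections import Counter
from typing import Dict, Iterable


def count_server_errors(records: Iterable[Dict]) -> Counter:
    """Count records with an HTTP status of 500 or more, grouped by emitter.

    "source" and "status" are hypothetical field names used for illustration.
    """
    counts: Counter = Counter()
    for record in records:
        if int(record.get("status", 0)) >= 500:
            counts[record.get("source", "unknown")] += 1
    return counts


# Made-up sample: two app-side errors, one envoy-side error, one success.
sample = [
    {"source": "wikifeeds", "status": 503},
    {"source": "wikifeeds", "status": 503},
    {"source": "envoy", "status": 503},
    {"source": "envoy", "status": 200},
]
counts = count_server_errors(sample)
```

If both emitters saw the same failed requests, the per-source counts would be expected to be close; a large gap like the one described in the chat points at retries, a different envoy emitting the message, or missing log events.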
[10:36:09] serviceops, MediaWiki-General, SRE, observability, and 2 others: MediaWiki Prometheus support - https://phabricator.wikimedia.org/T240685 (Addshore) We would consider using this in some upcoming work if it gets merged
[12:53:44] serviceops, Maps, Patch-For-Review, User-jijiki: Deploy tegola-vector-tiles to kubernetes - https://phabricator.wikimedia.org/T283159 (MSantos) p:Triage→High
[13:02:09] serviceops, Maps, Product-Infrastructure-Team-Backlog, User-jijiki: Maps 2.0 roll-out plan - https://phabricator.wikimedia.org/T280767 (MSantos)
[13:13:16] jayme, akosiaris - qq about this past change https://gerrit.wikimedia.org/r/c/operations/puppet/+/677228 - for knative and kfserving (both in admin_ng) I am using the service secrets hiera config to deploy private tls data to deploy1002. While doing the last refactoring to split service secrets between main and ml-serve I realized that it would be nicer to get back the "admin" part. Would it
[13:13:22] be ok or is it something that was removed for good?
[13:15:20] it was "limited" for sure as an approach, which is why it was removed
[13:15:43] the .hfenv files should definitely NOT be resurrected
[13:16:01] nono sorry I meant adding the admin_services hiera config to profile::kubernetes::deployment_server::helmfile
[13:16:08] but you care more about the scvname == 'admin' thing, right ?
[13:16:35] yeah exactly, I'd need something like /etc/helmfile/private/admin/blabla
[13:17:42] for what though ?
[13:17:53] I see /etc/helmfile-defaults/private/knative-serving already
[13:18:00] and kubeflow-kfserving in there
[13:18:28] I mean, do those cover your needs?
[13:19:19] akosiaris: for the moment yes, but I am working on https://gerrit.wikimedia.org/r/c/operations/puppet/+/720048, that may add some subdirs to the private dir
[13:19:45] Janis is supporting my madness, the goal is to find a clean way to split between ml-services and services
[13:20:05] but if we do it, then it would be nice to have an "admin" dir too in my opinion
[13:20:45] or I am open to other roads, don't have a strong opinion
[13:20:56] I'd just need to deploy ml-services via helmfile :D
[13:21:06] (without messing up the actual status too much)
[13:22:11] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (ssastry) Thanks! I've started tests now. Will have results in about 10 hours.
[13:25:04] * elukey can see Alex's love and tolerance for a lovely team like the ML one increasing by the minute
[13:30:42] serviceops, MW-on-K8s, SRE, Patch-For-Review, Performance-Team (Radar): Benchmark performance of MediaWiki on k8s - https://phabricator.wikimedia.org/T280497 (jijiki) Last set of benchmarks of Round 1, we added a run with 6 pods x 8 workers: https://people.wikimedia.org/~jiji/benchmarks-bare...
[13:49:39] elukey: I don't have a strong opinion on that. IIRC the admin part was unfortunately a lot of duplicate/very similar code as the services stuff
[13:56:20] serviceops, Scap, Release-Engineering-Team (Doing): Deploy Scap version 4.0.0 - https://phabricator.wikimedia.org/T291095 (hashar)
[14:01:32] jayme: ack, if I can find a way to avoid duplication it may be a quick and painless thing to add
[14:04:38] serviceops, SRE, Wikimedia-production-error: PHP7 corruption reports in 2020-2021 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (Krinkle)
[14:08:44] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-9), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona)
[14:09:13] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-9), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona) @WDoranWMF I'd ask you to please wait for the deployment until T290731 is resolved and a n...
[14:16:10] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe) Coming to logstash: right now on bare metal we relay the logs to rsyslogd talking to it via TCP on localhost. This is not possible on kubern...
[14:16:51] switched mwmaint.discovery.wmnet (https://noc.wikimedia.org) from codfw to eqiad to go with the DC switch
[14:17:11] unblocked reimaging mwmaint2002 with buster, unblocks the "migrate appservers to buster" tickets
[14:20:55] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Daimona) >>! In T271736#7337025, @Reedy wrote: > I guess running some PHP 8 esque code scanners over the branch of ruflin/Elastica and elasticsearch/elasticsearch-php should help confirm easily eno...
[14:26:01] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe)
[14:26:15] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe)
[14:26:30] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe)
[14:26:48] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Joe)
[14:28:23] serviceops, Performance-Team, Developer Productivity: Update php-wmerrors page to include request ID - https://phabricator.wikimedia.org/T291192 (Krinkle)
[14:29:25] jayme: I added a proposal to the puppet change to include also an "admin_ng_services" directory
[14:29:43] no code repetition, just a simple hiera addition
[14:29:58] in theory it should allow a very clean separation
[14:32:46] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (dancy)
[14:38:46] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) Stalled→In progress
[14:38:54] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Dzahn)
[14:39:16] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) Stalled→In progress
[14:39:24] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[14:50:16] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Daimona) I ran phan 5.2.1 on ruflin/Elastica 6.1.5, using PHP 8.0.1. There are quite a lot of issues, mostly type mismatches (the project doesn't use any static analyzer to validate types). The onl...
[15:05:52] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (Krinkle) There is also monolog [ErrorLogHandler](https://github.com/Seldaek/monolog/blob/2.3.4/src/Monolog/Handler/ErrorLogHandler.php) which mi...
[15:19:06] ok so the current service state diff is
[15:19:06] The following objects are not in their desired state: {'eqiad': {'pooled': True, 'references': [], 'ttl': 300}, 'tags': 'dnsdisc=swift'} {'eqiad': {'pooled': True, 'references': [], 'ttl': 300}, 'tags': 'dnsdisc=swift-ro'}
[15:19:32] but this is true since we are rebalancing, it should be fixed when we'll pool eqiad (IIUC)
[15:19:33] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[15:19:52] Cc: legoktm: --^
[15:20:11] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) https://noc.wikimedia.org (mwmaint.discovery.wmnet) has been switched from codfw to eqiad. mwmaint2002 has been upgraded to buster. monitoring all green.
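(editor's note on the 15:19:06 service state diff) A desired-state check of that kind could look roughly like the sketch below. It is not the code of disc_desired_state.py, which is not quoted in this log. The object shape is copied from the diff above; the function name, the `actual` mapping, and the helm-charts entry are hypothetical, and the sketch assumes the printed entries are the desired ones, which is how the 15:19:32 follow-up reads.

```python
from typing import Dict, List, Tuple

# Hypothetical sketch; NOT the code of disc_desired_state.py.
# Desired objects use the shape printed in the diff at 15:19:06.
DiscObject = Dict[str, object]


def not_in_desired_state(
    desired: List[DiscObject],
    actual: Dict[Tuple[str, str], bool],
) -> List[DiscObject]:
    """Return desired objects whose pooled flag differs from the live one.

    `actual` maps (record name, datacenter) to the live pooled flag; this
    lookup structure is an assumption made for the example.
    """
    drifted = []
    for obj in desired:
        # 'tags' looks like 'dnsdisc=swift'; keep the part after '='.
        record = str(obj["tags"]).split("=", 1)[1]
        for dc, attrs in obj.items():
            if dc == "tags":
                continue
            if actual.get((record, dc)) != attrs["pooled"]:
                drifted.append(obj)
                break
    return drifted


desired = [
    {"eqiad": {"pooled": True, "references": [], "ttl": 300}, "tags": "dnsdisc=swift"},
    {"eqiad": {"pooled": True, "references": [], "ttl": 300}, "tags": "dnsdisc=swift-ro"},
    {"eqiad": {"pooled": True, "references": [], "ttl": 300}, "tags": "dnsdisc=helm-charts"},
]
# Illustrative live state: swift and swift-ro depooled in eqiad while rebalancing.
actual = {
    ("swift", "eqiad"): False,
    ("swift-ro", "eqiad"): False,
    ("helm-charts", "eqiad"): True,
}
drift = not_in_desired_state(desired, actual)
```

With this made-up input only the swift and swift-ro entries are reported, since helm-charts in eqiad matches its desired pooled flag.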
[15:20:31] serviceops, SRE, Patch-For-Review: upgrade mwmaint servers to buster - https://phabricator.wikimedia.org/T267607 (Dzahn) In progress→Resolved
[15:20:55] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn)
[15:22:38] serviceops, SRE, Patch-For-Review, Release-Engineering-Team (Radar): Upgrade MediaWiki clusters to Debian Buster (debian 10) - https://phabricator.wikimedia.org/T245757 (Dzahn) In progress→Resolved https://noc.wikimedia.org (mwmaint.discovery.wmnet) has been switched from codfw to eqiad....
[15:22:46] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Dzahn)
[15:25:10] serviceops, MW-on-K8s, SRE, SRE Observability, Patch-For-Review: Make logging work for mediawiki in k8s - https://phabricator.wikimedia.org/T288851 (akosiaris) > via TCP on localhost. UDP not TCP, (I am just being pedantic, I know).
[15:30:38] serviceops, Prod-Kubernetes, Kubernetes: kube-apiserver need to reach webhooks running inside of the cluster - https://phabricator.wikimedia.org/T290967 (akosiaris)
[15:39:09] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Reedy) On the `Match` class... The comment in T268861#7359008 is relevant. https://github.com/ruflin/Elastica/blob/6.1.5/composer.json#L15 does say `require` ` "php": "^7.0",`, and the subclassing...
[15:44:57] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (dancy)
[15:46:48] serviceops, MW-on-K8s, Performance-Team, Release-Engineering-Team, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (dancy)
[16:00:07] elukey: perfect, thanks :)
[16:02:35] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Daimona) M2C. From what I'm seeing at #php_8.0_support (issues with external dependencies, tests still broken, unforeseen issues, etc.), it seems like the migration to PHP 8 won't be painless. I pe...
[16:14:39] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Reedy) Stalled→Open As {T245757} is done, now is the time to look at moving this forward. I guess I can mark it open again! 7.3 is in buster, but as we package our own version anyway (for...
[16:14:41] serviceops, PHP 7.2 support, Patch-For-Review: Drop PHP 7.2 support from MediaWiki master branch, once Wikimedia production is on 7.3 - https://phabricator.wikimedia.org/T261872 (Reedy)
[16:21:46] serviceops: Migrate WMF Production from PHP 7.2 to a newer version - https://phabricator.wikimedia.org/T271736 (Ladsgroup) I guess this needs to be addressed too {T291127} but someone needs to actually do it
[16:31:36] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-9), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (WDoranWMF) @Daimona can you let @hnowlan know once everything is ready to be deployed and please update...
[16:33:54] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-9), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona) >>! In T285857#7359608, @WDoranWMF wrote: > @Daimona can you let @hnowlan know once everything...
[19:29:49] legoktm: [not service-ops specific] \o quick question about https://gerrit.wikimedia.org/r/c/operations/puppet/+/720102/comment/fd42f2d9_376da884/
[19:30:14] my confusion came from thinking that `ensure` would always be sent to present here: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/systemd/manifests/timer/job.pp#118
[19:30:49] but it looks like really `present` is just what that gets set to if not otherwise set? and if so is that just a result of puppet's scoping rules or something else?
[19:31:16] s/always be sent to present/always be set to present
[19:47:02] * legoktm looks
[19:47:17] ryankemper: yeah, present is just the default value
[21:48:25] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-9), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (WDoranWMF) @ldelench_wmf @NRodriguez awesome, we'll do our best to get it done as soon as it is unbloc...
[23:28:47] serviceops, SRE: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (ssastry) Good to go!