[05:18:27] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Marostegui) @fgiunchedi we'd need to coordinate this in a way as this would arrive to all hosts as soon as puppet runs. My idea would be...
[06:08:34] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) > ...given there could be several dozens of such very small services I [[https://logstash.wikimedia.org/goto/9f46bba4ed0d64bf14926cdb13d53561|searched t...
[07:10:48] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7245477, @Legoktm wrote: >> ...given there could be several dozens of such very small services > > I [[https://logstash.wikimedia.org/goto/9f...
[07:21:45] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) >>! In T261277#7245525, @Joe wrote: > I looked at the query you linked, and I don't think you should exclude the `scripts/` directory, or am I missing so...
[07:35:31] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7245532, @Legoktm wrote: >>>! In T261277#7245525, @Joe wrote: >> I looked at the query you linked, and I don't think you should exclude the `s...
[07:39:05] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) >>! In T261277#7245543, @Joe wrote: >>>! In T261277#7245532, @Legoktm wrote: >> Only Score shells out to paths with scripts/ AFAIS. Here's the query with...
[07:48:57] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) I love "shellboxes out". Thanks I remember we discussed managing scripts explicitly when we introduced shellbox, so I was wondering why they were executed lo...
[08:37:32] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10LSobanski) Adding @nskaggs and @Bstorm for visibility.
[08:37:48] 10serviceops, 10DynamicPageList (Wikimedia), 10PoolCounter, 10SRE, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10jcrespo)
[08:54:53] 10serviceops, 10Release Pipeline, 10Services, 10Patch-For-Review: Provide a node 12 production image (based on bullseye?) - https://phabricator.wikimedia.org/T284346 (10Jdforrester-WMF)
[09:02:25] _joe_: any objections if I start working on setting up the shellbox constraints deployment tomorrow using LVS?
[09:07:20] <_joe_> legoktm: I think it's wasted effort on your part tbh
[09:07:37] <_joe_> but ok, go on :)
[09:07:55] <_joe_> btw, take a look at my last few commits in ops/private
[09:08:04] <_joe_> not now, now go to bed
[09:08:04] It mostly means we can have it running in the next few days instead of waiting a while
[09:08:22] I already shut down my laptop so I'll see tomorrow :)
[09:08:41] <_joe_> yeah I am aware, btw for that deployment we probably want the vanilla shellbox image, and not the score one :)
[09:08:46] <_joe_> tty tomorrow!
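The roll-restart request in T287574 at the top of the log is exactly this kind of coordination problem: once the change is merged, puppet delivers the new haproxy configuration to every host on its next run, so the restarts themselves need to be rolled in small, verified batches. Below is a minimal Python sketch of that batching idea; the host names, health probe and timings are invented for illustration, and in practice this would be driven by the usual cumin/cookbook automation rather than ad-hoc ssh.

#!/usr/bin/env python3
"""Illustrative sketch only: restart haproxy in small batches, checking
health between batches so a bad config stops the rollout early."""
import subprocess
import time

import requests

HOSTS = ["dbproxy1001.example.net", "dbproxy1002.example.net"]  # hypothetical host list
BATCH_SIZE = 1        # restart one proxy at a time
SETTLE_SECONDS = 30   # give traffic time to drain and return


def restart(host: str) -> None:
    # Restart haproxy on a single host over ssh.
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "haproxy"], check=True)


def healthy(host: str) -> bool:
    # Hypothetical health probe; a real check might query the haproxy stats endpoint instead.
    try:
        return requests.get(f"http://{host}:9090/healthz", timeout=5).ok
    except requests.RequestException:
        return False


for i in range(0, len(HOSTS), BATCH_SIZE):
    batch = HOSTS[i:i + BATCH_SIZE]
    for host in batch:
        restart(host)
    time.sleep(SETTLE_SECONDS)
    if not all(healthy(h) for h in batch):
        raise SystemExit(f"batch {batch} failed its health check, stopping the roll-restart")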
[09:51:49] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi) >>! In T287574#7245440, @Marostegui wrote: > @fgiunchedi we'd need to coordinate this in a way as this would arrive to all h...
[10:05:09] hi, we're investigating increased latencies in cirrus indexing jobs since the switch to eqiad, I see weird job processing rates (jun 28 -> jul 2) https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=66&orgId=1&from=1624649443213&to=1625722004734&var-dc=codfw%20prometheus%2Fk8s
[10:05:27] s/the switch to eqiad/the switch to codfw/
[10:06:28] do we know what caused the processing rate to be so low during this period (switch -> jul 2) and why it suddenly went back to normal?
[10:51:15] <_joe_> looks like RecordLintJob suddenly reappeared
[10:52:03] <_joe_> the numbers were similar in eqiad before the switch
[10:52:10] <_joe_> so i don't see anything strange there
[10:53:43] <_joe_> also remember, this is related to the changeprop instance where jobs are being processed, not where the jobrunners are located
[10:54:41] <_joe_> https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=66&orgId=1&from=1624649443213&to=1625722004734&var-dc=eqiad%20prometheus%2Fk8s shows the rate consumed by changeprop in eqiad
[11:14:52] ok, cirrusElasticWrite also bumps from 185 to 280 at the same time
[11:15:50] if I want to rule out changeprop reaching its max cap, would there be a graph that shows this?
[11:18:57] I see that mem is quite high here https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=69&orgId=1&from=1624907857651&to=1627553846543&var-dc=codfw%20prometheus%2Fk8s same period (switch -> jul 2)
[11:21:25] esp. working set: changeprop-production
[11:48:17] <_joe_> we did perform a rolling restart at some point, if memory doesn't fail me - you might check SAL (sorry, on the phone @lunch)
[12:19:28] survey - is it ok to drop k8s 1.12 support from deployment-charts? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708725
[12:19:40] (more specifically, the kubeyaml checks in CI)
[13:17:56] <_joe_> elukey: I would say yes
[13:26:43] FYI, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/702117 in a few minutes (initially on conf1004 to confirm if everything is fine)
[13:44:38] <_joe_> moritzm: ack, I'm around if you need assistance with testing
[13:45:53] it's enabled on conf1004 (and currently being enabled on 1005), and so far it seems all good to me, will proceed with 2004 in a bit
[13:51:07] _joe_: actually I'm seeing pybal errors in Icinga for lvs1014/1015/1016, I'll go revert for now
[13:51:20] <_joe_> wait
[13:51:24] ok
[13:51:39] or do I need to bounce it to reconnect or so?
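Picking the jobqueue question back up: the Grafana panels above are backed by Prometheus, so the same processing-rate numbers can be pulled programmatically when comparing the codfw and eqiad changeprop instances. A small sketch follows; the Prometheus base URL is a placeholder and the metric and label names are assumed from the Grafana explore link pasted a bit further down, so both may need adjusting.

"""Sketch: query the changeprop job processing rate from the Prometheus
HTTP API for the window under discussion (switchover -> jul 2)."""
import requests

PROM = "https://prometheus.example.org/k8s"  # placeholder for the k8s Prometheus endpoint
QUERY = (
    'sum(irate(cpjobqueue_normal_rule_processing_count'
    '{rule=~".*cirrusSearchElasticaWrite.*"}[5m]))'
)

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": "2021-06-28T00:00:00Z",
        "end": "2021-07-02T00:00:00Z",
        "step": "5m",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    # Each entry is the label set plus a list of [timestamp, "value"] pairs.
    print(series["metric"], series["values"][:3], "...")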
[13:51:56] <_joe_> I think we need to restart pybal yes
[13:52:32] <_joe_> let me try on 1016
[13:52:43] ack
[13:53:30] icinga for conf1004/1005 looks all fine, so nginx feature-wise it seems all fine
[13:54:03] <_joe_> yeah you might need to restart pybals that are connected to a specific etcd
[13:56:42] re: changeprop: so I don't see a restart but I see something weird at the time the processing rate bumped:
[13:56:44] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221624924800000%22,%221625615999000%22,%22codfw%20prometheus%2Fk8s%22,%7B%22expr%22:%22irate(cpjobqueue_normal_rule_processing_count%7Brule%3D%5C%22cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite%5C%22%7D%5B5m%5D)%22,%22requestId%22:%22Q-65e8a73d-96da-477c-ad41-b675de345c2d-0A%22%7D%5D
[13:58:45] <_joe_> dcausse: I remember we did something re: kafka after the switchover because we were having some throughput issues, and I don't remember if we rebalanced topics or just restarted cpjobqueue
[13:59:00] <_joe_> but I think you're chasing a ghost atm, I don't think the issue is cpjobqueue
[13:59:21] <_joe_> and what you show there is exactly a restart of the pods
[13:59:36] <_joe_> we have new pods in place of a couple that were performing very badly
[13:59:41] <_joe_> I guess we just removed them
[13:59:43] Hi, we would like to increase the staging replicas for tegola vector tiles because we are benchmarking it with some production traffic from maps. Any thoughts/objections to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708768 ?
[14:00:10] <_joe_> nemo-yiannis: uh why benchmarking in staging? that's a couple of underpowered VMs
[14:00:23] <_joe_> you should point production traffic to the eqiad/codfw clusters instead
[14:00:27] because we only have staging at the moment
[14:00:45] <_joe_> well then you need to deploy to the main clusters before doing proper benchmarking
[14:01:23] <_joe_> more importantly, staging can be turned off for maintenance if we need to, with the assumption that no production traffic goes there
[14:02:08] we are not sending production traffic, we are mirroring some requests to see how tegola responds
[14:03:13] <_joe_> oh I see
[14:03:24] <_joe_> well, I don't think there is much space on those VMs
[14:03:47] <_joe_> so it's very possible your attempt will fail to allocate all pods
[14:04:00] ok, got it. should be fine to wait, we already have a better understanding even with those 3 pods.
[14:04:04] <_joe_> anyways, no opposition if it's temporary, but your benchmarks will be 20-30% off
[14:04:17] <_joe_> I can tell you from experience of running mediawiki there
[14:04:51] <_joe_> I was getting horrible latencies in my benchmarks, then just moved to *pods of the same size* but in the prod cluster, and the latency went back to what I expected
[14:06:15] <_joe_> nemo-yiannis: actually, I think I can help you, I should remove mwdebug from staging, it's a pretty large pod and I don't really need it to be there anymore
[14:06:34] <_joe_> so, bump to 6 pods, if it fails, let me know :)
[14:07:00] awesome, thanks!
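If the bump to 6 replicas does run into the staging capacity limits _joe_ mentions, the clearest signal is pods stuck in Pending with an Unschedulable condition. Here is a short sketch using the Python kubernetes client to check that after deploying; the namespace and label selector are guesses for the example, and the kubeconfig is assumed to point at the staging cluster.

"""Sketch: list tegola pods in staging and flag any that failed to schedule."""
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig for the staging cluster
v1 = client.CoreV1Api()

# Hypothetical namespace/selector; the real chart may label pods differently.
pods = v1.list_namespaced_pod(
    namespace="tegola-vector-tiles",
    label_selector="app=tegola-vector-tiles",
)

for pod in pods.items:
    unschedulable = [
        c.message
        for c in (pod.status.conditions or [])
        if c.reason == "Unschedulable"
    ]
    print(pod.metadata.name, pod.status.phase, unschedulable or "")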
[14:07:32] <_joe_> but again, be careful with the results, it's possible things will be significantly better on the production cluster
[14:12:56] _joe_: ok, for context the problem is a topic that is frequently backlogged since the switch ( https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=main-codfw&var-topic=codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite&var-consumer_group=All&from=now-30d&to=now )
[14:13:23] <_joe_> so the huge initial backlog was due to those failing pods
[14:13:48] I'm looking into 1/ more messages generally being produced to it after the switch or 2/ the jobqueue being overloaded
[14:14:08] <_joe_> not sure about the rest, but I'm also alone from my team today, and hugh from PET is not around either so I don't have much spare time to investigate
[14:14:26] <_joe_> I would suspect a kafka server with that topic being slightly overloaded could explain the behaviour too
[14:14:38] sure, np, thanks for responding already :)
[14:44:31] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) >>! In T279309#7242742, @Dzahn wrote: > @wiki_willy Here it would be great for us if next someone could finish the setup of mw1447 through mw1450 and take a...
[15:58:58] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team, 10SRE, 10Patch-For-Review: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) I uploaded a very simplistic script that could be used as a systemd timer, or invoked by...
[16:00:34] 10serviceops, 10MW-on-K8s, 10SRE: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10Joe)
[16:25:32] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Andrew) Restarting haproxies in wmcs is fairly harmless, just ping when it's time.
[17:22:22] did something change mariadb authentication? first i got an error that the testreduce nodejs client doesn't support authentication mechanisms for mariadb ... i upgraded the library and now i get an 'ER_ACCESS_DENIED_NO_PASSWORD_ERROR' ... this is on testreduce1001.eqiad.wmnet
[17:26:07] <_joe_> subbu: this is probably not the right channel to ask, but it's possible some database in eqiad is under maintenance
[17:26:20] <_joe_> but to your question: nothing I'm aware of changed
[17:26:34] ok .. what is the dba channel?
[17:27:05] oh .. actually this is not a production db.
[17:27:50] it is a local db on testreduce1001 .. so i wonder if some puppet patch accidentally modified the password.
[17:28:42] db puppetization doesn't tend to modify things automatically
[17:31:22] hmm .. i cannot connect with the mysql command-line client either now ("Access denied for user 'testreduce'@'localhost'") .. so something changed on that server.
[17:32:24] https://sal.toolforge.org/log/aCCp8XoB1jz_IcWufNo8 maybe?
[18:35:20] <_joe_> yeah that seems likely
[18:35:28] <_joe_> subbu: where can I find the credentials on disk?
[21:38:23] 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review, 10User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10dancy) @Joe Please try docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2021-07-29-2046...
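On the testreduce credential mystery above: a quick way to separate "the password on disk no longer matches what the database expects" from "the client library cannot handle the server's auth plugin" is to retry the same credentials from a throwaway script. A sketch with PyMySQL follows, using the user and host from the log; the database name is assumed, and the password is read from an environment variable as a stand-in for the service's config file on testreduce1001.

"""Sketch: verify the testreduce DB credentials independently of the nodejs client."""
import os

import pymysql

try:
    conn = pymysql.connect(
        host="localhost",
        user="testreduce",
        password=os.environ.get("TESTREDUCE_DB_PASS", ""),
        database="testreduce",  # assumed database name
        connect_timeout=5,
    )
except pymysql.err.OperationalError as exc:
    # An access-denied error here means the credentials themselves are wrong;
    # an auth-plugin error instead would point back at the client library.
    print("connection failed:", exc)
else:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print("credentials are fine; server:", cur.fetchone()[0])
    conn.close()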
[23:05:11] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jclark-ctr) @Dzahn Thanks! Putting the rest in A will speed up racking; currently I was waiting on rack C