[05:18:27] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Marostegui) @fgiunchedi we'd need to coordinate this in a way as this would arrive to all hosts as soon as puppet runs. My idea would be...
[06:08:34] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) > ...given there could be several dozens of such very small services I [[https://logstash.wikimedia.org/goto/9f46bba4ed0d64bf14926cdb13d53561|searched t...
[07:10:48] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7245477, @Legoktm wrote: >> ...given there could be several dozens of such very small services > > I [[https://logstash.wikimedia.org/goto/9f...
[07:21:45] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) >>! In T261277#7245525, @Joe wrote: > I looked at the query you linked, and I don't think you should exclude the `scripts/` directory, or am I missing so...
[07:35:31] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) >>! In T261277#7245532, @Legoktm wrote: >>>! In T261277#7245525, @Joe wrote: >> I looked at the query you linked, and I don't think you should exclude the `s...
[07:39:05] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Legoktm) >>! In T261277#7245543, @Joe wrote: >>>! In T261277#7245532, @Legoktm wrote: >> Only Score shells out to paths with scripts/ AFAIS. Here's the query with...
[07:48:57] 10serviceops, 10MW-on-K8s, 10SRE: Create a gateway in kubernetes for the execution of our "lambdas" - https://phabricator.wikimedia.org/T261277 (10Joe) I love "shellboxes out". Thanks I remember we discussed managing scripts explicitly when we introduced shellbox, so I was wondering why they were executed lo...
[08:37:32] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10LSobanski) Adding @nskaggs and @Bstorm for visibility.
[08:37:48] 10serviceops, 10DynamicPageList (Wikimedia), 10PoolCounter, 10SRE, and 9 others: Limit concurrency of DPL queries - https://phabricator.wikimedia.org/T263220 (10jcrespo)
[08:54:53] 10serviceops, 10Release Pipeline, 10Services, 10Patch-For-Review: Provide a node 12 production image (based on bullseye?) - https://phabricator.wikimedia.org/T284346 (10Jdforrester-WMF)
[09:02:25] _joe_: any objections if I start working on setting up the shellbox constraints deployment tomorrow using LVS?
[09:07:20] <_joe_> legoktm: I think it's wasted effort on your part tbh
[09:07:37] <_joe_> but ok, go on :)
[09:07:55] <_joe_> btw, take a look at my last few commits in ops/private
[09:08:04] <_joe_> not now, now go to bed
[09:08:04] It mostly means we can have it running in the next few days instead of waiting a while
[09:08:22] I already shut down my laptop so I'll see tomorrow :)
[09:08:41] <_joe_> yeah I am aware, btw for that deployment we probably want the vanilla shellbox image, and not the score one :)
[09:08:46] <_joe_> tty tomorrow!
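The roll-restart request in T287574 at the top of the log is exactly this kind of coordination problem: once the change is merged, puppet delivers the new haproxy configuration to every host on its next run, so the restarts themselves need to be rolled in small, verified batches. Below is a minimal Python sketch of that batching idea; the host names, health probe and timings are invented for illustration, and in practice this would be driven by the usual cumin/cookbook automation rather than ad-hoc ssh.

#!/usr/bin/env python3
"""Illustrative sketch only: restart haproxy in small batches, checking
health between batches so a bad config stops the rollout early."""
import subprocess
import time

import requests

HOSTS = ["dbproxy1001.example.net", "dbproxy1002.example.net"]  # hypothetical host list
BATCH_SIZE = 1        # restart one proxy at a time
SETTLE_SECONDS = 30   # give traffic time to drain and return


def restart(host: str) -> None:
    # Restart haproxy on a single host over ssh.
    subprocess.run(["ssh", host, "sudo", "systemctl", "restart", "haproxy"], check=True)


def healthy(host: str) -> bool:
    # Hypothetical health probe; a real check might query the haproxy stats endpoint instead.
    try:
        return requests.get(f"http://{host}:9090/healthz", timeout=5).ok
    except requests.RequestException:
        return False


for i in range(0, len(HOSTS), BATCH_SIZE):
    batch = HOSTS[i:i + BATCH_SIZE]
    for host in batch:
        restart(host)
    time.sleep(SETTLE_SECONDS)
    if not all(healthy(h) for h in batch):
        raise SystemExit(f"batch {batch} failed its health check, stopping the roll-restart")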
[09:51:49] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10fgiunchedi) >>! In T287574#7245440, @Marostegui wrote: > @fgiunchedi we'd need to coordinate this in a way as this would arrive to all h...
[10:05:09] hi, we're investigating increased latencies in cirrus indexing jobs since the switch to eqiad, I see weird job processing rates (jun 28 -> jul 2) https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=66&orgId=1&from=1624649443213&to=1625722004734&var-dc=codfw%20prometheus%2Fk8s
[10:05:27] s/the switch to eqiad/the switch to codfw/
[10:06:28] do we know what caused the processing rate to be so low during this period (switch -> jul 2) and why it suddenly went back to normal?
[10:51:15] <_joe_> looks like RecordLintJob suddenly reappeared
[10:52:03] <_joe_> the numbers were similar in eqiad before the switch
[10:52:10] <_joe_> so i don't see anything strange there
[10:53:43] <_joe_> also remember, this is related to the changeprop instance where jobs are being processed, not where the jobrunners are located
[10:54:41] <_joe_> https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=66&orgId=1&from=1624649443213&to=1625722004734&var-dc=eqiad%20prometheus%2Fk8s shows the rate consumed by changeprop in eqiad
[11:14:52] ok, cirrusElasticWrite also bumps from 185 to 280 at the same time
[11:15:50] if I want to rule out changeprop reaching its max cap, would there be a graph that shows this?
[11:18:57] I see that mem is quite high here https://grafana-rw.wikimedia.org/d/LSeAShkGz/jobqueue?viewPanel=69&orgId=1&from=1624907857651&to=1627553846543&var-dc=codfw%20prometheus%2Fk8s same period (switch -> jul 2)
[11:21:25] esp. working set: changeprop-production
[11:48:17] <_joe_> we did perform a rolling restart at some point, if memory doesn't fail me - you might check SAL (sorry, on the phone @lunch)
[12:19:28] survey - is it ok to drop k8s 1.12 support from deployment-charts? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708725
[12:19:40] (more specifically, the kubeyaml checks in CI)
[13:17:56] <_joe_> elukey: I would say yes
[13:26:43] FYI, I'll merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/702117 in a few minutes (initially on conf1004 to confirm if everything is fine)
[13:44:38] <_joe_> moritzm: ack, I'm around if you need assistance with testing
[13:45:53] it's enabled on conf1004 (and currently being enabled on 1005), and so far it seems all good to me, will proceed with 2004 in a bit
[13:51:07] _joe_: actually I'm seeing pybal errors in Icinga for lvs1014/1015/1016, I'll go revert for now
[13:51:20] <_joe_> wait
[13:51:24] ok
[13:51:39] or do I need to bounce it to reconnect or so?
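Picking the jobqueue question back up: the Grafana panels above are backed by Prometheus, so the same processing-rate numbers can be pulled programmatically when comparing the codfw and eqiad changeprop instances. A small sketch follows; the Prometheus base URL is a placeholder and the metric and label names are assumed from the Grafana explore link pasted a bit further down, so both may need adjusting.

"""Sketch: query the changeprop job processing rate from the Prometheus
HTTP API for the window under discussion (switchover -> jul 2)."""
import requests

PROM = "https://prometheus.example.org/k8s"  # placeholder for the k8s Prometheus endpoint
QUERY = (
    'sum(irate(cpjobqueue_normal_rule_processing_count'
    '{rule=~".*cirrusSearchElasticaWrite.*"}[5m]))'
)

resp = requests.get(
    f"{PROM}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": "2021-06-28T00:00:00Z",
        "end": "2021-07-02T00:00:00Z",
        "step": "5m",
    },
    timeout=30,
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    # Each entry is the label set plus a list of [timestamp, "value"] pairs.
    print(series["metric"], series["values"][:3], "...")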
[13:51:56] <_joe_> I think we need to restart pybal yes
[13:52:32] <_joe_> let me try on 1016
[13:52:43] ack
[13:53:30] icinga for conf1004/1005 looks all fine, so nginx feature-wise it seems all fine
[13:54:03] <_joe_> yeah you might need to restart pybals that are connected to a specific etcd
[13:56:42] re: changeprop: so I don't see a restart but I see something weird at the time the processing rate bumped:
[13:56:44] https://grafana-rw.wikimedia.org/explore?orgId=1&left=%5B%221624924800000%22,%221625615999000%22,%22codfw%20prometheus%2Fk8s%22,%7B%22expr%22:%22irate(cpjobqueue_normal_rule_processing_count%7Brule%3D%5C%22cirrusSearchElasticaWrite-cpjobqueue-partitioned-mediawiki-job-cirrusSearchElasticaWrite%5C%22%7D%5B5m%5D)%22,%22requestId%22:%22Q-65e8a73d-96da-477c-ad41-b675de345c2d-0A%22%7D%5D
[13:58:45] <_joe_> dcausse: I remember we did something re: kafka after the switchover because we were having some throughput issues, and I don't remember if we rebalanced topics or just restarted cpjobqueue
[13:59:00] <_joe_> but I think you're chasing a ghost atm, I don't think the issue is cpjobqueue
[13:59:21] <_joe_> and what you show there is exactly a restart of the pods
[13:59:36] <_joe_> we have new pods in place of a couple that were performing very badly
[13:59:41] <_joe_> I guess we just removed them
[13:59:43] Hi, we would like to increase the staging replicas for tegola vector tiles because we are benchmarking it with some production traffic from maps. Any thoughts/objections to https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/708768 ?
[14:00:10] <_joe_> nemo-yiannis: uh why benchmarking in staging? that's a couple of underpowered VMs
[14:00:23] <_joe_> you should point production traffic to the eqiad/codfw clusters instead
[14:00:27] because we only have staging at the moment
[14:00:45] <_joe_> well then you need to deploy to the main clusters before doing proper benchmarking
[14:01:23] <_joe_> more importantly, staging can be turned off for maintenance if we need to, with the assumption that no production traffic goes there
[14:02:08] we are not sending production traffic, we are mirroring some requests to see how tegola responds
[14:03:13] <_joe_> oh I see
[14:03:24] <_joe_> well, I don't think there is much space on those VMs
[14:03:47] <_joe_> so it's very possible your attempt will fail to allocate all pods
[14:04:00] ok, got it. should be fine to wait, we already have a better understanding even with those 3 pods.
[14:04:04] <_joe_> anyways, no opposition if it's temporary, but your benchmarks will be 20-30% off
[14:04:17] <_joe_> I can tell you from experience of running mediawiki there
[14:04:51] <_joe_> I was getting horrible latencies in my benchmarks, then just moved to *pods of the same size* but in the prod cluster, and the latency went back to what I expected
[14:06:15] <_joe_> nemo-yiannis: actually, I think I can help you, I should remove mwdebug from staging, it's a pretty large pod and I don't really need it to be there anymore
[14:06:34] <_joe_> so, bump to 6 pods, if it fails, let me know :)
[14:07:00] awesome, thanks!
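If the bump to 6 replicas does run into the staging capacity limits _joe_ mentions, the clearest signal is pods stuck in Pending with an Unschedulable condition. Here is a short sketch using the Python kubernetes client to check that after deploying; the namespace and label selector are guesses for the example, and the kubeconfig is assumed to point at the staging cluster.

"""Sketch: list tegola pods in staging and flag any that failed to schedule."""
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig for the staging cluster
v1 = client.CoreV1Api()

# Hypothetical namespace/selector; the real chart may label pods differently.
pods = v1.list_namespaced_pod(
    namespace="tegola-vector-tiles",
    label_selector="app=tegola-vector-tiles",
)

for pod in pods.items:
    unschedulable = [
        c.message
        for c in (pod.status.conditions or [])
        if c.reason == "Unschedulable"
    ]
    print(pod.metadata.name, pod.status.phase, unschedulable or "")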
[14:07:32] <_joe_> but again, be careful with the results, it's possible things will be significantly better on the production cluster
[14:12:56] _joe_: ok, for context the problem is a topic that is frequently backlogged since the switch ( https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cluster=main-codfw&var-topic=codfw.cpjobqueue.partitioned.mediawiki.job.cirrusSearchElasticaWrite&var-consumer_group=All&from=now-30d&to=now )
[14:13:23] <_joe_> so the huge initial backlog was due to those failing pods
[14:13:48] I'm looking into 1/ more messages generally being produced to it after the switch or 2/ the jobqueue being overloaded
[14:14:08] <_joe_> not sure about the rest, but I'm also alone from my team today, and hugh from PET is not around either so I don't have much spare time to investigate
[14:14:26] <_joe_> I would suspect a kafka server with that topic being slightly overloaded could explain the behaviour too
[14:14:38] sure, np, thanks for responding already :)
[14:44:31] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Dzahn) >>! In T279309#7242742, @Dzahn wrote: > @wiki_willy Here it would be great for us if next someone could finish the setup of mw1447 through mw1450 and take a...
[15:58:58] 10serviceops, 10MW-on-K8s, 10Release-Engineering-Team, 10SRE, 10Patch-For-Review: Ensure the code is deployed to mediawiki on k8s when it is deployed to production - https://phabricator.wikimedia.org/T287570 (10Joe) I uploaded a very simplistic script that could be used as a systemd timer, or invoked by...
[16:00:34] 10serviceops, 10MW-on-K8s, 10SRE: Make all httpbb tests pass on the mwdebug deployment. - https://phabricator.wikimedia.org/T285298 (10Joe)
[16:25:32] 10serviceops, 10DBA, 10User-fgiunchedi, 10cloud-services-team (Kanban): Roll restart haproxy to apply updated configuration - https://phabricator.wikimedia.org/T287574 (10Andrew) Restarting haproxies in wmcs is fairly harmless, just ping when it's time.
[17:22:22] did something change mariadb authentication? first i got an error that the testreduce nodejs client doesn't support authentication mechanisms for mariadb ... i upgraded the library and now i get an 'ER_ACCESS_DENIED_NO_PASSWORD_ERROR' ... this is on testreduce1001.eqiad.wmnet
[17:26:07] <_joe_> subbu: this is probably not the right channel to ask, but it's possible some database in eqiad is under maintenance
[17:26:20] <_joe_> but to your question: nothing I'm aware of changed
[17:26:34] ok .. what is the dba channel?
[17:27:05] oh .. actually this is not a production db.
[17:27:50] it is a local db on testreduce1001 .. so i wonder if some puppet patch accidentally modified the password.
[17:28:42] db puppetization doesn't tend to modify things automatically
[17:31:22] hmm .. i cannot connect with the mysql command-line client either now ("Access denied for user 'testreduce'@'localhost'") .. so something changed on that server.
[17:32:24] https://sal.toolforge.org/log/aCCp8XoB1jz_IcWufNo8 maybe?
[18:35:20] <_joe_> yeah that seems likely
[18:35:28] <_joe_> subbu: where can I find the credentials on disk?
[21:38:23] 10serviceops, 10MW-on-K8s, 10SRE, 10Patch-For-Review, 10User-jijiki: Create a variant of mediawiki-multiversion which installs php-tideways-xhprof - https://phabricator.wikimedia.org/T287495 (10dancy) @Joe Please try docker-registry.discovery.wmnet/restricted/mediawiki-multiversion-debug:2021-07-29-2046...
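On the testreduce credential mystery above: a quick way to separate "the password on disk no longer matches what the database expects" from "the client library cannot handle the server's auth plugin" is to retry the same credentials from a throwaway script. A sketch with PyMySQL follows, using the user and host from the log; the database name is assumed, and the password is read from an environment variable as a stand-in for the service's config file on testreduce1001.

"""Sketch: verify the testreduce DB credentials independently of the nodejs client."""
import os

import pymysql

try:
    conn = pymysql.connect(
        host="localhost",
        user="testreduce",
        password=os.environ.get("TESTREDUCE_DB_PASS", ""),
        database="testreduce",  # assumed database name
        connect_timeout=5,
    )
except pymysql.err.OperationalError as exc:
    # An access-denied error here means the credentials themselves are wrong;
    # an auth-plugin error instead would point back at the client library.
    print("connection failed:", exc)
else:
    with conn.cursor() as cur:
        cur.execute("SELECT VERSION()")
        print("credentials are fine; server:", cur.fetchone()[0])
    conn.close()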
[23:05:11] 10serviceops, 10SRE, 10Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (10Jclark-ctr) @Dzahn Thanks! Putting the rest in A will speed up racking; currently I was waiting on rack C