[00:31:47] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling)
[01:24:43] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Initial ab run: {P32126}
[05:51:35] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -...
[05:52:17] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling)
[07:08:18] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) > I looked more closely at one of them with tcpdump So are cross-DC connections happening in plain text?
[07:17:26] I'm temporarily switching kubetcd2005 to DRBD to empty a Ganeti node for reimage, latency may go up for a while
[07:41:41] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.
[07:45:09] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) All cross-DC connections except the first had an associated statsd metric `MediaWiki.wanobjectcache.rdbms_server_readonly.hit.refresh`, which imp...
[07:45:41] kubetcd2005 is back to normal
[07:46:31] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) >>! In T279664#8122493, @tstarling wrote: > @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at...
[09:18:55] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Joe) >>! In T279664#8122378, @tstarling wrote: > A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimar...
[10:55:24] 10serviceops, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon)
[10:55:54] 10serviceops, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon) Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs from the active DC to fix up where MW f...
[13:57:52] Hello ServiceOps, looks like jobqueue latency is high again, ref: https://phabricator.wikimedia.org/T300914 . Is anyone able to help out on this?
[14:03:08] ^ the backlog dropping to 0 once it reaches 7 days feels like Kafka offsets are being reset to earliest by changeprop
[14:36:55] <_joe_> inflatador: what are you seeing specifically in that dashboard?
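(Aside on the consumer-lag discussion above: a minimal sketch, not part of the conversation, of checking the cpjobqueue consumer group's per-partition lag and committed offsets directly with kafka-python, as a cross-check of the Grafana dashboard. The broker address and topic name below are placeholders/assumptions; only the consumer group name comes from the dashboard URL. Watching the committed offsets over time would show whether they ever jump when the backlog reaches the retention window.)

```python
# Sketch: report per-partition lag for the cpjobqueue consumer group.
# Broker and topic are assumed placeholder values, not confirmed ones.
from kafka import KafkaConsumer, TopicPartition

BROKER = "kafka-main1001.eqiad.wmnet:9092"                # placeholder broker
TOPIC = "eqiad.mediawiki.job.cirrusSearchElasticaWrite"   # assumed topic name
GROUP = "cpjobqueue-cirrusSearchElasticaWrite"            # group from the dashboard URL

consumer = KafkaConsumer(
    bootstrap_servers=BROKER,
    group_id=GROUP,
    enable_auto_commit=False,  # read-only: never move the group's offsets
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

for tp in sorted(partitions, key=lambda t: t.partition):
    committed = consumer.committed(tp) or 0     # group's last committed offset
    print(f"partition {tp.partition}: committed={committed} lag={end_offsets[tp] - committed}")

consumer.close()
```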
[14:38:04] _joe_ I believe https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-90d&to=now is the dashboard that tipped off dcausse, he might have more info
[14:38:15] <_joe_> looks like the problem started in May
[14:42:17] we're trying to find an explanation for this, but we believe it's because changeprop is not giving enough "concurrency" to this particular job
[14:54:08] Hi, papaul needs some mc servers shut down for maintenance. Who is best to ask?
[14:54:34] _joe_, vgutierrez: ^
[14:55:07] <_joe_> RhinosF1: not me right now, I'm about to enter an interview
[14:55:11] that would be the serviceops team IIRC
[14:55:36] <_joe_> jayme: around?
[14:55:52] _joe_: this needs doing ASAP if you can find someone.
[14:56:43] <_joe_> RhinosF1: let me talk with papaul
[14:56:49] <_joe_> thanks, I'll take it from here
[14:56:54] thanks
[15:28:36] _hoyeah
[15:28:41] oops
[15:28:49] _joe_: yeah :)
[15:29:49] sorry, was out for groceries ... looks like it's handled? Let me know if not
[15:59:57] <_joe_> ask moritz, I was in an interview :P
[16:01:43] on the jobqueue issue - I have been looking at the health of the pods etc. and I can't see anything wrong, the QoS strategy seems to be working etc.
[16:01:57] but one thing I talked with petr about before he left was increasing the number of workers (currently at 1)
[16:03:19] It might have no effect but I'd like to give it a brief spin (although I am aware it might bump memory requirements a bit also, so it could be a very brief experiment) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/819627
[16:06:46] <_joe_> hnowlan: +1, you can also increase memory a bit if needed
[16:10:18] <_joe_> but something definitely changed around the start of May, and got worse on May 27th
[16:11:28] <_joe_> so, what I see is that the average time for completing a job went from ~160 ms in April to ~600 ms now
[16:11:44] <_joe_> which would explain why we're not keeping up if we use the same concurrency as before
[16:11:46] rather than changed, it might have been triggered or degraded. This has been happening on and off in some way since we moved to k8s, I think. Some jobs end up getting assigned to pods that underperform in some manner or other
[16:12:04] <_joe_> see March:
[16:12:06] <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=1646181707765&to=1648482697378&viewPanel=3
[16:12:31] <_joe_> and now: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=now-30d&to=now&viewPanel=3
[16:12:48] Some jobs start to see degraded performance from insufficient concurrency despite ample resources on the pods themselves - I suspect that's because of the scattershot way jobs are assigned: some pods end up with multiple jobs and some have none. It's a frustrating limitation of changeprop itself
[16:12:49] <_joe_> running a single job takes about 4x the time
[16:13:02] <_joe_> hnowlan: I don't think that's the case here
[16:13:44] <_joe_> given a value of concurrency, you'll complete X jobs per hour at latency Y, and 1/4 * X if latency is 4*Y
[16:14:11] <_joe_> so the cause here is not gremlins in changeprop, but degraded performance of the search jobs
[16:14:21] <_joe_> maybe the cure is raising the concurrency?
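(Aside, not part of the log: _joe_'s point above is just Little's law for a fixed pool of in-flight jobs: completion rate ≈ concurrency / mean job time, so a 4x latency increase cuts throughput to 1/4 at the same concurrency. A quick sketch with the ~160 ms and ~600 ms figures from the discussion; the concurrency value and incoming rate are illustrative assumptions, not measured numbers.)

```python
# Back-of-the-envelope throughput model for a fixed number of in-flight jobs.
# The 0.160 s / 0.600 s figures come from the discussion above; the
# concurrency and incoming rate are assumptions for illustration only.

def jobs_per_second(concurrency: float, mean_job_time_s: float) -> float:
    """Steady-state completion rate when `concurrency` jobs run at once,
    each taking `mean_job_time_s` on average."""
    return concurrency / mean_job_time_s

before = jobs_per_second(concurrency=10, mean_job_time_s=0.160)  # ~62.5 jobs/s
after = jobs_per_second(concurrency=10, mean_job_time_s=0.600)   # ~16.7 jobs/s
print(f"same concurrency: {before:.1f} -> {after:.1f} jobs/s (~1/4, matching the 4x latency)")

# Concurrency needed to keep up with a hypothetical incoming rate at 600 ms/job:
incoming_rate = 50.0  # jobs/s, assumed
print(f"sustaining {incoming_rate:.0f} jobs/s at 600 ms/job needs ~{incoming_rate * 0.600:.0f} in-flight jobs")
```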
[16:14:42] concurrency is already set to 150 but it seems it only gets 10
[16:15:05] yeah, the configured concurrency very rarely corresponds to the achieved concurrency for many jobs
[16:15:56] I'm happy to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/819627 now to see what happens, if that sounds ok
[16:16:26] job time increased at the end of February because previously all updates were being queued here; now only the "slow" cloudelastic cluster is routed to this jobqueue
[16:16:51] hnowlan: if you think it might help I'm all for it :)
[16:17:56] we discussed with ebernhardson hacking the partitioning of this particular job so that more Kafka consumers get assigned, hopefully leading to increased concurrency
[16:18:47] (hoping that the consumers get spread across different pods)
[16:27:09] 10serviceops, 10SRE, 10SRE-OnFire, 10serviceops-collab, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10JMeybohm) >>! In T313355#8091114, @CDanis wrote: > filter_victorops_calendar requires some persist...
[17:24:24] seeing a bit of a bump in jobqueue throughput but I'm not sold on the idea of it being a fix. Will look at it more tomorrow
[18:06:03] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10RLazarus) Due to T309956 I'm moving ahead with mc2038 early, and using it to replace mc2024 which is currently out of service.
[18:18:52] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9144f946-b4bf-404f-ac18-b48c723e759c) set by rzl@cumin2002 for 2:00:00 on 1 host(s) and their services with reason: install ` mc2038.codfw...
[18:42:46] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10RLazarus) !log rzl@cumin2002 START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
[18:46:06] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=819c43c7-89c1-4902-b636-e59fabe9011d) set by rzl@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: install ` mc2038.codfw...
[19:17:31] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1da0e7c2-1678-4926-858d-8864d2200918) set by rzl@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: install ` mc203...
[20:37:03] 10serviceops, 10Gerrit, 10SRE, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster
[21:11:37] 10serviceops, 10Gerrit, 10SRE, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster completed: - gerrit2002 (**PASS**)...
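(Aside on the partitioning idea dcausse mentions above: a conceptual sketch of one way to spread a single hot job type over more Kafka partitions, so that more consumers - ideally on different pods - share the work. This is not changeprop's or cirrus's actual implementation; the broker, topic name, and key-salting scheme are assumptions for illustration.)

```python
# Sketch: salt the message key so one job type fans out over several
# partitions instead of hashing to a single one. Broker, topic and the
# salt scheme are assumed placeholder values, not changeprop's real code.
import json
import random

from kafka import KafkaProducer

FANOUT = 8  # hypothetical number of key salts, at most the topic's partition count

producer = KafkaProducer(
    bootstrap_servers="kafka-main1001.eqiad.wmnet:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def enqueue(topic: str, job_type: str, params: dict) -> None:
    # With a constant key, every message for this job type hashes to the same
    # partition and therefore one consumer; a salted key spreads the same job
    # type over up to FANOUT partitions, allowing up to FANOUT consumers.
    salted_key = f"{job_type}-{random.randrange(FANOUT)}".encode()
    producer.send(topic, key=salted_key, value={"type": job_type, "params": params})

enqueue("eqiad.mediawiki.job.cirrusSearchElasticaWrite",  # assumed topic name
        "cirrusSearchElasticaWrite", {"cluster": "cloudelastic"})
producer.flush()
```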