[00:31:47] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling)
[01:24:43] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) Initial ab run: {P32126}
[05:51:35] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimaryRunningReadOnly(). While running ab -...
[05:52:17] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling)
[07:08:18] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) > I looked more closely at one of them with tcpdump So are cross-DC connections happening in plain text?
[07:17:26] I'm temporarily switching kubetcd2005 to DRBD to empty a Ganeti node for reimage, latency may go up for a while
[07:41:41] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at surrounding (DC-local) memcached traffic.
[07:45:09] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10tstarling) All cross-DC connections except the first had an associated statsd metric `MediaWiki.wanobjectcache.rdbms_server_readonly.hit.refresh`, which imp...
[07:45:41] kubetcd2005 is back to normal
[07:46:31] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10jcrespo) >>! In T279664#8122493, @tstarling wrote: > @jcrespo No, cross-DC DB connections are encrypted but you can figure out what's going on by looking at...
[09:18:55] 10serviceops, 10Performance-Team, 10SRE, 10Traffic, 10Patch-For-Review: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10Joe) >>! In T279664#8122378, @tstarling wrote: > A possible reason for the slightly slower times on codfw is cross-DC connections for LoadBalancer::isPrimar...
[10:55:24] 10serviceops, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon)
[10:55:54] 10serviceops, 10Performance-Team, 10SRE, 10SRE-swift-storage, and 2 others: Progressive Multi-DC roll out - https://phabricator.wikimedia.org/T279664 (10MatthewVernon) Are you proposing to do away with the concept of "active" DC, then? e.g. currently `swiftrepl` runs from the active DC to fix up where MW f...
[13:57:52] Hello ServiceOps, looks like jobqueue latency is high again, ref: https://phabricator.wikimedia.org/T300914 . Is anyone able to help out on this?
[14:03:08] ^ the backlog dropping to 0 once it reaches 7 days feels like Kafka offsets are being reset to earliest by changeprop
[14:36:55] <_joe_> inflatador: what are you seeing specifically in that dashboard?
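(Aside on the consumer-lag discussion above: a minimal sketch, not part of the conversation, of checking the cpjobqueue consumer group's per-partition lag and committed offsets directly with kafka-python, as a cross-check of the Grafana dashboard. The broker address and topic name below are placeholders/assumptions; only the consumer group name comes from the dashboard URL. Watching the committed offsets over time would show whether they ever jump when the backlog reaches the retention window.)

```python
# Sketch: report per-partition lag for the cpjobqueue consumer group.
# Broker and topic are assumed placeholder values, not confirmed ones.
from kafka import KafkaConsumer, TopicPartition

BROKER = "kafka-main1001.eqiad.wmnet:9092"                # placeholder broker
TOPIC = "eqiad.mediawiki.job.cirrusSearchElasticaWrite"   # assumed topic name
GROUP = "cpjobqueue-cirrusSearchElasticaWrite"            # group from the dashboard URL

consumer = KafkaConsumer(
    bootstrap_servers=BROKER,
    group_id=GROUP,
    enable_auto_commit=False,  # read-only: never move the group's offsets
)

partitions = [TopicPartition(TOPIC, p) for p in consumer.partitions_for_topic(TOPIC)]
end_offsets = consumer.end_offsets(partitions)  # latest offset per partition

for tp in sorted(partitions, key=lambda t: t.partition):
    committed = consumer.committed(tp) or 0     # group's last committed offset
    print(f"partition {tp.partition}: committed={committed} lag={end_offsets[tp] - committed}")

consumer.close()
```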
[14:38:04] _joe_ I believe https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=cpjobqueue-cirrusSearchElasticaWrite&from=now-90d&to=now is the dashboard that tipped off dcausse, he might have more info
[14:38:15] <_joe_> looks like the problem started in May
[14:42:17] we're trying to find an explanation for this, but we believe it's because changeprop is not giving enough "concurrency" to this particular job
[14:54:08] Hi, papaul needs some mc servers shut down for maintenance. Who is best to ask?
[14:54:34] _joe_, vgutierrez: ^
[14:55:07] <_joe_> RhinosF1: not me right now, I'm about to enter an interview
[14:55:11] that would be the serviceops team IIRC
[14:55:36] <_joe_> jayme: around?
[14:55:52] _joe_: this needs doing ASAP if you can find someone.
[14:56:43] <_joe_> RhinosF1: let me talk with papaul
[14:56:49] <_joe_> thanks, I'll take it from here
[14:56:54] thanks
[15:28:36] _hoyeah
[15:28:41] oops
[15:28:49] _joe_: yeah :)
[15:29:49] sorry, was out for groceries ... looks like it's handled? Let me know if not
[15:59:57] <_joe_> ask moritz, I was in an interview :P
[16:01:43] on the jobqueue issue - I have been looking at the health of the pods etc. and I can't see anything wrong, the QoS strategy seems to be working etc.
[16:01:57] but one thing I talked with petr about before he left was increasing the number of workers (currently at 1)
[16:03:19] It might have no effect but I'd like to give it a brief spin (although I am aware it might bump memory requirements a bit also, so it could be a very brief experiment) https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/819627
[16:06:46] <_joe_> hnowlan: +1, you can also increase memory a bit if needed
[16:10:18] <_joe_> but something definitely changed around the start of May, and got worse on May 27th
[16:11:28] <_joe_> so, what I see is that the average time for completing a job went from ~160 ms in April to ~600 ms now
[16:11:44] <_joe_> which would explain why we're not keeping up if we use the same concurrency as before
[16:11:46] rather than changed, it might have been triggered or degraded. This has been happening on and off in some way since we moved to k8s, I think. Some jobs end up getting assigned to pods that underperform in some manner or other
[16:12:04] <_joe_> see March:
[16:12:06] <_joe_> https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=1646181707765&to=1648482697378&viewPanel=3
[16:12:31] <_joe_> and now: https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-job=cirrusSearchElasticaWrite&from=now-30d&to=now&viewPanel=3
[16:12:48] Some jobs start to see degraded performance from insufficient concurrency despite ample resources on the pods themselves - I suspect that's because of the scattershot way jobs are assigned: some pods end up with multiple jobs and some have none. It's a frustrating limitation of changeprop itself
[16:12:49] <_joe_> running a single job takes about 4x the time
[16:13:02] <_joe_> hnowlan: I don't think that's the case here
[16:13:44] <_joe_> given a value of concurrency, you'll complete X jobs per hour at latency Y, and 1/4 * X if latency is 4*Y
[16:14:11] <_joe_> so the cause here is not gremlins in changeprop, but degraded performance of the search jobs
[16:14:21] <_joe_> maybe the cure is raising the concurrency?
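(Aside, not part of the log: _joe_'s point above is just Little's law for a fixed pool of in-flight jobs: completion rate ≈ concurrency / mean job time, so a 4x latency increase cuts throughput to 1/4 at the same concurrency. A quick sketch with the ~160 ms and ~600 ms figures from the discussion; the concurrency value and incoming rate are illustrative assumptions, not measured numbers.)

```python
# Back-of-the-envelope throughput model for a fixed number of in-flight jobs.
# The 0.160 s / 0.600 s figures come from the discussion above; the
# concurrency and incoming rate are assumptions for illustration only.

def jobs_per_second(concurrency: float, mean_job_time_s: float) -> float:
    """Steady-state completion rate when `concurrency` jobs run at once,
    each taking `mean_job_time_s` on average."""
    return concurrency / mean_job_time_s

before = jobs_per_second(concurrency=10, mean_job_time_s=0.160)  # ~62.5 jobs/s
after = jobs_per_second(concurrency=10, mean_job_time_s=0.600)   # ~16.7 jobs/s
print(f"same concurrency: {before:.1f} -> {after:.1f} jobs/s (~1/4, matching the 4x latency)")

# Concurrency needed to keep up with a hypothetical incoming rate at 600 ms/job:
incoming_rate = 50.0  # jobs/s, assumed
print(f"sustaining {incoming_rate:.0f} jobs/s at 600 ms/job needs ~{incoming_rate * 0.600:.0f} in-flight jobs")
```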
[16:14:42] concurrency is already set to 150 but it seems it only gets 10
[16:15:05] yeah, the configured concurrency very rarely corresponds to the achieved concurrency for many jobs
[16:15:56] I'm happy to deploy https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/819627 now to see what happens, if that sounds ok
[16:16:26] job time increased at the end of February because previously all updates were being queued here; now only the "slow" cloudelastic cluster is routed to this jobqueue
[16:16:51] hnowlan: if you think it might help I'm all for it :)
[16:17:56] we discussed with ebernhardson hacking the partitioning of this particular job so that more Kafka consumers get assigned, hopefully leading to increased concurrency
[16:18:47] (hoping that the consumers get spread across different pods)
[16:27:09] 10serviceops, 10SRE, 10SRE-OnFire, 10serviceops-collab, 10Patch-For-Review: productionize 'sremap' and 'filter_victorops_calendar' under sretools.wikimedia.org - https://phabricator.wikimedia.org/T313355 (10JMeybohm) >>! In T313355#8091114, @CDanis wrote: > filter_victorops_calendar requires some persist...
[17:24:24] seeing a bit of a bump in jobqueue throughput but I'm not sold on the idea of it being a fix. Will look at it more tomorrow
[18:06:03] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10RLazarus) Due to T309956 I'm moving ahead with mc2038 early, and using it to replace mc2024 which is currently out of service.
[18:18:52] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9144f946-b4bf-404f-ac18-b48c723e759c) set by rzl@cumin2002 for 2:00:00 on 1 host(s) and their services with reason: install ` mc2038.codfw...
[18:42:46] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10RLazarus) !log rzl@cumin2002 START - Cookbook sre.hosts.remove-downtime for mc2038.codfw.wmnet !log rzl@cumin2002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for mc2038.codfw.wmnet
[18:46:06] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=819c43c7-89c1-4902-b636-e59fabe9011d) set by rzl@cumin2002 for 1:00:00 on 1 host(s) and their services with reason: install ` mc2038.codfw...
[19:17:31] 10serviceops, 10Patch-For-Review: Productionise mc20[38-55] - https://phabricator.wikimedia.org/T293012 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1da0e7c2-1678-4926-858d-8864d2200918) set by rzl@cumin2002 for 1 day, 0:00:00 on 1 host(s) and their services with reason: install ` mc203...
[20:37:03] 10serviceops, 10Gerrit, 10SRE, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster
[21:11:37] 10serviceops, 10Gerrit, 10SRE, 10serviceops-collab, and 2 others: replacement for gerrit2001 - https://phabricator.wikimedia.org/T243027 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin2002 for host gerrit2002.wikimedia.org with OS buster completed: - gerrit2002 (**PASS**)...
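(Aside on the partitioning idea dcausse mentions above: a conceptual sketch of one way to spread a single hot job type over more Kafka partitions, so that more consumers - ideally on different pods - share the work. This is not changeprop's or cirrus's actual implementation; the broker, topic name, and key-salting scheme are assumptions for illustration.)

```python
# Sketch: salt the message key so one job type fans out over several
# partitions instead of hashing to a single one. Broker, topic and the
# salt scheme are assumed placeholder values, not changeprop's real code.
import json
import random

from kafka import KafkaProducer

FANOUT = 8  # hypothetical number of key salts, at most the topic's partition count

producer = KafkaProducer(
    bootstrap_servers="kafka-main1001.eqiad.wmnet:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode(),
)

def enqueue(topic: str, job_type: str, params: dict) -> None:
    # With a constant key, every message for this job type hashes to the same
    # partition and therefore one consumer; a salted key spreads the same job
    # type over up to FANOUT partitions, allowing up to FANOUT consumers.
    salted_key = f"{job_type}-{random.randrange(FANOUT)}".encode()
    producer.send(topic, key=salted_key, value={"type": job_type, "params": params})

enqueue("eqiad.mediawiki.job.cirrusSearchElasticaWrite",  # assumed topic name
        "cirrusSearchElasticaWrite", {"cluster": "cloudelastic"})
producer.flush()
```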