[09:59:23] slyngs: please replace T123456 with a real task number in https://gerrit.wikimedia.org/r/c/operations/puppet/+/789570/.. that's a real task so that just added a confusing gerritbot message in there
[09:59:23] T123456: Special:CentralAuth reports account attachment, which - being standalone - is confusing, report account creation as well - https://phabricator.wikimedia.org/T123456
[12:13:34] hello :) I have a couple of patches for CI which I have deployed and could now use a puppet-merge please: https://gerrit.wikimedia.org/r/c/operations/puppet/+/774525 and https://gerrit.wikimedia.org/r/c/operations/puppet/+/774771
[12:14:07] they affect solely a WMCS instance, no change to prod ;)
[12:35:05] hashar: merged
[12:35:27] jbond: thx ;)
[12:35:41] still have to poke at the gerrit logout script but I was busy with other duties this week ;)
[12:35:59] ack no worries just ping when free
[13:09:42] tfw the reboot-single command dies with:
[13:09:46] bash: line 1: depool: command not found
[13:17:10] klausman: where's `reboot-single` from?
[13:22:53] cumin
[13:22:59] as in the cookbook
[13:23:13] so "command" is not really the right term
[13:24:23] it runs 'depool' on the target host; if it's a confctl-managed load-balanced host it will get depooled
[13:24:39] only if you run the cookbook with --depool though
[13:45:23] Dear SRE, do we have any eqiad<->codfw latency SLOs?
[13:45:36] (if that sounds like a loaded question... that's because it is! :D)
[13:56:45] ottomata: since eqiad<->codfw is served by multiple direct links, latency variance probably isn't really a primary issue. We kind of "know" more or less what it can be in various link-failure scenarios, assuming no saturation, and assuming you're not getting into fine details at the jitter level.
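[editor's note: the cross-DC latency being discussed here can be sanity-checked from any host by timing a TCP connect. A minimal sketch, assuming nothing about WMF tooling — the host and port below are placeholders, not real Wikimedia endpoints:]

```python
# Sketch: estimate round-trip latency to a remote host by timing TCP
# connects, in the spirit of the eqiad<->codfw latency figures discussed.
# The host/port used by the caller are placeholders, not WMF endpoints.
import socket
import time

def tcp_rtt_ms(host: str, port: int = 443, samples: int = 5) -> float:
    """Return the median TCP connect time to host:port, in milliseconds."""
    times = []
    for _ in range(samples):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=5):
            pass
        times.append((time.monotonic() - start) * 1000)
    times.sort()
    return times[len(times) // 2]
```

[note that a TCP handshake measures one round trip plus kernel overhead, so this only approximates the link RTT; ICMP ping or smokeping-style probes are the usual tools for the real thing.]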
[13:57:07] I mean, it's an indirect measure of some things, but we'd probably first be interested in tracking the more-primary issues there
[13:57:42] like: reliability of those 3 direct links, and even how often more than one link outage of the 3 collides/overlaps, and then also saturation of the link(s)
[13:57:57] so long as the link is nowhere near saturation, there's no reason to expect a great variance in network latency
[13:58:15] but if it is regularly saturating due to heavy flows, then the saturation is the problem we really care about
[13:58:28] (and latency is a follow-on indicator of that problem, along with some loss probably)
[13:59:46] So the fact that I see ~33ms latency is pretty standard?
[14:00:23] https://wikitech.wikimedia.org/wiki/Network_design#/media/File:Wikimedia_network_overview.png
[14:00:48] 👀
[14:00:49] ^ we have multiple waves. some of them could be outaged or in maintenance at any time, but they're all fairly close in latency
[14:01:02] so long as the links aren't saturating, which is key :)
[14:01:10] ok, here's why I ask: :)
[14:01:23] we currently use Kafka MirrorMaker for Kafka cross-DC replication
[14:01:36] e.g. we have 2 'main' Kafka clusters
[14:01:44] and ideally they both have exactly the same data.
[14:02:09] (I have another meeting to jump to, but I'll try to follow async later!)
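[editor's note: the stretch-cluster alternative discussed next hinges on Kafka's rack awareness (broker.rack / client.rack, KIP-392 follower fetching). A minimal sketch of the relevant settings as plain config dicts — the config keys are real Kafka options, but every value and hostname below is hypothetical, not actual WMF configuration:]

```python
# Sketch of the Kafka settings a cross-DC stretch cluster relies on.
# broker.rack / client.rack are real Kafka config keys (KIP-392
# follower fetching); all values and hostnames here are hypothetical.

# Broker side: each broker advertises which DC ("rack") it lives in.
broker_config = {
    "broker.rack": "eqiad",            # codfw brokers would set "codfw"
    # keep partitions writable even if one DC's replicas are down:
    "default.replication.factor": 4,   # e.g. 2 replicas per DC
    "min.insync.replicas": 2,
}

# Client side: a consumer names its own rack, so it fetches from the
# nearest replica instead of always crossing the DC link to the leader.
consumer_config = {
    "bootstrap.servers": "kafka1001.example:9092,kafka2001.example:9092",
    "group.id": "stream-app",
    "client.rack": "codfw",            # prefer codfw replicas when possible
}
```

[with this in place, leader election on broker shutdown behaves as described below: kill the eqiad brokers and the remaining in-sync replicas in codfw take over as partition leaders.]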
[14:02:12] (k)
[14:02:44] however, they are still distinct clusters, which means doing any kind of active/active streaming application on top of them requires a lot of application-specific logic and manual maintenance
[14:02:50] same for active/passive with failover
[14:03:14] an alternative means of achieving the same thing is a Kafka stretch cluster
[14:03:23] one Kafka cluster with brokers in both DCs
[14:03:50] this is possible with Kafka now because of follower fetching and broker and client 'rack' awareness
[14:04:42] this would allow making one distributed streaming app with workers in both DCs, relying on the usual Kafka client failover bits
[14:05:21] shutting down Kafka brokers and clients in eqiad would just mean that all the new partition leaders would be elected in codfw, and work would all be done by consumers in codfw
[14:05:53] from blog posts and talks I've watched, they recommend not doing Kafka stretch clusters if latency > 100ms
[14:06:27] to be clear, this doesn't mean more cross-DC network traffic than we are already doing; this would be instead of MirrorMaker
[14:23:16] ottomata: yeah bblack summarized it well
[14:26:41] we don't have SLOs though; there is the current latency of the existing circuits, which has been very stable
[14:26:41] providers have SLAs, but those are usually quite high and we would get reimbursed peanuts if they didn't respect them
[14:27:54] there is also a backup path through Chicago (~55ms end to end)
[14:29:47] but that would be the first explicit "we need < X ms end-to-end latency" requirement, and it's not just codfw to eqiad, but also the routers and switches in between the hosts themselves
[14:31:24] I don't think < 100ms would be an issue, but we should discuss it in more detail on a task/design doc/meeting
[14:34:16] okay awesome, this is actually more hopeful than I expected
[14:36:15] thank you
[15:07:17] hnowlan: Thank you ref. https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/767878 !
Agreed it's moderately risky :)
[16:14:54] <_joe_> TheresNoTime: uhm
[16:15:07] <_joe_> the problems that arose from the old parsoid are not here anymore indeed
[16:15:46] <_joe_> so for instance we won't have the multiplication of backend calls due to luasandbox usage
[16:16:39] * TheresNoTime is waiting for the "but..." :p
[16:16:52] <_joe_> nah I was just reasoning out loud
[16:17:18] <_joe_> one q: did anyone try to re-render some of those pages by directly requesting them from restbase with a cache-busting query string?
[16:17:31] oh phew :D and... I am not sure
[16:17:49] <_joe_> that's what I'd do if I wanted to play it safe :)
[16:17:51] explicitly setting the revision ID requested to the latest works
[16:17:58] <_joe_> that too yes
[16:18:04] <_joe_> the pages load?
[16:20:41] they do :)
[17:22:07] I just disabled the last active SUBVERSION repository on Phabricator. (lol) tell me if you miss svn for toolserver :pp https://phabricator.wikimedia.org/diffusion/TSVN/
[17:22:22] going to remove the subversion package from the server after this
[17:38:11] very glad I don't have to deal with svn :D
[17:57:48] bah, Phabricator does complain about the missing svn now.. even though I disabled that repo. Guess I have to delete it or "ignore setup issue"
[18:01:33] in another matter: we are now going to upgrade docker (and docker.io -> docker-ce) on the CI servers
[19:43:18] mutante: I'd just ignore it; pretty sure it's a soft requirement whether you intend to use it or not
[19:51:28] RhinosF1: the options were: "tell phab to ignore it", "really delete, not just disable, the remaining 4 or 5 SVN repos", or "reinstall the package"
[19:51:32] I picked the first, yea
[19:52:15] mutante: seems sensible
[22:46:40] Is there a list of server SSH key fingerprints I can verify before connecting?
[22:54:10] brett: yes, https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints
[22:54:21] bd808: Thanks a lot!
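[editor's note: the fingerprint check discussed at the end can be done by hand: grab the host key with `ssh-keyscan` and compute its SHA256 fingerprint the way `ssh-keygen -lf` prints it, then compare against the wikitech list. A minimal sketch — the example hostname is illustrative, not a real Wikimedia host:]

```python
# Sketch: compute an OpenSSH-style SHA256 fingerprint from an
# `ssh-keyscan` output line, for comparison against a published list
# such as https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints.
import base64
import hashlib

def ssh_fingerprint(keyscan_line: str) -> str:
    """Return 'SHA256:<digest>' as `ssh-keygen -lf` would print it."""
    # keyscan lines look like: "<host> <keytype> <base64 key blob>"
    blob_b64 = keyscan_line.split()[2]
    digest = hashlib.sha256(base64.b64decode(blob_b64)).digest()
    # OpenSSH strips the trailing '=' padding from the base64 digest
    return "SHA256:" + base64.b64encode(digest).decode().rstrip("=")
```

[usage would be along the lines of `ssh_fingerprint(subprocess.check_output(["ssh-keyscan", "-t", "ed25519", "bastion.example"], text=True).splitlines()[0])`, then comparing the result against the published fingerprint before first connecting.]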