[07:43:14] As a heads up: I'll be running some docker pull performance tests again. Those will take place on all non-ml kubernetes nodes in eqiad, which currently point to docker-registry nodes in eqiad. So there "should" not be that much cross-dc traffic (due to local caches in the docker-registry) resulting from that.
[08:16:58] Krinkle: i think it's too hard to tell long-term trends yet
[08:17:06] (which does mean that if we are sinking, it's _slowly_)
[08:27:17] XioNoX: o/ was the netflow sampling changed? Nothing against it, but I'd ping Analytics in case so they can watch for any data size change (shouldn't be a big problem but better safe than sorry)
[08:28:57] elukey: yes, it was added to "transport" links on the CRs in eqiad (i.e. our internal links to other DCs)
[08:29:02] https://phabricator.wikimedia.org/T286038
[08:29:23] only a temporary measure to get some statistics.
[08:29:56] sounds like a good suggestion to talk to Analytics and check that though - thanks!
[08:30:43] topranks: it shouldn't be a big issue, but netflow traffic is ingested into HDFS and then indexed into Druid (to be queryable by Turnilo)
[08:31:11] so better to check size as new traffic comes in
[08:31:27] elukey: it should be peanuts compared to what we're already sampling
[08:32:20] XioNoX: ack, then all good, I was doing the paranoid SRE :D
[08:32:49] elukey: thanks!
[08:38:15] elukey: is superset real time?
[08:38:49] XioNoX: if you pull data from the netflow druid datasource it should be the same as turnilo
[08:39:23] (so not from the presto one, since it uses hive and hdfs, hence relying on the batch jobs)
[08:39:53] there was wmf_netflow and netflow (and a few others)
[08:40:19] should be wmf_netflow in theory, but lemme check what we set for turnilo
[08:40:45] yep, wmf_netflow
[08:41:18] interesting, I can't match on a regex with the druid one
[08:50:42] XioNoX: regexes are magic, druids are magic, you can't mix different schools of magic like that, it's dangerous
[08:50:55] :)
[13:02:16] kormat: ack. I noticed a big change in the graphs. I'm guessing that's from reimaging, and they'll sync up again on their own after that? Anyway, I've added a DC option to the PC dashboard so that we can view codfw in the meantime for continuity
[13:05:37] Krinkle: we haven't reimaged any of the pc hosts.. ?
[13:06:19] SAL says otherwise
[13:07:16] can you point me to what you're looking at?
[13:07:44] https://sal.toolforge.org/production?p=0&q=robh&d=2021-07-15
[13:08:13] ahh. that's work on the new pc hosts, which aren't in service yet
[13:08:15] Those are the new ones
[13:08:17] Yeah
[13:08:24] right. they're not in use
[13:08:26] It did cause a big drop in stats though
[13:09:24] https://grafana.wikimedia.org/d/000000106/parser-cache?viewPanel=3&orgId=1&from=now-7d&to=now&var-contentModel=wikitext&var-dc=eqiad&var-source_ops=eqiad%20prometheus%2Fops
[13:10:07] So something must've started or stopped counting those hosts at that point
[13:10:10] o.O
[13:10:47] the new hosts are in role::insetup
[13:11:30] ah. i see. the dashboard doesn't look at what's in use
[13:11:37] it just adds up the disk space for all pc* hosts
[13:11:44] ok, so that's not very useful in this case :)
[13:11:57] you're better off looking at the disk space used on individual nodes
[13:14:20] Is there a relevant tag that makes it through to these general host stats, indicating the role or host group of some kind?
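
(Editor's note on the regex exchange at [08:41:18]: the limitation appears to be in the client tooling rather than Druid itself, since Druid's native query API does accept a "regex" dimension filter. Below is a minimal sketch of querying the datasource directly over HTTP; the broker URL and the dimension/metric names are illustrative assumptions, and only the wmf_netflow datasource name comes from the log.)

    import json
    import requests

    # Hypothetical broker endpoint; Druid brokers accept native JSON queries
    # via POST on /druid/v2.
    BROKER = "http://druid-broker.example.org:8082/druid/v2"

    query = {
        "queryType": "timeseries",
        "dataSource": "wmf_netflow",  # datasource name from the log
        "granularity": "hour",
        "intervals": ["2021-07-15T00:00:00Z/2021-07-16T00:00:00Z"],
        # Druid's native "regex" filter matches a dimension against a Java
        # regex; the dimension name here is an illustrative assumption.
        "filter": {"type": "regex", "dimension": "as_name_dst",
                   "pattern": "(?i)wikimedia"},
        # The metric name is likewise an assumption.
        "aggregations": [{"type": "longSum", "name": "bytes",
                          "fieldName": "bytes"}],
    }

    resp = requests.post(BROKER, json=query, timeout=30)
    resp.raise_for_status()
    print(json.dumps(resp.json(), indent=2))
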
[13:15:07] I think "cluster" is one, but that's just "misc"
[13:16:29] Anyway, I'll change this up a bit later to stack the hostnames for transparency.
[13:16:56] Krinkle: shard=pc1
[13:17:36] ah, host-level stats. good question. i'm kinda assuming no
[13:30:01] The dashboard looks really good, shame it won't be like that for long
[13:43:25] is anybody currently working on sessionstore?
[13:46:54] is it active-active or active-passive?
[13:47:56] active-active
[13:48:07] that's bad, then
[13:54:32] <_joe_> wait, what?
[13:54:53] <_joe_> sessionstore is currently codfw only
[13:55:20] <_joe_> I didn't get the page, it seems
[13:56:35] (it was depooled in eqiad, no user impact, details at -ops, for those that only read this later without context)
[14:09:44] _joe_: what I meant is that it's an active-active service, and you're right, currently it's depooled in eqiad
[19:06:26] volans: I could use help on https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/701474 with how to fix "mccabe: MC0001 / MysqlLegacy.get_core_dbs is too complex (11)"
[19:50:46] legoktm: I can have a look at it in ~45m when I'm back home
[19:51:03] Ty, no rush though!
[20:40:52] legoktm: I left some comments that might in turn fix mccabe too (or not)
[20:41:04] if it doesn't, we can evaluate options
[20:45:01] from disabling it for that function to factoring something out. One possible candidate is a "validate_params" helper or similar that would take care of all the conditionals that raise exceptions if the parameters are not correct. And it might also make the logic of the method more fluid to read.
[20:45:08] but let's not get too far ahead
[20:52:39] volans: ty and gotcha
[21:04:49] volans: yep, that did the trick :D
[21:18:10] :)
[22:15:26] volans, rzl: did one of you two want to +2 it, or should I?
[22:16:07] feel free, just +2 is enough, if CI doesn't fail ofc
[22:16:19] it will merge and push the new doc
[22:16:58] I'll take care of a new release in the next few days (do we already have a date for the switchback and/or for the dry-run/reverse tests?)
[22:24:36] September 16 is the tentative date, I think it isn't confirmed yet
[22:24:44] for the switchback, that is
[22:28:48] week of Sept 13*, I'm going to send a last-call-type email to sre@ in a bit, we're still waiting for CommRel to thumbs-up the date
[22:29:30] ohh that's a long time, ok
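
(Editor's note on the parser-cache dashboard thread around [13:11:57]: rather than a panel that sums disk space across every pc* host, per-host numbers can be pulled straight from the Prometheus HTTP API. A minimal sketch, assuming a hypothetical Prometheus URL, mountpoint, and hostname pattern; node_filesystem_avail_bytes is the standard node_exporter metric name.)

    import requests

    # The URL, mountpoint, and hostname pattern below are illustrative
    # assumptions, not the actual production names.
    PROM = "http://prometheus.example.org/api/v1/query"
    expr = 'node_filesystem_avail_bytes{instance=~"pc.*", mountpoint="/srv"}'

    resp = requests.get(PROM, params={"query": expr}, timeout=30)
    resp.raise_for_status()
    for result in resp.json()["data"]["result"]:
        labels = result["metric"]
        _, value = result["value"]  # instant-vector sample: [unix_ts, "<str>"]
        print(f'{labels["instance"]}: {float(value) / 1e9:.1f} GB available')
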
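(Editor's note on the mccabe discussion at [20:45:01]: each `if ... raise` guard adds one to a function's cyclomatic complexity score, so hoisting the validation conditionals into a helper, as volans suggests, brings the main method back under the linter's threshold. A minimal sketch of the shape of such a refactor; the function names and checks are illustrative, not the actual spicerack code in the linked patch.)

    def _validate_params(datacenter, section, replication_role):
        """Raise ValueError for any invalid parameter combination.

        All the guard clauses live here, so their branches count against
        this helper's complexity score rather than get_core_dbs's.
        """
        if datacenter not in ("eqiad", "codfw"):
            raise ValueError(f"Unknown datacenter: {datacenter}")
        if section is not None and replication_role is None:
            raise ValueError("A section requires a replication_role")
        if replication_role not in (None, "master", "replica"):
            raise ValueError(f"Unknown replication role: {replication_role}")


    def get_core_dbs(datacenter, section=None, replication_role=None):
        """Select core DB hosts; parameter checks are delegated above."""
        _validate_params(datacenter, section, replication_role)
        # ... host-selection logic proceeds with far fewer branches ...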