[00:04:49] 🦇 ☎️
[00:06:15] 🏏 📞 🤔
[00:08:02] 👏
[00:08:15] 🎬 🤔
[00:16:48] urandom: btw if you want to add yourself to the config so sirenbot puts the right nick in the topic, it's under /srv/private/hieradata/role/common/alerting_host.yaml on puppetmaster1001
[00:22:27] Cool
[01:35:08] Is there a process besides submitting a puppet.git patch for changing one's ssh key? (/me checks https://wikitech.wikimedia.org/wiki/SRE/Production_access#Generating_your_SSH_key )
[01:35:18] this is for Peter in my team - ref https://gerrit.wikimedia.org/r/c/operations/puppet/+/857529
[01:36:56] denisse|m: soft ping, seems like maybe something for clinic duty? not urgent :)
[01:38:51] Krinkle: usually a patch + an out of band confirmation is what most of us do
[01:39:24] peter is CET, who should I tell him to PM tomorrow?
[01:39:30] if you can send me the same via email, I am happy to merge this right now
[01:39:48] ack
[01:45:02] Krinkle: merged
[01:46:35] thx <3
[08:01:14] Hello,
[08:01:14] I'll be patching turnilo on an-tool1007.eqiad.wmnet to fix the introspection bug we had after the upgrade as per https://phabricator.wikimedia.org/T308778 .
[08:01:14] Please let me know if the time on Wed, Nov 23 2022 between 09:30UTC and 10:30UTC will inconvenience you and I can push back the time.
[08:46:48] steve_munene: thanks for the heads up! I think it's fine (at least for me) as long as there will not be any ongoing issue by that time. So my suggestion would be to keep an eye on -operations before starting to check if there is any emergency ongoing that might need turnilo for the troubleshooting.
[08:49:31] Thanks I'll keep that in mind.
[09:25:42] hello folks
[09:25:54] I am adding an alert for SRE related to webrequest-sampled-live: https://gerrit.wikimedia.org/r/c/operations/alerts/+/859502/
[09:26:09] if Druid Analytics doesn't index events for some time we get an alert
[09:26:34] since in that case the fault is most probably Benthos-related, it didn't make sense to target the DE folks :)
[09:27:09] (also there is a runbook with some high level explanation about the pipeline, and where to look)
[09:45:54] reminder: going to reboot cumin1001 in 15 minutes
[09:55:42] ack
[10:06:59] cumin1001 is back up (only the homer keyholder still needs to be armed)
[10:23:27] <_joe_> elukey: it won't be affected by turnilo's state?
[10:23:54] <_joe_> also I'd have doubts about ownership of the whole pipeline
[10:24:08] <_joe_> but well, above our pay grade I guess
[10:25:51] _joe_ nono it is only on the Druid side, if turnilo is down it will not fire. We can move it in any place, I just wanted to have an alert in place :D (with our luck it will be down the first time that we need it)
[10:26:26] elukey: godog: Can we target multiple teams in an alertmanager configuration?
[10:27:26] btullis: not ATM, the easiest would be to create a new team I think and then notify two teams in the AM config accordingly
[10:27:27] The turnilo upgrade is all finished by the way, thanks to steve_munene
[10:28:04] godog: Ack, thanks.
[10:47:54] It might be nice to have a Wikitech page on Benthos itself, as well as the mention in the runbook. But as _joe_ says, who owns it at this stage :-) I can try to write something.
[10:53:58] yeah good call on a wikitech page btullis, I'll get one started after lunch
[10:54:27] vgutierrez: thanks a lot for the coreutils sha256sum, I wasn't aware of the issue!
[10:54:33] *tip
[10:54:59] this will be useful not only for database backups, but also for mediabackups, which heavily uses sha256
[11:00:15] tbh I'm surprised that sha256 is faster than md5
[11:00:43] staring at https://en.wikipedia.org/wiki/Secure_Hash_Algorithms it shouldn't be the case
[11:02:11] hmm wait
[11:02:15] jynus: https://en.wikipedia.org/wiki/Intel_SHA_extensions that could explain it
[11:02:29] take into account that I ran the tests in cp5020, one of the new hosts shipping Ice Lake CPUs
[11:02:50] https://www.intel.com/content/www/us/en/products/sku/215271/intel-xeon-gold-5318y-processor-36m-cache-2-10-ghz/specifications.html
[11:03:53] I was doing the test on my local machine (AMD)
[11:06:29] let me redo it on a db test machine
[11:09:00] db2102:~$ cat /proc/cpuinfo | grep sha -> ❌
[11:09:22] so I will only get the speedup on a recent cpu, I guess
[11:14:45] vgutierrez@cp5020:~$ grep -m 1 sha /proc/cpuinfo
[11:14:45] flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm
[11:14:45] abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single intel_ppin ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp
[11:14:45] hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
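(editor's sketch, not part of the channel log) Grepping /proc/cpuinfo for the substring "sha" can also match other flags, which is the mix-up acknowledged later in this log. A minimal Python sketch of an exact-token check follows. The function names `cpu_flags`, `has_flag` and `host_has_flag` are hypothetical helpers made up for illustration, and the code assumes the Linux x86 layout in which flags are listed on a line whose key is `flags`.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only accept a line whose key is exactly "flags" (assumed x86 layout).
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_flag(cpuinfo_text, flag="sha_ni"):
    """Exact token match, so a substring such as 'sha' never matches on its own."""
    return flag in cpu_flags(cpuinfo_text)


def host_has_flag(flag="sha_ni", path="/proc/cpuinfo"):
    """Convenience wrapper that reads the local cpuinfo file (Linux only)."""
    with open(path) as handle:
        return has_flag(handle.read(), flag)
```

On a Linux host, `host_has_flag("sha_ni")` would give a per-host answer. For a fleet-wide view, the Puppet `cpu_flags` fact mentioned further down in the log is probably the better source.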
[11:14:57] jynus: a cat died after that cat | grep
[11:15:18] sha_ni is the one
[11:15:22] indeed
[11:15:43] this should inform our future hw purchases
[11:16:17] for hosts where sha256 encryption performance is critical
[11:16:34] s/encryption/hashing/
[11:16:46] yes, sorry, I mistyped, I know the difference
[11:21:41] number are quite interesting on a non-sha_ni host
[11:21:44] *numbers
[11:21:52] will paste on the resolved ticket
[11:23:57] this makes me wonder if we should prefer SHA256 ciphersuites for TLS termination rather than SHA384 ones
[11:24:08] vgutierrez@cp5020:~$ openssl ciphers -tls1_3 -s
[11:24:08] TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_128_GCM_SHA256
[11:26:01] the answer will be, it depends: https://phabricator.wikimedia.org/T323485#8416789
[11:26:35] I gather 2 conclusions: openssl is equal or better than coreutils in most cases
[11:26:49] and if there is hw acceleration, it makes a huge difference
[11:28:24] maybe a 3rd: test performance on actual hw, otherwise one may make the wrong decision about preferred options/algorithms
[11:28:26] jynus: I guess that you can check for sha_ni in the used hosts
[11:28:39] s/used/targeted/
[11:29:07] a quick cumin check shows that 56% of our fleet has sha_ni extensions
[11:29:19] it's also in the cpu_flags fact btw
[11:29:22] yes, but 5 seconds -> 9 seconds is an easier pill to swallow than 5 -> 17
[11:29:37] volans: it should
[11:29:54] vgutierrez: I may be open to not even query, just use openssl implementation everywhere
[11:31:04] I will write a task to stop discussing on the resolved task
[11:31:08] jynus: I'm assuming that even without sha_ni, openssl has some optimizations that coreutils is missing
[11:31:17] yeah
[11:31:53] maybe it is purely cpu-basic instruction based, and there are other matrix extensions that speed it up a little bit
[11:32:22] interesting..
98% of our cp hosts got sha_ni already
[11:32:43] hmmm grepping for sha isn't the way to go apparently :)
[11:32:53] this started as a backup improvement, but may have impact on traffic, and even mediawiki
[11:33:07] as vola.ns said you can check for this in puppet with `'sha_ni' in $facts['cpu_flags']` e.g. sudo puppet apply -e "notice('sha_ni' in \$facts['cpu_flags'])"
[11:33:09] volans mentioned I think it was on puppet
[11:33:10] sha_ni shows 24% VS 76%
[11:33:35] jbond: that should be queryable on puppetboard?
[11:33:59] vgutierrez: possibly
[11:34:22] I think it's not imported there
[11:34:30] that's cpu_details['flags']
[11:34:53] it's in profile::puppetdb::facts_blacklist
[11:35:16] ahh so it never makes it to puppetdb. we can probably remove that if it's useful to have
[11:35:19] because it's a big one
[11:35:26] but yeah can be revisited
[11:35:28] the value is fairly static so shouldn't cause issues with puppetdb
[11:36:06] i think we removed it when we were being much more conservative with puppetdb but the real problem was relationships so we can be less conservative with facts, especially ones that are pretty static
[11:36:20] * jbond drafts patch
[11:37:16] jynus: but yeah.. openssl dgst >> sha256sum even without sha_ni in place
[11:37:56] * jbond https://gerrit.wikimedia.org/r/c/operations/puppet/+/860006
[11:38:10] but can you see when there was almost a -66% performance degradation, that was not seen as feasible?
[11:38:31] this, however is much smaller or even an increase in performance!
[11:38:49] jynus: performance shouldn't be the only factor IMHO
[11:39:09] and the focus was not on tampering within-dc, but corruption
[11:40:25] performance shouldn't be the only factor, however sometimes performance prevents things from happening
[11:40:39] using a cryptographic hash for error detection is...
interesting :)
[11:41:07] for example, I believe the future for databases is encryption at rest
[11:42:22] (that doesn't mean tls should be abandoned, I just think it is a more important issue)
[11:44:23] vgutierrez: and now with jbond's patch you'll be able to see them with 'F:cpu_flags ~ "sha_ni"'
[11:44:29] vgutierrez: i have removed cpu_flags from the black list so in 30 mins you should be able to get an idea of what has it
[11:44:34] jbond: <3
[11:44:38] (although it's an undocumented feature of puppetdb querying ;) )
[11:46:32] jynus: I don't see the relation between encryption at rest and TLS TBH
[11:46:52] and why those should be mutually exclusive
[11:47:06] they are not
[11:48:21] what is scarce is our ability to work on everything at the same time :-D
[12:19:43] Quick question: Teams outside of SRE, but with embedded SREs (in no particular order) Data Engineering, Machine Learning, Search, Fundraising - Have I missed any?
[12:22:14] WMCS
[12:23:03] taavi, of course! Thanks. How could I forget? Any others?
[12:24:37] Platform
[12:25:01] <3 Thanks hnowlan.
[12:28:16] btullis: depending on exactly what question you are answering fundraising tech may not be considered embedded sre as they mostly forked their infrastructure and afaik don't have much shared code or infrastructure with sre production anymore
[12:29:15] however i think they all/most still have root so like i said depends on the question :)
[12:33:03] Thanks all. For reference, I'm tinkering with a draft of the top level SRE page: https://wikitech.wikimedia.org/wiki/User:Btullis/SRE - adding links to teams with embedded SREs to see if it makes sense.
[12:33:59] Bridge-building, inclusion kind of thing. Nothing really technical.
[12:52:09] frack SREs by default don't have root within main prod, there's basically just one account which predates that and still has root within prod
[12:53:08] so I'd in fact rather omit it to reduce confusion
[12:59:39] ack thanks
[13:00:40] btullis: may want to speak with jobo i think they have been working on some stuff around collaboration, bridge building team apis etc, could be some crossover
[13:19:59] Thanks all. Will do.
[13:34:08] btullis: that page is a bit hard to read, because you need to click to open all of the boxes for a relatively short text
[14:04:46] vgutierrez: I might have misread your earlier percentaces, but AFAICT only 27 hosts have the sha_ni flag.
[14:04:51] *percentages
[14:04:52] taavi: yes I agree, many thanks. I'll keep tinkering a bit with the draft.
[14:06:15] volans: my bad... gripping by sha matched another unrelated CPU flag
[14:06:22] *grepping
[14:06:28] ack
[14:48:38] btullis: I started the wikitech article here https://wikitech.wikimedia.org/wiki/Benthos
[15:22:40] <_joe_> who left some uncommitted changes in /srv/private
[15:22:46] <_joe_> but did add them to git?
[15:22:58] <_joe_> I just committed the new kafka certs
[15:23:06] <_joe_> btullis ? ottomata ?
[15:23:07] I'm doing it now with Steve
[15:23:16] https://phabricator.wikimedia.org/T323697
[15:23:19] <_joe_> btullis: the certs are committed
[15:23:25] OK, thanks.
[15:23:37] <_joe_> sorry I was modifying a single file there and it was a quick fix
[15:23:49] <_joe_> so I didn't check for other stuff in git before committing
[15:24:02] No worries. :-)
[15:24:08] <_joe_> I just wanted you to be aware
[15:24:17] <_joe_> because it could cause issues otherwise
[15:24:58] Yes, I was in the middle of a well-formed commit message with Bug: number etc. Explaining how the private repo works. Not a problem anyway.
[15:25:14] <_joe_> lol sorry btullis steve_munene :P
[15:25:43] <_joe_> I mean we can fix it, but maybe it's not worth it
[15:26:32] No, it's fine.
In fact it was a useful way of showing how we deal with concurrent edits by talking about it here. No need to rewrite the git history.
[15:40:36] We're about to roll-restart jumbo-eqiad now to pick up the new certificates.
[19:32:04] urandom: herron: moving over lvs4007 to lvs4010. no issues expected, but just in case (I will fix it if I break it :)
[19:32:18] sukhe: okie dokie
[20:06:34] herron: all done. there shouldn't be any alerts now and I will keep an eye out but please ping me if you see something and I miss
[20:06:50] ed
[20:07:21] kk thanks will do (hopefully won't need to)
[20:07:31] yeah! these are tricky and first time I am doing
[20:07:34] so hopefully
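
(editor's sketch, not part of the channel log) One conclusion from the sha256 vs md5 thread earlier in this log was to test hashing performance on the actual hardware. A rough way to do that in Python is sketched below. `throughput_mb_s` is a hypothetical helper, the buffer sizes are arbitrary, and because CPython's hashlib is usually backed by OpenSSL when available, the numbers would be closer to `openssl dgst` than to coreutils `sha256sum`. This is not the method used in the Phabricator task.

```python
import hashlib
import time


def throughput_mb_s(algorithm, size_mb=64, chunk_kb=64):
    """Hash size_mb MiB of zero bytes and return an approximate MiB/s figure."""
    chunk = b"\0" * (chunk_kb * 1024)
    n_chunks = (size_mb * 1024) // chunk_kb
    digest = hashlib.new(algorithm)  # e.g. "md5", "sha256", "sha384"
    start = time.perf_counter()
    for _ in range(n_chunks):
        digest.update(chunk)
    digest.hexdigest()  # finalise so the whole computation is timed
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


def compare(algorithms=("md5", "sha256", "sha384"), size_mb=64):
    """Return {algorithm: MiB/s} for a quick side-by-side look on this host."""
    return {name: throughput_mb_s(name, size_mb) for name in algorithms}
```

Running `compare()` on one host with sha_ni and one without should show roughly the kind of gap discussed above, though absolute numbers depend on the CPU, the OpenSSL build and system load.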