[06:53:30] why did they name the initramfs tooling Dracut?
[06:58:33] a helpful and curious mind in #linux found the answer: https://dracut.wiki.kernel.org/
[06:58:55] apparently it's a town in Massachusetts
[07:39:39] <_joe_> Krinkle: that's one of the advantages of mw on k8s - we can decide the size of each pod and optimize it
[07:40:20] <_joe_> but overall I do expect some performance penalty from running on newer kernels / more cpu mitigations / more layers of indirection
[08:37:59] hi folks, on apt1001 we have some disk space issues
[08:38:15] jbond, ottomata - can you review your home dirs and drop what is not needed?
[08:39:40] there is also /srv/home-install1002.wikimedia.org that should be reviewed/dropped as well
[08:39:55] nothing super urgent but we should free space today if possible
[08:52:27] I've removed /srv/home-install1002.wikimedia.org, that was an old copy from a migration which is long over, and I also freed 5G in my home
[08:53:54] <_joe_> elukey: you won't have my precious files
[08:54:03] if we can empty /srv/wikimedia/incoming/ it would free up 1.8G
[08:54:18] from the doc, that directory is supposed to be empty most of the time
[08:54:24] <_joe_> we have 25 GB free, people
[08:54:25] thanks moritzm!
[08:55:07] <_joe_> but to elukey's point, of the 141G we have occupied there, only 85 are in /srv
[08:55:13] <_joe_> where the useful stuff resides
[08:55:34] <_joe_> heh, 51 GB in /home
[08:55:59] I know that Andrew was working on the new anaconda-wmf deb, which is big; I think the files are not needed anymore, so if rushed we can drop them in theory
[08:56:06] (but I'd prefer to wait for Andrew's confirmation)
[08:56:16] <_joe_> yeah, let's wait for jbond too
[08:57:22] <_joe_> those two are indeed the big offenders :)
[09:11:09] i have cut mine down to 175M
[09:41:06] Delete All The Things!
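The triage above (working out which home directories are eating the 141G) is typically done with `du` piped through `sort`. A minimal sketch, run against a throwaway directory rather than apt1001's real /home; the user names and file names are invented for illustration:

```shell
# Hypothetical sketch: rank directories by disk usage, largest first,
# the way one would triage /home and /srv on a host running low on space.
demo=$(mktemp -d)
mkdir -p "$demo/home/alice" "$demo/home/bob"
dd if=/dev/zero of="$demo/home/alice/old-build-artifacts.tar" bs=1024 count=2048 2>/dev/null
dd if=/dev/zero of="$demo/home/bob/notes.txt" bs=1024 count=16 2>/dev/null
# On a real host this would be something like: du -sk /home/* /srv/* | sort -rn | head
usage=$(du -sk "$demo"/home/* | sort -rn)
echo "$usage"
rm -rf "$demo"
```

With `-h` on both `du` and `sort` (`du -sh ... | sort -rh`) the sizes come out human-readable while still sorting correctly.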
[10:06:03] folks, I added some info about how to handle TLS certificate renewals for Kafka brokers (now that we have alarms) - https://wikitech.wikimedia.org/wiki/Kafka/Administration#Renew_TLS_certificate
[10:06:24] ideally, in the long term everything will be handled by puppet and the Kafka PKI intermediate
[10:06:34] but more tests are needed
[10:06:55] the new PKI is used by Kafka test, and soon we'll have the first certs to renew (in a couple of weeks, I think)
[10:07:30] if you see something not-clear/weird/etc., lemme know
[10:10:01] elukey: Excellent. Thanks for that. I didn't know that you could reload the certificates without a broker restart.
[10:11:47] btullis: I discovered it recently, it seems to work; the Software Heritage people did a similar thing :D
[10:12:26] <_joe_> elukey: did you test the procedure?
[10:14:43] _joe_ yes
[10:14:55] I also added a big WARNING as well
[10:15:25] The command seems to do what's needed, the Kafka logs confirm it, etc., but I haven't tested it for a real keystore swap
[10:18:41] (I'll do it in the next few days when the first warnings pop up)
[10:18:57] if we move all Kafka brokers to the new PKI we'll need some automation; the default expiry is 4 weeks
[10:22:19] 4 weeks seems quite a short period. Is it too late to extend the expiry period allowed by the Kafka intermediate CA?
[10:27:55] I think that it can be changed, shouldn't be a big problem...
[10:28:27] but if we manage to force puppet to reload the certs consistently it may not be needed
[10:28:54] it needs to be done very carefully, of course
[11:01:50] Can anyone advise where my cumin-fu is going wrong here, please?
`sudo cumin R:Class ~ "(?i)role::analytics_test_cluster::(client|coordinator)"`
[11:02:52] Trying to follow the multiple role selection example here: https://wikitech.wikimedia.org/wiki/Cumin#PuppetDB_host_selection but it responds with: `cumin: error: -m/--mode is required when there are multiple COMMANDS`
[11:03:47] btullis: sudo cumin 'R:Class ~ "(?i)role::analytics_test_cluster::(client|coordinator)"'
[11:03:50] works fine
[11:03:55] (quoting the query)
[11:04:35] as otherwise it would try to parse the args with "R:Class" as the query, and "~" and "(?i)role::analytics_test_cluster::(client|coordinator)" as commands to execute
[11:05:40] Ah, thanks volans. Makes perfect sense now.
[11:06:22] anytime
[11:13:28] out of curiosity, what does the "(?i)" do?
[11:14:05] > (?i) allows the query to be performed in case-insensitive mode (our implementation of PuppetDB uses PostgreSQL as a backend and the regex syntax is backend-dependent) without having to uppercase the first letter of each class path.
[11:14:06] I don't know what the context is, but isn't that related to case insensitivity?
[11:14:08] case-insensitive, otherwise you have to write it CamelCased as they are in Puppet
[11:15:03] ah, if it is postgres syntax that makes sense; that's where I have seen it before
[11:21:21] thanks
[11:25:17] actually it is rather standard regex syntax, so not only for postgres
[11:30:36] I added a small note to the Wikitech page about the need to quote multiple host selection queries. Seems obvious to me now, but it still tripped me up.
[14:54:07] marostegui: hello, got a minute?
[14:55:56] hauskatze: hey, what's up
[15:37:05] This is a very vague problem statement, but I have a host that is failing to boot into the Debian installer properly - on a reboot, it does a PXE boot successfully, and all IPMI commands succeed.
It loads initrd.gz and there's a long pause, the console goes blank, and then the host just reboots into the OS already installed
[15:37:20] any historical examples of similar issues with Dells?
[15:38:53] hnowlan: sounds familiar :D https://phabricator.wikimedia.org/T297422
[15:39:07] I got the same issue; I had to ask dcops to upgrade the NIC + BIOS
[15:39:09] and then it worked
[15:39:34] original task: https://phabricator.wikimedia.org/T296856
[15:39:40] elukey: ah, good/bad to hear :D
[15:40:04] the good part is that with an upgrade all worked fine :D
[15:41:48] hnowlan: are those old nodes?
[15:43:09] elukey: old-ish yeah, 2018-11 R440 (only hit one so far but I assume more will follow)
[15:44:55] hnowlan: that's the problem I mentioned two days ago on IRC; I'm running into the same with the Ganeti servers in eqiad, these will also need firmware updates
[15:45:34] moritzm: ahh I see - I saw your ticket linked in elukey's too :)
[15:48:43] on the bright side, for the 16 cases where we had seen that error so far, the firmware update resolved it reliably
[15:53:24] Sorry! This site is experiencing technical difficulties.
[15:53:25] Try waiting a few minutes and reloading.
[15:53:25] (Cannot access the database: Cannot access the database: Unknown database 'metawiki' (db1164) (db1164))
[15:55:45] I am trying to suppress an abusive account and keep getting this ^
[15:58:17] huh
[15:58:35] <_joe_> Amir1, marostegui ^^
[15:58:53] db1164 is s1?
[15:59:06] I'm at a meeting
[15:59:14] how large-scale is the issue?
[15:59:24] the stacktrace actually indicates that this looks like my work; I think it's mixing up db connections :/
[15:59:26] <_joe_> seems pretty serious, let me check
[15:59:26] (should I leave the meeting early?)
[15:59:34] can't suppress any global account
[15:59:48] <_joe_> I need to go into a meeting
[15:59:50] <_joe_> as well
[15:59:56] <_joe_> but this seems like a serious issue
[16:00:06] <_joe_> if code is suspected, let's rollback first, ask questions later
[16:00:08] I can take over soon; let me know if you have a patch ready for review and deployment
[16:01:06] I am not sure if this is happening only for this very account I'm trying to nuke, or for others as well
[16:01:14] let me test with another
[16:01:18] okay, the meeting is over, let me see
[16:01:40] https://logstash.wikimedia.org/goto/592ea2f850bcc277e8661ab5ba2dfbd2
[16:01:45] I think I found the issue
[16:01:46] the error has been happening at low rates for a week
[16:01:54] not sure a rollback would be helpful
[16:01:57] Yep, I can't suppress any global account
[16:02:09] or maybe not
[16:02:41] <_joe_> cdanis: same server?
[16:02:50] _joe_: no
[16:03:05] a good spread of hosts, and it looks like it has occurred on both .18 and .17
[16:03:47] db1175 now, with "MarcoAurelio (test 2)"
[16:04:00] shall I file a security task in the meanwhile for reference?
[16:04:22] yup
[16:05:17] Ack, creating
[16:05:25] 2 CA bugs I'm filing today
[16:05:27] nice
[16:06:19] an I needed _joe_, Amir1?
[16:06:22] am I
[16:06:44] marostegui: nah, I'm on it
[16:06:45] hauskatze: could you add me? I would like to see if my patch is related (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/723790).
[16:07:59] Amir1: https://phabricator.wikimedia.org/T299655
[16:08:15] zabe: if you're in NDA groups you should be able to see it?
[16:08:29] otherwise I guess it's fine, given that you have Logstash access iirc
[16:09:02] there is a "when added to LDAP wmf you automatically also get Phab WMF-NDA" rule nowadays
[16:09:03] I have an NDA, but I don't have sec issue access
[16:09:16] It's weird, because from cdanis's logs it affected both userrights and centralauth suppression
[16:09:18] but maybe not the same for "nda", not the automatic part
[16:09:28] feel free to reopen that ticket to ask for phab nda
[16:09:44] zabe: added
[16:09:53] thx
[16:15:29] Amir1: let me know if I need to come online, I'm 5 mins away from my laptop
[16:16:11] marostegui: nah, enjoy your evening for once
[16:16:30] xdddd
[16:17:57] * hauskatze Squawks 7700
[16:18:45] Amir1 and myself are investigating
[16:42:11] zabe: https://integration.wikimedia.org/ci/job/mediawiki-quibble-vendor-mysql-php73-docker/11481/console :(((
[16:42:45] sorry :(
[16:47:47] Amir1: would it be possible to only merge the revert into wmf.18 for now? I'm fairly certain the failure is only due to me not being able to write good tests.
[16:48:12] zabe: that is da plan
[16:48:32] ok, thx
[16:48:37] zabe: actually you wrote tests that are too good ("brittle tests")
[16:51:09] zabe: actually I think the tests breaking proves that my fix works :D
[16:52:01] I'm not even sure if it's possible to test that something correctly reads from a separate database :/
[16:52:43] yeah, I remember writing some tests for UserRightsProxy and having these issues
[16:52:46] apergos: mind if I rename the incoming dumps NFS servers (formerly labstore100x) to 'dumps100x'?
[16:52:57] um
[16:53:10] we have servers called "dumpsdata100x"
[16:53:17] so that might be a bit close
[16:53:40] dumpsdistrib(ution)100x?
[16:54:02] that's a mouthful
[16:54:08] apergos: we'd talked about 'clouddumps' but I'm not sure that the 'cloud' prefix is really anything there
[16:54:17] although it would make it easier for me to find them in icinga
[16:54:33] what do the rest of the labstore boxes get renamed to?
[16:54:48] is everything going to be cloud_x for some value of x?
[16:56:03] you know what
[16:56:26] datasets (I know, we used to have dataset100x)
[16:56:46] apergos: we're moving towards not really having other cloudstores. NFS is moving onto the cloud soon (I hope)
[16:56:51] because sooner or later we might stop calling all these datasets dumps and start calling them public community datasets or something like that
[16:57:05] ok, so datasets100x?
[16:57:13] Not just 'data100x'? :p
[16:57:13] let's see what robh says
[16:57:28] because that's very close to reusing an old name and he might not like it
[16:57:37] (but I do)
[16:58:05] are the other wmcs hosts cloudX or do their names vary?
[16:58:46] 100% cloudX
[16:58:49] oh
[16:58:51] um
[16:58:53] zabe: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/FlaggedRevs/+/755727 🥺
[16:58:57] clouddatasets
[16:59:00] But there's nothing fundamentally cloudish about these new hosts
[16:59:07] long, but not as long as dumpsdistributionthingie
[16:59:14] no, just that they belong to y'all
[16:59:29] well, the one provides nfs to wmcs instances, that's cloud-ish :-P
[16:59:36] true
[16:59:59] arturo was suggesting that these boxes not be cloud*, but I don't know if he had an official rationale beyond less to type
[17:00:00] so clouddumps or clouddatasets is fine
[17:00:06] imo
[17:01:18] dumpscloud, so it does not get selected by cloud* but also is not like dumps without cloud, and nobody is happy
[17:01:19] I like the notion of keeping the `cloud` keyword for servers that are a fundamental part of the core WMCS underlying infrastructure
[17:01:44] +1 to mutante's suggestion
[17:02:22] wmcsdumps (and that will guarantee that the team gets renamed and so does the service)
[17:02:39] anyway, you have my thoughts; feel free to poke me on the relevant task if you don't come to an agreement
[17:02:51] or need more of my esho
[17:02:57] thanks :)
[17:03:09] but at the end of the day, if you all use cumin aliases and not just glob/wildcard on DNS names, then you can customize it all as you like
[17:03:16] in the cumin aliases file
[17:03:51] at least one person (and maybe more in the future) will have access to the dumps-related servers without having cumin access, just fyi
[17:03:58] now I really am going to wander off
[17:04:03] apergos: uhh, talking hostnames? i don't have a horse in this race as long as you list it on the wikitech naming standards page
[17:04:11] and they need to be short enough to fit on a label
[17:04:40] I stopped having a horse in the hostname race in 2018.
[17:04:46] ;D
[17:04:49] there you have it folks, when you settle on a color for the shed lemme know
[17:04:59] the main thing being the wikitech page gets updated
[17:05:20] https://wikitech.wikimedia.org/wiki/SRE/Infrastructure_naming_conventions
[17:06:09] datasetsX will fit better on a label from a purely character perspective
[17:06:27] granted, a label is NOT the deciding factor, but it is an easy guideline
[17:06:47] the deciding factor is who argues loudest in the sre meeting when it's brought up that someone hates the name you picked ;D
[17:07:26] one could argue that
[17:07:44] nope, I'm not even. I'm really actually going to do mindless twitter surfing or something
[17:10:40] the servers need a little e-ink display instead of an LCD, then they could display their "physical label" hostname on it, so it stays even when powered off.
/me hides
[17:16:36] the lcd used to do that, but we had to program it manually, and now they don't have lcds
[17:30:39] thanks taavi and Amir1 :)
[17:30:58] happy to be of service
[17:36:45] I guess I'll sit down now and fix the tests
[17:37:02] lmk if I can help
[18:49:28] /nick robh
[18:49:44] fail
[19:31:08] since just now, the old Bugzilla tickets are now a kubernetes service. eh, I mean "the archive of all the old tickets is". https://static-bugzilla.wikimedia.org/
[19:31:32] just switched from ganeti to k8s
[19:38:38] good ol' bugzy
[19:42:45] yep, living in the future to stick around forever
[20:58:17] !log rebooting mx1001 to test new kernel
[20:58:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
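A footnote on the `(?i)` question from earlier in the log: as noted there, the inline flag is standard regex syntax (PCRE-style), not PostgreSQL-specific. A quick demonstration with GNU grep's PCRE mode (`-P`, assumed available, as on any Debian host); the class name is the one from the cumin query above:

```shell
# (?i) switches the pattern to case-insensitive matching, so the
# CamelCased Puppet class name still matches an all-lowercase pattern.
# grep -c prints the number of matching lines: here, 1.
printf 'Role::Analytics_Test_Cluster::Client\n' | grep -Pc '(?i)role::analytics_test_cluster'
```

Without the flag (or `grep -i`), the same lowercase pattern would match zero lines.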