[00:17:13] hi all, wanted to point out that puppetdb is acting a bit delicately at the moment, more information: https://phabricator.wikimedia.org/T263578#7246024
[00:17:57] this has now caused issues two days in a row, so I would encourage all SREs to read the ticket and ask for more information if needed
[00:18:58] further input into the main task and all linked tasks most welcome
[06:08:02] <_joe_> jbond: ouch, when I checked yesterday things seemed stable
[08:23:30] hi, so I have an infra question: do we have any kind of cross-DC shared storage that can be attached to a host / ganeti vm?
[08:23:54] and/or is there a distributed shared storage available in our kubernetes?
[08:24:27] the context is that I might want to move doc.wikimedia.org static assets to shared storage, which would make switching over from one DC to the other easier (no more having to rsync tons of things between hosts)
[08:38:26] <_joe_> hashar: oh, easy answer: sort-of, and no, respectively
[08:38:48] <_joe_> hashar: one way to do it is to move the storage part to swift
[08:39:04] <_joe_> and write to both datacenters when you generate your stuff
[08:39:29] <_joe_> but when you talk about static assets, do you mean docs that get generated?
[08:39:53] _joe_: yes, the generated documentation and the coverage reports hosted there
[08:39:57] <_joe_> do you have an outline of the current process?
[08:40:17] <_joe_> another q: how much data are we talking about?
[08:40:41] an example: a job polls mediawiki/core, and when it notices a change it generates the doc on a CI WMCS instance and the output is rsynced to doc1001.eqiad.wmnet (a ganeti vm)
[08:40:53] <_joe_> ok
[08:41:02] if I want to switch the service over to a hypothetical doc2001.codfw.wmnet, I've got to rsync all the data
[08:41:04] <_joe_> ugh
[08:41:10] <_joe_> yeah gotcha
[08:41:16] <_joe_> so how much data is it?
[08:41:23] I have the same issue with Jenkins build artifacts that are solely on contint2001 (the primary)
[08:41:50] for doc it is probably < 100G; for Jenkins it varies, but roughly 200GB (we discard artifacts after 7 or 15 days)
[08:41:54] <_joe_> yeah we won't provide NFS, either in a single DC or (god forbid) across DCs
[08:42:04] <_joe_> ok so nothing that can be built into a docker image
[08:42:42] my idea would be to have the doc service (aka running apache) on doc1001 / doc2001
[08:42:49] have the shared storage mounted on those hosts
[08:43:03] so when I publish the material to the shared storage it is available to both hosts
[08:43:32] then it is easy to switch over. Or if one host fails, we have the other one with all the content.
[08:43:39] or we could even load balance between the two
[08:44:10] for Jenkins it is similar, though ideally I would have two active/active ones
[08:44:37] anyway, looks like pushing to Swift and attaching the Swift container to the host would be the solution
[08:45:14] <_joe_> yeah shared storage with a local filesystem interface is the worst of leaky abstractions and I consider it an antipattern generally
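As a rough sketch of what _joe_ suggests above (generate the docs, then write them to Swift in both datacenters instead of rsyncing to a single host), something like the following boto3 snippet against Swift's S3-compatible API could replace the rsync step. The endpoint URLs, bucket name, and credential environment variables are hypothetical placeholders, not real service names:

    # Minimal sketch: publish a generated doc tree to Swift in both DCs via
    # the S3-compatible API, instead of rsyncing to one host. The endpoints,
    # bucket name, and credential env vars are hypothetical placeholders.
    import os
    import boto3

    ENDPOINTS = [
        "https://swift.eqiad.example.wmnet",  # hypothetical per-DC endpoints
        "https://swift.codfw.example.wmnet",
    ]
    BUCKET = "doc-wikimedia-org"  # hypothetical container name

    def publish(build_dir: str) -> None:
        """Upload every file under build_dir to the doc container in each DC."""
        for endpoint in ENDPOINTS:
            s3 = boto3.client(
                "s3",
                endpoint_url=endpoint,
                aws_access_key_id=os.environ["SWIFT_S3_KEY"],
                aws_secret_access_key=os.environ["SWIFT_S3_SECRET"],
            )
            for root, _dirs, files in os.walk(build_dir):
                for name in files:
                    path = os.path.join(root, name)
                    key = os.path.relpath(path, build_dir)
                    s3.upload_file(path, BUCKET, key)

    if __name__ == "__main__":
        publish("build/doc")

Writing to both DCs at publish time, rather than replicating after the fact, is what would make a switchover (or active/active load balancing) a no-op for the storage layer.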
[08:45:44] or is there a way to have Apache serve files directly from Swift (and avoid the filesystem interface)?
[08:45:59] <_joe_> hashar: not sure any webserver has that option directly
[08:46:40] so request > apache > local fs mount > Swift
[08:46:46] no idea whether that is even possible
[08:48:14] there is some doc about it at https://docs.openstack.org/swift/latest/apache_deployment_guide.html
[08:49:51] looks like we have never used that kind of system, or rather, shared storage is not an off-the-shelf service :-]
[08:54:48] <_joe_> hashar: why a local fs mount? that's not how swift is supposed to be used.
[08:57:42] _joe_: I have no idea man!
[08:58:32] I barely know anything about distributed storage; I guess I am looking for a solution we might already have and that I could mimic
[08:58:56] <_joe_> we don't :P
[09:02:29] yeah, you'd be proxying http from apache or similar to swift for reads
[09:02:48] for writes any s3 client will do
[09:03:12] so to sum up:
[09:03:26] I'm hoping to get MOSS set up in the next couple of months or so, you can use that: https://phabricator.wikimedia.org/T279621
[09:04:11] for distributed storage I should go with Swift. There is no previous case of serving files from Swift via Apache. We don't have Swift with Kubernetes (yet, I guess)
[09:05:04] godog: so that MOSS thing would be some kind of storage as a service?
[09:05:27] and I imagine at some point, if I want to add that to an app, I would "just" have to apply a puppet class to have it available?
[09:05:41] object storage, but yes, basically swift for all misc usage
[09:05:59] that sounds excellent
[09:06:06] I have created https://phabricator.wikimedia.org/T287740, if someone knows who to tag/subscribe, that'd be great
[09:06:07] in a perfect world that'd be what's needed, in reality it won't be so simple
[09:07:34] marostegui: the stack trace reveals it's ProofreadPage, so tagged that
[09:07:43] majavah: thanks a lot
[09:07:53] godog: I will check with Mukunda, he might have a similar need for Phabricator assets
[09:08:01] I have subscribed to the task meanwhile
[09:08:49] hashar: to your previous points, afaik there is no prior art on reverse proxying from apache to swift, no. And this is orthogonal to k8s: swift is a service like any other whether you are running in k8s or not; swift itself doesn't run within k8s, but it is accessible from it
[09:08:53] hashar: sure, SGTM
[09:09:42] hashar: out of curiosity, could you point me to a repo or a task where I could read more about the current doc.wikimedia.org processes? I'm just interested.
[09:09:51] godog: I mentioned k8s because we might want to move doc.wikimedia.org from a ganeti vm to k8s
[09:10:31] btullis: good morning and pleased to "meet" you ;]
[09:10:52] btullis: there is some documentation at https://wikitech.wikimedia.org/wiki/Doc.wikimedia.org
[09:11:48] Likewise. Thanks for that. Looking forward to working with you.
[09:12:17] btullis: in short, we have CI rsync the generated doc to a machine, and a home-made PHP app generates the boilerplate pages (examples: parsing coverage results to build https://doc.wikimedia.org/cover/ ; a file browser: https://doc.wikimedia.org/mediawiki-core/ )
[09:12:46] all the content is just flat files on a single VM, which is not ideal
[09:13:07] that said, we can afford to lose the content since we can probably regenerate most of it (though that would be painful)
[09:13:53] I will check with others from releng
[09:18:35] performance/arc-lamp seems to match. Thank you btullis and godog
[09:19:00] Nice. Thanks for filling me in. Looks to me like a great candidate for both k8s (for the reverse proxy component) and swift.
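On the read side of godog's summary above (proxy HTTP from Apache or similar to Swift for reads; any S3 client for writes), a minimal sketch of what such a proxy could look like, here as a small Flask app that Apache or a k8s ingress would sit in front of. The Swift endpoint URL and container path are hypothetical:

    # Minimal sketch of the read path: a tiny HTTP service that proxies GETs
    # to a Swift container, with Apache (or a k8s ingress) in front of it.
    # The endpoint URL and container path are hypothetical placeholders.
    import requests
    from flask import Flask, Response, abort

    app = Flask(__name__)

    # Hypothetical public (read-only) Swift endpoint and container.
    SWIFT_BASE = "https://swift.example.wmnet/v1/AUTH_docs/doc-wikimedia-org"

    @app.route("/<path:key>")
    def serve(key: str) -> Response:
        # Stream the object straight through rather than buffering it in memory.
        upstream = requests.get(f"{SWIFT_BASE}/{key}", stream=True, timeout=10)
        if upstream.status_code == 404:
            abort(404)
        return Response(
            upstream.iter_content(chunk_size=64 * 1024),
            status=upstream.status_code,
            content_type=upstream.headers.get("Content-Type",
                                              "application/octet-stream"),
        )

    if __name__ == "__main__":
        app.run(port=8080)

This keeps the webserver stateless: no local filesystem mount, and an instance in either DC can serve the same content.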
[09:19:40] and one less piece of legacy infra to maintain
[09:20:06] hashar: doh! of course, thanks for pointing that out
[09:20:18] yeah, I am reading https://phabricator.wikimedia.org/T244776
[09:20:50] looks like Dave Pifke (who maintains the infra for the performance team) did all the exploration
[09:21:01] so I can probably follow that trail
[09:21:32] anyway, I had all the answers I was seeking :]
[10:26:06] _joe_: re puppetdb, it seems to be triggered by postgres maintenance
[10:28:14] <_joe_> heh, still doesn't explain the codfw/eqiad difference in perf degradation
[10:40:23] You will see a warning on backup alerting saying "No backups: 6 (dbprov1001, ...)"
[10:40:56] That is expected: some backups have been renamed, so the new backup jobs have 0 past backups
[10:41:19] but old backups are still available for recovery, no issue there
[10:45:42] _joe_: no it doesn't, but we see slow queries in the postgres log matching the same slow processing times in the puppetdb log, so it does seem like the main culprit is postgres, with likely some other issue which makes codfw react worse to the problem
[10:46:39] <_joe_> ack
[14:08:13] godog, cwhite: thanks a lot for the reviews on the icinga exporter. I sent a few more patches with a 'first step' kind of proposal for the file, let me know what you think (on monday/whenever you have time)
[14:35:38] dcaro: neat, thank you! I'll take a look
[15:32:01] when loading netbox pages, sometimes I get "unable to connect" and then it's gone after the next reload, almost like it has 2 backends but one of them isn't working
[15:34:27] netbox.wikimedia.org points to netbox1001.wm.org... maybe IPv4 vs IPv6 issues?
[15:34:53] I've been having that issue for the longest time; volans has looked into it at some point
[15:34:57] but so far it has always been just me
[15:36:11] aha! thanks, so not just me. I have only been noticing it the last 2 weeks or so
[15:36:25] since I'm in Europe, I think
[15:37:18] could also depend on which wifi I am using
[15:48:37] I'd expect happy eyeballs to mask any IPv4 vs IPv6 issues.
[15:49:12] Have not had this issue myself, but will do a few checks when I get a minute to see if I can find anything.
[15:49:29] thanks topranks, not urgent
[15:50:43] If it happens regularly enough that you could grab a pcap of it happening, that would definitely help. But no worries, I will have a quick look anyway.
[15:56:22] of course now it stopped happening, heh
[15:59:17] you took the car to the mechanic
[15:59:32] and of course it now works like a rolls-royce just out of the factory
[15:59:49] haha, yes, though the _actual_ car mechanic just mailed me about all these things they had to fix :)
[16:00:03] oh ofc, that's when you do a routine check
[16:02:43] :) yeah, exactly: 600 dollars for more or less replacing spark plugs, which wasn't approved
[16:03:22] mutante: I'm afraid you're gonna need a new motherboard there
[16:03:28] Probably need to swap out all the fans too.
[16:03:37] looking at a few grand minimum :)
[16:04:33] topranks: that's an honest computer technician
[16:04:34] topranks: lol, yeah, I bought the car for 2,300 :) also, I just emailed you a pcap file with 6024 packets; there is one occurrence of the issue near the end
[16:04:50] ah great, thanks :)
[16:05:10] one shop asked my brother-in-law 300 euros for *changing a ram bank*
[16:05:19] in a desktop computer
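For the netbox "unable to connect" theory above, one quick client-side check is to resolve the name over IPv4 and IPv6 separately and attempt a TCP connect to each address, so that a broken address family which happy eyeballs would otherwise paper over becomes visible. A sketch, not existing tooling:

    # Resolve netbox over each address family separately and try a TCP
    # connect, to spot a failing family that happy eyeballs would mask.
    import socket

    HOST, PORT = "netbox.wikimedia.org", 443

    for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
        try:
            infos = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)
        except socket.gaierror as exc:
            print(f"{label}: DNS resolution failed: {exc}")
            continue
        for *_, sockaddr in infos:
            try:
                with socket.socket(family, socket.SOCK_STREAM) as sock:
                    sock.settimeout(5)
                    sock.connect(sockaddr)
                print(f"{label} {sockaddr[0]}: connect ok")
            except OSError as exc:
                print(f"{label} {sockaddr[0]}: connect failed: {exc}")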
[16:05:39] Good day, could I get added to .users in pw? I believe I've added my key correctly following https://office.wikimedia.org/wiki/Pwstore
[16:11:49] mdipietro: PMing you about the procedure
[16:11:58] Thanks
[16:12:01] mdipietro: hi, and welcome! And yes, sadly that's often a bit cumbersome
[16:32:00] welcome aboard mdipietro. so, vi or emacs?
[16:32:59] hashar asking the serious questions in week 1 :)
[16:35:04] 4 keys not found, >1 invalid, like every time, but we'll get it fixed eventually :p
[16:39:48] hashar: that is an intense question, but I'll fess up: vi, to the extent that my bashrc has "set -o vi" :p
[16:40:21] vim
[16:44:31] we talked about the pwstore procedure and I pinged some people with expired keys etc.; going afk for now
[16:50:29] mdipietro: :-]]]
[16:50:56] :)
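For the recurring "keys not found / invalid" pwstore errors mentioned above, one way to narrow things down is to check each expected recipient key against the local keyring for presence and expiry. A sketch with a hypothetical fingerprint list (the real list lives in pwstore's .users file):

    # Check which expected recipient keys are missing from the local GPG
    # keyring, or expired/revoked. The fingerprint list is a hypothetical
    # placeholder for the entries in pwstore's .users file.
    import subprocess

    FINGERPRINTS = [
        "0123456789ABCDEF0123456789ABCDEF01234567",  # hypothetical entry
    ]

    for fpr in FINGERPRINTS:
        proc = subprocess.run(
            ["gpg", "--with-colons", "--list-keys", fpr],
            capture_output=True, text=True,
        )
        if proc.returncode != 0:
            print(f"{fpr}: not found in keyring")
            continue
        for line in proc.stdout.splitlines():
            fields = line.split(":")
            # 'pub' records carry validity in field 2: 'e' expired, 'r' revoked.
            if fields[0] == "pub" and fields[1] in ("e", "r"):
                state = "expired" if fields[1] == "e" else "revoked"
                print(f"{fpr}: key is {state}")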