[09:51:44] relocating
[09:59:07] lunch
[11:46:01] Lunch
[14:51:32] \o
[15:00:22] hmm, wcqs1001 and 2001 are happily importing. [12]00[23] are all returning service unavailable from jetty :S
[15:00:42] well, i guess we only really need one per dc anyways :P
[15:01:40] o/
[15:03:24] looks like I've lost my home on mwmaint machines :/
[15:05:36] ebernhardson: I think it's fine to import a small portion of the whole dataset if that will help with further testing of the config
[15:07:11] I mean I don't think it's important to have the full dataset there yet since it's likely it'll have to be fully reloaded before going live
[15:09:33] dcausse: codfw?
[15:09:39] RhinosF1: both
[15:09:40] It got reimaged after the switchover
[15:09:53] Eqiad got reimaged during switchover
[15:09:59] dcausse: do you need it?
[15:11:30] RhinosF1: hm... looking if I can find what I need elsewhere
[15:23:16] ebernhardson: I wonder where this service unavailable comes from, do you have any guesses?
[15:35:26] zpapierski: not sure yet, in a meeting, will look in a sec
[15:43:15] doesn't really say. wcqs-blazegraph.log has nothing useful. I only have 2k lines of scrollback and it all says 503 service unavailable
[15:44:07] i don't know what they should look like, but there are 1367 munge files of 20-40M each, seems plausible
[15:44:38] yep, I'd expect something like that
[15:48:53] http://localhost:9999/bigdata/ (when forwarded) also 503's :S
[15:49:12] on wcqs1002 at least... blazegraph is a pain :P
[15:49:25] it should at least be writing things to the error log
[15:51:14] on wcqs I only see logs up to Sep 15 - is /var/log/query_service/wcqs-blazegraph.log the correct path?
[15:51:26] wcqs1002 I mean
[15:51:45] zpapierski: should be, you can `sudo fuser /path/to/file` to see if blazegraph has the file open
[15:52:01] (it does)
[15:52:25] 15 sounds about right, when they first came up
[15:52:58] wcqs-blazegraph.service: Changing to the requested working directory failed: No such file or directory
[15:53:16] that sounds incorrect :)
[15:53:22] oh sorry that's old
[15:53:32] Sep 15 is the last modification date, I mean
[15:53:52] oh, duh... david reminded me the other day blazegraph complains to journalctl and not its own logs
[15:54:39] still sept 15 there
[15:54:44] I did as well, but that won't help
[15:54:45] yeah
[15:54:57] might just need a restart?
[15:55:03] it's probably this one: Sep 15 22:30:49 wcqs1002 wcqs-blazegraph[2917]: java.io.UncheckedIOException: java.nio.file.FileSystemException: /var/log/wdqs/query_event.log: Read-only file system
[15:55:20] (i have no clue why it keeps saying read-only, the fs is never read only. Rather the directory doesn't exist)
[15:55:53] wasn't this supposed to disappear with the latest puppet change?
[15:56:14] the file query logger was removed, right?
[15:56:31] yes, it probably just needs a restart (but double checking)
[15:56:45] ah, if it wasn't it does
[15:57:01] blazegraph instances aren't restarted during puppet update
[15:57:26] doesn't explain why wcqs1001 works - was it restarted?
[15:57:56] zpapierski: i believe it was, not sure about 2001 but since it's working it must have been
[16:01:53] alrighty, looks like they are all happy enough to return 200 for readiness-probe
[16:02:00] now i just have to find someone willing to deploy the LVS bits :P
[16:02:28] great :)
[16:02:49] it's not clear who that is, i think i have to con gehel into it somehow :P
[16:03:48] The LVS bit has a bit of black magic associated with it. We need to at least check how all that works with traffic.
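A minimal sketch of the kind of per-host readiness check being done above, assuming the endpoint is exposed at /readiness-probe on port 80 of each host; the host list, port, and path here are illustrative, not taken from the log:

```python
import urllib.error
import urllib.request

# Hypothetical host list; the log only names wcqs1001, wcqs1002 and wcqs2001 explicitly.
HOSTS = ["wcqs1001.eqiad.wmnet", "wcqs1002.eqiad.wmnet", "wcqs2001.codfw.wmnet"]


def readiness_status(host: str, port: int = 80, path: str = "/readiness-probe") -> str:
    """Return the HTTP status of the (assumed) readiness endpoint, or the connection error."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return str(resp.status)          # 200 when blazegraph is happy
    except urllib.error.HTTPError as err:
        return str(err.code)                 # e.g. 503 Service Unavailable from jetty
    except urllib.error.URLError as err:
        return f"unreachable ({err.reason})"


if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {readiness_status(host)}")
```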
[16:07:27] i can ask, not expecting much though
[16:44:36] dcausse: someone asked if there is any documentation for "wikibase:hasViolationForConstraint" that can be linked to, with regard to the new streaming updater
[16:45:25] mpham: checking
[16:48:30] I see 9 pages mentioning that in https://www.wikidata.org/w/index.php?search=wikidata%3Ainsource%3AhasViolationForConstraint&search=wikidata%3Ainsource%3AhasViolationForConstraint&title=Special:Search&go=Go (including your announcements)
[16:49:07] and only one on mw.org in https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Constraints but it has a warning that it's marked as "will be discontinued"
[16:52:00] if there's an official doc about wikibase:hasViolationForConstraint and the query service I'm not sure where it is
[16:52:21] mpham: do you have a link to the question? maybe I'm missing some context
[16:53:53] I see it now
[16:56:47] I'm not sure I understand why they want to add the doc for this at this moment now that we are about to disable it...
[16:58:46] yeah, i'm not entirely sure why they wanted the link other than maybe to keep things updated
[17:51:50] ebernhardson: looking over https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959 and https://wikitech.wikimedia.org/wiki/LVS#Add_a_new_load_balanced_service now
[17:52:28] ryankemper: i don't really know if any of that is right :) Mostly just following the docs and looking at other patches from people adding a service
[17:52:53] dcausse: are you still missing home stuff on mwmaint btw? I seem to remember Trey314159 had the same problem during the last reimage of mwmaint* and they resolved it by restoring w/ bacula
[17:52:56] which is to say... it may mostly have the right invocations but i wouldn't really know if anything is missing
[17:54:54] dcausse, ryankemper: yep, I had to get it restored. They did have everything, but it took a little time because restoration was a little more complex since they wiped the server.
[18:03:00] ebernhardson: you're probably as confused as I am with the order of operations here, but do you think I should take care of https://gerrit.wikimedia.org/r/c/operations/dns/+/713929 before https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959?
[18:03:46] ryankemper: i'm thinking deployment goes back to front, following the request path. So we make the wcqs servers happy, then lvs talking to servers, then trafficserver forwarding through lvs, and finally dns to resolve public requests to trafficserver
[18:04:03] ryankemper: not strictly required, we could probably do a variety of pieces out of order, but it was easier to think in a straight line :)
[18:04:28] okay, that makes sense to me. Was asking because the https://wikitech.wikimedia.org/wiki/LVS#DNS_changes_(svc_zone_only) step comes before the service catalog stuff in the docs, but I think that reasoning makes sense
[18:04:53] ryankemper: ahh, i might be mixing up the dns changes, the internal service dns needs to come before the LVS patches, i think
[18:05:01] ryankemper: i was thinking of the public dns (commons-query.wikimedia.org)
[18:05:27] ryankemper: the internal dns i think we can ship now, if it's right
[18:05:53] okay, let me get the manual netbox step in place so we can get the internal dns patch shipped
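A rough illustration of the "next open number" step that comes up just below: given the addresses already allocated in the 10.2.2.0/24 service range, pick the first free one. The allocated set here is made up for the example; the real allocations live in netbox.

```python
import ipaddress


def next_free_service_ip(network: str, allocated: set[str]) -> ipaddress.IPv4Address:
    """Return the first host address in `network` not present in `allocated`."""
    net = ipaddress.ip_network(network)
    taken = {ipaddress.ip_address(a) for a in allocated}
    for host in net.hosts():
        if host not in taken:
            return host
    raise RuntimeError(f"no free addresses left in {network}")


# Example values only: pretend .1 through .66 are taken (.63 being inference.svc.eqiad.wmnet).
already_allocated = {f"10.2.2.{n}" for n in range(1, 67)}
print(next_free_service_ip("10.2.2.0/24", already_allocated))  # -> 10.2.2.67
```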
[18:14:19] ebernhardson: for the IPs in https://gerrit.wikimedia.org/r/c/operations/dns/+/713929/4/templates/wmnet, are those just stubs currently? for ex `10.2.2.63/32` is listed in netbox as `inference.svc.eqiad.wmnet`
[18:14:53] I think it might be only SREs that can see netbox (I don't remember) but here's the netbox entry for that https://netbox.wikimedia.org/ipam/ip-addresses/8994/
[18:15:28] nifty, it let me log in
[18:16:14] ryankemper: looks like someone else took .63 already, just need the next open number
[18:16:53] `10.2.2.67/32` then?
[18:18:13] ryankemper: looks like it, seems there's a variety of new shellbox svc names
[18:31:50] ryankemper: meeting? or are you focused on those puppet patches? We can reschedule for tomorrow.
[18:32:14] gehel: heads down on the DNS, let's do same time tomorrow if that works for you?
[18:32:30] works for me
[18:33:26] I'll end my day early!
[18:33:53] ryankemper: leave me a message here if there are any CRs that I need to follow up on tomorrow morning.
[18:34:00] gehel: ack
[21:05:27] ryankemper: how goes everything, looks like internal dns is working
[21:14:46] ebernhardson: yup internal dns is all deployed; reading thru the lvs documentation now to make sure we're not missing any steps for lvs part 1
[21:16:02] ebernhardson: so far the only thing I can see is that I need to update the ips in the service catalog: https://old.reddit.com/r/Drugs/comments/12v5f8/research_suggests_no_neurotoxicity_in_mdma/c786sjp/ looks like l.egoktm pointed that out as well
[21:16:39] probably wrong link :)
[21:16:57] lol... I'd say so
[21:17:00] https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959/2/hieradata/common/service.yaml
[21:17:09] * ryankemper yells at clipboard manager
[21:17:29] yup, those need to be updated to match what we put in netbox, otherwise hopefully it's ok
[21:17:37] ebernhardson: I guess the other thing that jumps out at me is that with 3 wcqs hosts right now, we might want the depool threshold at like .33 instead of .5
[21:18:20] I know we did that for wdqs internal, but that might be only because it was an internal service
[21:18:24] hmm, i have to check what that does :) sec
[21:19:21] "the percentage of the cluster that will be kept pooled by Pybal even if checks fail".
[21:19:42] so .33 would say if the last host fails keep trying anyways
[21:20:12] tbh i'm not sure, i suspect the goal there is so that if something goes off the rails lvs doesn't depool the whole service by accident
[21:20:36] .5 would mean it keeps 2, so i suppose .33 would be reasonable
[21:21:18] ebernhardson: yeah additionally I believe we page when pybal wants to depool hosts but no longer can (because of the depool threshold), so it changes how many nodes go down before paging I think
[21:22:09] We used `.3` for `wdqs-internal` which has only 3 hosts
[21:22:28] ryankemper: makes sense, we should probably go with .3 here then as well
[21:42:12] ebernhardson: changed the depool threshold and the IPs. currently pcc is failing on codfw but succeeding on eqiad; the error message makes me suspect we're missing some hiera values: https://puppet-compiler.wmflabs.org/compiler1003/951/wcqs2001.codfw.wmnet/change.wcqs2001.codfw.wmnet.err
[21:43:41] it's not yet clear to me why eqiad is compiling and codfw isn't however
[21:50:28] hmm
[21:54:44] I don't get it either. It seems to be complaining that $facts['numa']['nodes'] is undefined. $facts come from Facter or custom facts (not sure what we have there). For the PCC case i think something exports facts from prod and then pcc reuses them
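Going back to the depool-threshold discussion above, a quick sketch of the arithmetic as described by the quoted pybal doc line ("the percentage of the cluster that will be kept pooled by Pybal even if checks fail"); the ceil() rounding is an assumption for illustration, not a claim about pybal's exact implementation.

```python
import math


def min_pooled(total_hosts: int, depool_threshold: float) -> int:
    """Minimum hosts kept pooled, per the quoted description.

    The ceil() rounding is an assumption; check pybal's docs/source for
    the exact behaviour.
    """
    return math.ceil(total_hosts * depool_threshold)


for threshold in (0.5, 0.33, 0.3):
    print(f"3 hosts, threshold {threshold}: keep >= {min_pooled(3, threshold)} pooled")
# 0.5  -> keeps 2 of 3 (matches ".5 would mean it keeps 2")
# 0.33 -> keeps 1 of 3 (the last host stays pooled even if its checks fail)
# 0.3  -> keeps 1 of 3 (the value used for wdqs-internal)
```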
[22:06:58] ryankemper: pcc is broken, possibly for all of codfw. An empty patch also fails: https://puppet-compiler.wmflabs.org/compiler1001/953/
[22:07:23] I suspect https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet-diffs#How_to_update_the_compiler's_facts?_(e.g._INFO:_Unable_to_find_facts_for_host_conf2001.codfw.wmnet,_skipping) would fix it, checking who is in that project
[22:11:03] right, 4 days 11 hours since the last successful build here: https://integration.wikimedia.org/ci/label/puppet-compiler-node/
[22:11:52] I think that timing probably implies something I merged on thursday broke it for wcqs
[22:13:41] ryankemper: hmm, with lvs2010 failing in a similar way i'm suspecting it's not directly related to anything you did. Rerunning my same empty patch against lvs2010 also fails pcc: https://puppet-compiler.wmflabs.org/compiler1003/31181/
[22:14:04] ebernhardson: ah, yeah if it's for arbitrary codfw hosts then I agree
[22:15:06] ebernhardson: ah there's context in today's backlog in #wikimedia-sre
[22:16:13] ahh! indeed there is. And someone even seems to know what's actually wrong :)
[22:16:26] sounds like nothing we can do though, hmm
[22:18:37] yeah guessing it won't be fixed until europe wakes up
[22:19:02] there aren't a whole lot of people on this coast :) mutante might still be around
[22:19:08] (but might otherwise be occupied)
[22:19:14] well, I feel pretty confident that there are no errors that would be caught by pcc in the lvs patch
[22:19:17] seems fine to wait
[22:19:17] (famous last words ofc)
[22:19:25] i have to run in 20 minutes anyways, Liam has a doc appt at 4
[22:19:36] but out of caution we should wait till tomorrow
[22:19:38] agreed
[22:19:43] sounds good
[22:19:50] alrighty, sounds agreed :)
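As a closing illustration of the missing-facts theory above, a small sketch of checking whether an exported facts dump contains $facts['numa']['nodes']. The file path and JSON format are assumptions made for the example; see the puppet-compiler wiki page linked above for how the compiler's facts are actually refreshed.

```python
import json
from pathlib import Path


def has_numa_nodes(facts_file: Path) -> bool:
    """Check whether the exported facts contain $facts['numa']['nodes']."""
    facts = json.loads(facts_file.read_text())
    return isinstance(facts.get("numa"), dict) and "nodes" in facts["numa"]


if __name__ == "__main__":
    # Hypothetical path to an exported facts dump for one host.
    path = Path("facts/wcqs2001.codfw.wmnet.json")
    if path.exists():
        print(f"{path.name}: numa.nodes present: {has_numa_nodes(path)}")
    else:
        print(f"no facts dump found at {path}")
```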