[04:01:10] Some context in #wikimedia-operations, but TL;DR is codfw fell into near-total unavailability for a couple hours
[04:01:42] l.egoktm and I looked into it, blindly banned a dailymotion user agent which was by far the most common user agent, and codfw immediately went back to working again
[04:01:47] I'm sending an e-mail to their team right now
[04:36:43] https://phabricator.wikimedia.org/P17256 here's the user agent (and also a random query, not necessarily a problematic one)... might be worth spending a few minutes seeing if there's anything obviously crazy about their queries, for our own knowledge's sake (cc dcausse / zpapierski)
[06:40:15] ryankemper: thanks! we saw this UA (querying us from 20 gcs IPs IIRC) as well, but we did not take action since it was not "new" and our assumption back then (last Friday) was that it must be a "new user". Banning dailymotion for a while (7 days) makes sense and we should learn something in any case
[06:43:49] dcausse: makes sense. yeah, I observed that it wasn't new as well... in a backwards way it was fortunate that the service was almost completely unusable, because it made it very easy to justify trying out the ban :P
[06:44:11] sure! :)
[06:45:04] I'm going to postpone lowering the thread limit to make sure we don't mix up all this
[06:46:10] we should know soon; the holes (for wdqs2*) in https://grafana-rw.wikimedia.org/d/000000489/wikidata-query-service?viewPanel=32&orgId=1&refresh=1m&from=now-7d&to=now are the times blazegraph was stuck
[07:33:08] dcausse: making sure I understand this comment - https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/683687/comment/45636578_1abc6349/ - having a broken jar in there isn't a problem?
[07:34:52] zpapierski: previously it was trying to optimize by not re-uploading; now it re-uploads all the time even if the jar name is identical, so a broken jar can be re-uploaded
[07:35:09] ah, ok, then it's cool
[07:35:27] https://gerrit.wikimedia.org/r/c/wikidata/query/deploy/+/683687/22/flink/flink-job.py#206
[07:35:39] the jar is always deleted
[07:39:58] I'm looking at Erik's patch on UriScheme (https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/716076/2) and I'm trying to remember why we didn't make the UriScheme-related classes Serializable from the beginning... I'm sure this must have come up
[07:41:38] I think we found it was not necessary, as building it from the hostname was sufficient, so between passing a String host vs a UriScheme we preferred the former, I suppose
[07:57:29] looking into Alertmanager, do we want to set up our own team there and have notifications in this channel + the disco-alert email, or stay with wikimedia-operations?
[07:58:05] for WDQS we want a separate channel until we are stable enough to notify everyone
[07:58:15] Probably the same for WCQS
[07:58:32] For Cirrus, I think we can stay with the usual -operations
[07:58:48] ok
[07:59:01] setting up a new team there then
[07:59:06] cool!
[07:59:57] I agree, I actually think we should keep a separate channel forever - imho it is very difficult to communicate on channels that handle personal communication and alerts
[08:01:34] relocating
[08:01:43] you mean separate as in #wikimedia-search-alerts?
[08:02:15] or wdqs-alerts, but yeah
[08:02:15] we can discuss that in gerrit I suppose
[08:02:20] let's
[08:33:43] who did we have to ban? :O
[08:46:52] dailymotion, it looks like
[09:03:25] addshore: P17256
[09:03:38] https://phabricator.wikimedia.org/P17256
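For our own knowledge's sake (per the 04:36 note above), a minimal sketch of the kind of user-agent tally that surfaces a dominant UA like the dailymotion one. The sample-log path and the tab-separated "UA is the last column" layout are assumptions for illustration, not the real webrequest schema:

    # Sketch: count requests per user agent in a sampled request log.
    # The file path and field layout below are assumptions, not the
    # actual log format.
    from collections import Counter

    counts = Counter()
    with open("/tmp/wdqs-requests.sample.tsv", encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if fields:
                counts[fields[-1]] += 1  # assume the UA is the last column

    for ua, n in counts.most_common(10):
        print(f"{n:8d}  {ua[:120]}")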
[09:19:10] dcausse: are you aware what the two Flink instances of flink on yarn are for?
[09:19:26] they've apparently been launched on Tuesday?
[09:32:52] also - is there anything I can do to push WDQS forward? since Erik is working on the streaming updater for WCQS, I could probably pick up something else there, like authentication with JWT, but I'd rather help make the selected deadline for migration
[09:53:11] zpapierski: looking
[09:53:57] this is me testing with 1.14, killing
[09:54:44] ah, ok
[09:54:46] both?
[09:55:00] for wdqs the remaining work is the alerts and finishing https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/670242
[09:55:59] huh, it's been in WIP for some time - want me to finish that?
[09:56:06] the other is 1.13.2, so probably me again
[09:56:24] zpapierski: sure, I think we want to re-work it a bit
[09:56:54] as in: support the case when we transfer from one DC to another
[09:57:26] we won't be transferring offsets in that case - timestamps?
[09:57:32] yes exactly
[09:57:36] ok
[09:57:48] approximate with timestamps
[09:58:12] so we transfer a timestamp and then approximate an offset on the receiving end
[09:58:15] I guess the "difficulty" is to know when to do this timestamp approximation
[09:59:08] and how to approximate, since events in kafka aren't required to be in proper order - unless that's not the case?
[09:59:17] in the normal case (e.g. when the topic and/or something identifying the kafka cluster matches) use the provided offsets, otherwise use the timestamp
[09:59:55] we'll assume that the timestamps of the two DC mutation topics are roughly equivalent
[10:00:38] replaying the same diffs should be OK
[10:01:01] by equivalent you mean that the order is kept?
[10:01:07] (or semi-kept)?
[10:02:46] the timestamp in the mutation topic is the kafka ingestion time
[10:02:53] it's not the event time
[10:03:59] I had thought about using event time as the kafka timestamp but felt it could be massively out of order, causing the kafka timestamp index to be a bit weird
[10:04:16] I see what you mean
[10:04:27] but won't mirror maker screw with ingestion timestamps?
[10:04:52] mirror maker should replicate the kafka timestamp provided
[10:05:11] ah, in that case it should be perfectly fine
[10:06:00] this kind of transfer codfw -> eqiad only works in ideal conditions: no backlog (event time ~= kafka timestamp)
[10:07:09] sure, the state vs repeated events might cause inconsistencies
[10:07:27] (for a short while at least)
[10:07:50] yes, but that is unavoidable with this active/active setup
[10:07:54] yeah
[10:23:02] lunch
[10:54:34] relocating & lunch
[12:30:56] We're starting to see a few project ideas for our hackathon: https://docs.google.com/document/d/1g1tPPWuiOTNBsH5-vK-7BEb_Esal3PWmHCosIwPTvd8/edit#heading=h.csq0bybqlx0i
[12:31:02] feel free to add your own
[14:26:43] ryankemper: i saw your email about banning the user on codfw and that it 'immediately restored full service availability'. Do you have any recommendations on whether or not we still need the cron job to restart the servers at this point?
[14:39:23] errand
[14:39:40] might be a couple minutes late for the retro
[15:01:56] \o
[15:17:33] o/
[15:32:31] ryankemper: do you have a couple minutes now for discussing cookbooks?
[15:33:17] dcausse: yup, wanna make a room?
[15:33:21] sure
[15:33:42] ryankemper: https://meet.google.com/fvz-noyi-cgh
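Aside on the offset-by-timestamp idea discussed above (09:57-10:06): a minimal sketch, using kafka-python, of how the receiving side could approximate an offset from a transferred kafka timestamp. The broker address, topic name and timestamp value are placeholders, not the real updater configuration:

    # Sketch of the timestamp -> offset approximation discussed above,
    # using kafka-python. Broker, topic and timestamp are placeholders.
    from kafka import KafkaConsumer, TopicPartition

    def offsets_for_timestamp(bootstrap, topic, timestamp_ms):
        """Return {partition: first offset whose kafka timestamp >= timestamp_ms}."""
        consumer = KafkaConsumer(bootstrap_servers=bootstrap)
        partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
        # offsets_for_times maps each partition to an OffsetAndTimestamp,
        # or None when no message is that recent (empty/expired partition).
        found = consumer.offsets_for_times({tp: timestamp_ms for tp in partitions})
        consumer.close()
        return {tp.partition: (ot.offset if ot else None) for tp, ot in found.items()}

    # e.g. offsets_for_timestamp("kafka-main:9092", "eqiad.mutation", 1631178000000)
    # (broker and topic names here are made up)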
[15:34:34] mpham: it's helpful as a general defensive measure. we could try lifting it and see if stuff stays stable, though
[15:35:08] I still see some minor patchiness in https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1631176171257&to=1631201493847 which suggests occasional deadlock from normal usage
[15:42:26] one thing that occurred to me is that before putting this hourly restart in place, we would occasionally have servers get backed up by several hours of lag
[15:43:33] and the reason would be that they would be deadlocked for several hours (therefore not processing updates from the updater), then would either be manually restarted, naturally get out of the deadlock, or be incidentally restarted as part of something else (a deploy etc), at which point blazegraph starts processing updates again and "realizes" how far behind it is
[15:43:53] so even if we got rid of hourly restarts, restarting every 3 or 6 hours would be helpful in avoiding that
[15:45:03] yes, some servers tend to stall, but they get killed by the JVM with an OOM iirc
[15:45:21] we could tune the jvm to be more aggressive on that, perhaps kill earlier
[15:48:29] regarding codfw today after the ban, the blips are very minor, so I'm not sure the cause is the same as before the ban
[15:49:06] yeah agreed
[15:49:20] I'm not opposed to removing the hourly restarts and seeing how things fare
[15:53:19] me neither, but I'm fine to keep it for a couple more days too
[16:00:42] I'll remove it today so we can observe
[16:00:54] ebernhardson: I'm only now reading the last standup notes (which I really should've found time to do during the conference) - WCQS has a different favicon and logo than WDQS
[16:01:56] for the life of me I cannot remember how they are served, but I'm guessing that's not really important since they won't be served the same way through the microsite, right?
[16:02:19] zpapierski: i know how they are served, but i don't know where to get the files
[16:03:12] zpapierski: basically, they will live in sites/wcqs/* in the gui-deploy repo. Any file requested from the root of the domain that does not exist in the gui build will source from sites/{wcqs,wdqs}
[16:03:50] (and then we exclude the favicon.ico and logo*.svg files from being added to gui-deploy by the build)
[16:05:24] logos are in puppet
[16:06:42] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/query_service/files/gui/
[16:07:18] dcausse: where? I only see 4 - that's the favicon, what about logo and logo-embed? https://query.wikidata.org/logo.svg
[16:07:27] dcausse: ignore the first half, i see now :) they aren't svg's
[16:07:45] bah, i can't communicate today... I mean puppet has the .ico but i also need to find the .svg's
[16:07:47] it's sourced from commons IIRC
[16:08:09] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/query_service/files/gui/custom-config-wcqs.json
[16:08:16] https://upload.wikimedia.org/wikipedia/commons/3/39/Commons-logo.png
[16:08:38] i also hadn't realized these are here, but they are also in the gui-deploy repo? I had been under the assumption that the configs in gui-deploy were canonical
[16:09:36] ohh! this custom-config is served from nginx
[16:09:56] but the existing wdqs microsite deployment shouldn't ever route /custom-config.json to the nginx instance
[16:10:26] can we do the same in the microsite?
[16:10:55] and move all these resources there?
[16:11:12] my preference would be to take all of this out of puppet and put it in gui-deploy
[16:11:23] but maybe that's because i was already thinking that way before i found out we have config in puppet too :)
[16:12:05] I can't remember why they ended up in puppet tbh, so I'm fine with your suggestion :)
[16:12:20] I have a nagging feeling we had some good reason for that
[16:12:24] for the actual traffic routing, this would work though. The new plan, to support oauth, is ATS sends all traffic to nginx, and then nginx routes to the microsite. This is the opposite of wdqs, where ATS does all the routing
[16:12:41] so we could leave it in puppet, if needed or desirable for some reason
[16:13:01] I'm not sure it matters with the new deployment anyway if it's in puppet
[16:13:23] hmm, maybe a commit message would clear things up
[16:13:31] I wonder... should we change wdqs routing to match wcqs? Just so we don't eternally have to remember how they vary
[16:15:47] I'm not sure, I think the oauth solution should be temporary (I know, that's hilarious) and we should end up with the sparql endpoint served via the api gateway + the ui modified to fit that
[16:16:13] otoh, having fewer differences sounds easier
[16:17:05] mostly i was thinking of that because right now, if we deployed, a custom-config.json for wdqs would go ats->microsite->gui-deploy, but wcqs would go ats->nginx->config from ops/puppet
[16:17:19] seems quite surprising to anyone who didn't already know
[16:17:57] but if we take it out of puppet and make wdqs go ats->microsite->gui-deploy, and wcqs go ats->nginx->microsite->gui-deploy, then maybe not a big deal
[16:22:22] I wish I could find this discussion somewhere, but if I remember correctly, puppet was about being able to reshape the WDQS ui without changing much of the original source
[16:23:13] perhaps we wanted to have the gui-deploy repo only modified from the gui build process?
[16:23:15] we added some custom config options and then proxied it through nginx to be able to change it with the class of the instance
[16:24:07] dcausse: yeah, basically - everything needed to make this work was in puppet, since the WDQS gui reads custom-config directly over http, ergo, from nginx
[16:24:35] from what I understand about the microsite, this will not be the case anymore anyway
[16:24:49] how does the microsite know it's running wcqs vs wdqs?
[16:26:15] good question, I assumed so, but it may not be the case
[16:26:52] why don't we continue like this: https://gerrit.wikimedia.org/r/c/wikidata/query/gui-deploy/+/717649/ ?
[16:27:47] I think that's basically what Erik suggests
[16:28:04] I thought so as well :)
[16:29:12] (at least after I remembered why we did stuff the way we did)
[16:30:00] anyway, if we can drop everything related into the microsite I'm all for removing that from puppet and the nginx configuration
[16:30:32] obviously, we still need to proxy through nginx for authorization, but that's a completely different thing
[16:30:50] errand
[17:03:13] * ebernhardson got distracted
[17:04:52] the microsite knows which it's running based on the incoming domain name; a domain name maps to a single VirtualHost declaration in httpd, and that declaration checks the per-site directories for files not found in the main build
[17:05:40] i suppose it's puppet that decides, by writing the httpd file with appropriate per-domain paths
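A toy sketch of the per-site fallback described above: a file requested from the root of the domain is served from the main gui build if present, otherwise from sites/<site>/ (wdqs or wcqs). The doc-root path and function name are illustrative, not the actual httpd configuration:

    # Toy illustration of the per-site fallback: try the main gui build
    # first, then sites/<site>/. Paths are placeholders, not the real
    # deployment layout.
    from pathlib import Path
    from typing import Optional

    def resolve(doc_root: str, site: str, requested: str) -> Optional[Path]:
        root = Path(doc_root)
        for candidate in (root / requested, root / "sites" / site / requested):
            if candidate.is_file():
                return candidate
        return None

    # e.g. resolve("/srv/gui-deploy", "wcqs", "logo.svg") falls back to
    # /srv/gui-deploy/sites/wcqs/logo.svg when logo.svg isn't in the main build.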
[17:43:19] hmm, what do we call the nginx that sits in front of blazegraph? Right now it's installed by query_service::gui and that name seems incorrect with the separation out to microsites
[19:34:47] ebernhardson: hmm, naming is always tricky... my first thought was `frontend` since it's the frontend to the backend, but it's not the frontend in the way that httpd is :P
[19:34:53] so not sure if that would just make it more confusing
[19:35:28] possibly `blazegraph_frontend`, although that would need a rename when we're eventually off of blazegraph
[19:39:35] yea, i can't decide :S For elastic we call it tlsproxy, but that's a specific implementation in puppet so we probably shouldn't reuse the name
[19:56:12] I guess one consideration is, at least for wcqs, every request will talk to that nginx, right?
[19:56:25] yes, but this has to install nginx for wdqs too
[19:56:32] right
[19:56:48] and this is more of a question for traffic, but is the user actually always talking to varnish?
[19:57:11] like let's say the thing the user is requesting isn't cached, are they still talking to varnish, who talks to nginx and fills its cache based off the response?
[19:57:56] basically trying to ascertain if it makes sense to think of the blazegraph nginx as the middleman between varnish and the backend [blazegraph]
[19:58:14] hmm, the whole question of what goes to varnish or ATS is bewildering to me :) i thought ats was doing most of the lifting these days, but puppet has >1k varnish references
[19:58:57] could certainly consider it a middle-man of sorts. One of many :) Maybe we just call it query_service::proxy
[19:58:58] it's a mystery to us all :) I might be using "varnish" imprecisely too
[19:59:15] I was thinking of ATS as related to pybal and varnish as related to the actual caching
[19:59:21] but that might be completely wrong
[19:59:53] like I think it might go client->ats->varnish->nginx->blazegraph if we ignore the microsite part
[20:00:12] ebernhardson: I like `query_service::proxy`, I think that describes its role pretty well and will be resilient to us swapping out components like blazegraph
[20:00:31] and it seems to describe both the wcqs and the wdqs behavior reasonably well
[20:01:40] hmm, my naive understanding is most domains CNAME to dyna.wikimedia.org, that resolves via some geo-magic to varnish caching frontends, those frontends do in-memory caching and forward misses to ATS backends, ATS backends have (maybe?) an on-disk cache, and on miss forward to the service configured in hiera
[20:01:41] maybe
[20:01:43] :)
[20:02:23] at one point backends in not-eqiad forwarded to the frontends in eqiad. No clue if it still works that way now that we are kinda-sorta multi-dc
[20:03:37] * ebernhardson goes with proxy
[20:03:39] oh, that totally makes sense, it should have been `client->varnish->ats` otherwise the cache isn't doing anything :)
[20:04:23] your naive understanding is quite good :P
[20:14:40] ebernhardson: oh, and yesterday i wasn't able to get the tunnel working to check out the wcqs internal microsite... I know it has to be something really obvious though
[20:14:54] is it not just `ssh -L 80:localhost:80 miscweb1002.eqiad.wmnet`? I suspect the issue is that it's not port 80 on `miscweb1002` but wasn't sure which
[20:15:45] here's a magic lsof command I stole from stackoverflow on miscweb:
[20:15:49] https://www.irccloud.com/pastebin/gdANYT06/
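As an aside on the "which port is it actually listening on" question: the same check as the lsof one-liner can be sketched with psutil, assuming the package is available on the host and the caller has enough privileges to see other users' processes:

    # Sketch: list listening TCP sockets and their owning processes, an
    # alternative to the lsof incantation above. Assumes psutil is
    # installed; pid/name lookup may need elevated privileges.
    import psutil

    for conn in psutil.net_connections(kind="tcp"):
        if conn.status == psutil.CONN_LISTEN:
            name = psutil.Process(conn.pid).name() if conn.pid else "?"
            print(f"{conn.laddr.ip}:{conn.laddr.port}  pid={conn.pid}  {name}")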
[20:16:25] ryankemper: hmm, it should be. Although to get it you need https://gerrit.wikimedia.org/r/c/operations/puppet/+/717630 to add the httpd config
[20:17:11] ah okay, so we don't actually have the [internal] site up currently then? that would explain it
[20:17:31] ryankemper: yea, it's deployed to the server but it needs that config file for httpd to respond
[20:17:46] right now it returns the static bugzilla archive (probably whichever vhost it returned first) :)
[20:18:01] ebernhardson: ah well, I get literally nothing currently, so sounds like something is wrong on my end then
[20:18:40] ryankemper: oh, also you have to forward to 443 and talk https
[20:19:08] * ebernhardson hadn't fully tested yesterday
[20:20:11] oh, interesting. Actually this won't work because wikimedia.org uses HSTS (tls cert pinning of sorts) and of course we don't have those certs
[20:20:24] i guess curl it is :P
[20:20:59] to see it for real from the public web, i think i have all the patches up except step 2+ for LVS
[20:21:08] if you wanna choose some time we could go through them and try to ship most of it
[20:21:55] i've never set up lvs before though, so who knows how much work is in there :P
[20:22:30] So maybe early next week, let's see how much progress we can make
[20:22:50] ok
[20:23:00] I still need to get that dns patch out
[20:24:21] and then I forget the blocker on the LVS patch (https://gerrit.wikimedia.org/r/c/operations/puppet/+/713959), it was either getting oauth in place first, or the dns change, or likely both
[20:27:28] hmm, oauth shouldn't need to be set up first. config for that should be coming into place soon, will be taking a deeper look into the nginx config today and tomorrow and should be able to find the right oauth data there too. Guessing, LVS should only need the health url's to be working
[20:28:02] before lvs though it needs the netbox change to assign the internal ip's lvs will use, and then the dns patch to give those ip's names
[20:28:15] i guess the netbox change is some ui, not gerrit
[20:28:54] yeah, should be in the ui
[20:45:24] patch to remove the codfw hourly restarts if anyone wants to take a quick look: https://gerrit.wikimedia.org/r/c/operations/puppet/+/720102
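On "LVS should only need the health url's to be working": a rough sketch of the kind of HTTP probe pybal/LVS would run against the service. The hostname, port and /healthz path are placeholders, not the check actually configured in puppet/hiera:

    # Rough sketch of an LVS/pybal-style HTTP health probe. Host, port
    # and path are placeholders; the real check is whatever ends up in
    # the puppet/hiera service definition.
    import urllib.request

    def is_healthy(host: str, port: int, path: str = "/healthz", timeout: float = 2.0) -> bool:
        url = f"http://{host}:{port}{path}"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    # e.g. is_healthy("wcqs1001.eqiad.wmnet", 80)  # hostname is made up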