[00:06:46] legoktm: if you feel like killing some more mailman2 stuff https://gerrit.wikimedia.org/r/c/operations/puppet/+/716077
[07:53:10] marostegui: could you take a look at https://gerrit.wikimedia.org/r/c/operations/puppet/+/715742 as well?
[07:53:20] sure
[07:53:31] I guess I need somebody else from WMCS, arturo is on vacation currently
[09:00:14] hey dcaro, could I get your review on https://gerrit.wikimedia.org/r/c/operations/puppet/+/715742? O:)
[09:02:45] looking
[09:06:04] vgutierrez: if I understand from the comments, this will only trigger a systemctl daemon-reload, not an haproxy reload, right?
[09:06:23] dcaro: yes, you're right
[09:06:37] vgutierrez: then +1 from me
[09:06:52] could I get that on the gerrit change? :)
[09:07:54] done
[09:31:29] thx <3
[10:11:46] jbond: hi, I am rebuilding the operations/puppet CI Docker image. That will take a while though (slow disk)
[10:12:08] hashar: ack thanks <3
[10:38:15] jbond: Successfully published image docker-registry.discovery.wmnet/releng/operations-puppet:0.8.5
[10:40:11] great thanks hashar
[10:41:53] jbond: I am updating the jenkins job
[10:43:24] {done}
[10:43:33] thanks
[10:47:23] fricking DSL... 10 minutes to download the latest CI image
[10:47:42] just to get a fatal
[10:47:47] fatal: cannot update ref 'refs/tags/docker-head': trying to write ref 'refs/tags/docker-head' with nonexistent object 5de4f3229e45f7623e3a5d56fb6cd2d43224d021
[10:47:51] :(
[10:48:30] hashar: it looks like 0.8.5 breaks utils/run_ci_locally.sh
[10:50:04] hmmm it's working after running a git fetch
[10:50:35] so I guess there is an implicit requirement there
[11:10:37] Hi, I'm seeing `ssh: connect to host mw2264.codfw.wmnet port 22: Connection timed out` from scap for mw2264, is that an issue?
[11:10:44] (crossposted from -operations, see there for more context)
[11:47:21] urbanecm: https://phabricator.wikimedia.org/T290242
[11:47:32] thanks XioNoX
[14:06:23] Since joe is away, who can I bother about jobs? Context: we picked up T48643
[14:06:23] T48643: [Story] Dispatching via job queue (instead of cron script) - https://phabricator.wikimedia.org/T48643
[14:08:22] Amir1: so are we talking about a 'joe job' :-P
[14:09:30] :D
[14:15:55] Amir1: if you give me some context as to what you need specifically
[14:16:00] I may be able to help
[14:16:36] effie: first and foremost, it's a heads up.
[14:16:49] Second, we might introduce a lot of jobs
[14:16:49] ACK
[14:16:50] :p
[14:17:01] now send me the ciphersuite :D
[14:17:31] I skipped the syn-ack, so this convo is dodgy as it is
[14:18:06] :D
[14:18:34] ok so, if we have an easy way
[14:18:49] to turn it off in case it starts being excessive, that would help
[14:19:02] also, I think we can tune concurrency in the jobqueue
[14:19:05] the thing is that this might need a "high prio" job and also a way to deduplicate properly
[14:19:21] have you discussed it with petr?
[14:19:26] dedupe is a bit tricky
[14:19:33] he's on vacation :D
[14:19:57] oh god, we are really going barefoot here
[14:20:42] ok, any other notable information?
[14:21:51] can we use MainStash?
[14:22:05] I want to find a way to properly deduplicate
[14:22:13] let me summon Krinkle at this point of the convo
[14:22:14] or how reliable is memcached?
[14:22:40] I think Petr will come back soon
[14:22:43] if a server goes down, this shard is gone
[14:23:18] new data will be routed to the gutter pool, the servers that take over when a shard is down
[14:23:41] Hi
[14:24:25] Krinkle: it is about https://phabricator.wikimedia.org/T48643
[14:25:14] and amir was asking if MainStash could be used for dedup
[14:25:21] but I can't answer that
[14:26:15] or any way of deduping would work, but not as a simple hash
[14:26:15] What's wrong/limiting with the built-in job dedupe logic?
[14:26:41] Krinkle: I'm looking into it but I couldn't find a good way to do it
[14:27:17] so let me explain: it's quite common that lots of edits happen on wikidata back to back (with less than one second timediff)
[14:27:23] Ok. Imagine it doesn't exist, describe how you'd do it with main stash.
[14:27:28] because of how the old termbox works, etc.
[14:28:00] Eg what is checked before pushing the job, what's the job spec, what's checked at runtime
[14:28:11] each edit would then trigger a job which in turn would trigger jobs in the client wikis
[14:29:13] okay, let's go: the root job for each edit would be [revid => 1234, entity id => Q123, aspects changed => C.P123] # Item Q123 has statement P123 changed
[14:29:36] the subsequent edit would be [revid => 1235, entity id => Q123, aspects changed => L.en]
[14:30:11] we want to be able to aggregate the revids and the aspects because it matters what jobs will be triggered down the road
[14:30:34] it'll be quite a tree of jobs, several levels (it's already a couple of levels)
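[Note: a minimal sketch, in Python rather than Wikibase's actual PHP job classes, of the aggregation described above. The record fields mirror the chat example; merge_changes is a hypothetical helper that only illustrates combining revids and changed aspects for the same entity.]

    # Two consecutive edits to the same entity, as in the example above.
    # These dicts are illustrative only; they are not the real job parameters.
    change_1 = {"revid": 1234, "entity_id": "Q123", "aspects": ["C.P123"]}  # statement P123 changed
    change_2 = {"revid": 1235, "entity_id": "Q123", "aspects": ["L.en"]}    # the second edit from the example above

    def merge_changes(changes):
        """Aggregate per-edit change records for one entity into a single record."""
        return {
            "entity_id": changes[0]["entity_id"],
            "revids": sorted(c["revid"] for c in changes),
            "aspects": sorted({a for c in changes for a in c["aspects"]}),
        }

    # {'entity_id': 'Q123', 'revids': [1234, 1235], 'aspects': ['C.P123', 'L.en']}
    print(merge_changes([change_1, change_2]))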
[14:31:08] This root job is local to wikidata, yes?
[14:31:25] Yup
[14:32:02] the job will take a look at what wikis are subscribed to that entity and trigger a job in each of those wikis
[14:32:25] And the current system appends to a db table which the cron job scrapes every few minutes, merging each batch accordingly where possible
[14:33:13] yup, wb_changes
[14:33:35] which has its own cron to delete old changes (older than 3 days), tech debt from top to bottom
[14:34:20] I don't remember if we talked about this already or whether that was with addshore.. I recall suggesting somewhere that maybe you could keep that design: replace the cron with a deduped job, the job would be generic, always exactly the same, and no parameters
[14:34:41] we can sorta keep that table but it's quite a terrible thing, especially since most edits in wikidata don't trigger edits on client wikis anymore (papers, etc.)
[14:35:03] papers?
[14:35:20] bots batch upload every paper ever published
[14:35:40] that's around more than half of wikidata now
[14:38:23] Ok. I think you're saying that if we use jobs without the table, we can somehow avoid storing temporary change data for those unused wikidata items
[14:38:37] Eg they would not push this non-dupe job
[14:38:40] How?
[14:40:07] "Eg they would not push this non-dupe job" I have trouble understanding this :D
[14:40:09] Krinkle: we did indeed touch on this at some point in the past year
[14:41:16] Amir1: I'm just checking whether the paper data is something you see as a downside of the current design that we can fix with this redesign now, or if it's a distraction
[14:41:48] my assumption is that in main stash, we can keep it for much shorter and it doesn't need a cron to evict (hopefully)
[14:42:12] Well, it'd have a different cron, like parser cache
[14:42:31] But then it becomes someone else's problem :)
[14:42:38] but also, I hope we can change the subscriptions and avoid triggering even the root job, which means it won't go into main stash
[14:42:53] Krinkle: in nice words, simplifying our infra :D
[14:43:14] And we should keep in mind that we also want to "simplify" this for 3rd party wikibase users
[14:43:15] now it has a dedicated cron
[14:44:00] If you can avoid triggering the root job, can you avoid appending the information to the wb_changes table? They seem like the same thing
[14:44:08] addshore: do you know if third parties are going to have mainstash or something similar as default?
[14:44:42] "If you can avoid triggering the root job, can you avoid appending the information to wb changes table?" some notes
[14:44:51] It'd also be an optimisation I'd like to ignore for now to understand the rest first, maybe..
[14:45:07] 1- I'm not sure if we can avoid triggering the root job, I have the hypothesis but need to double check
[14:45:37] 2- The problem of duplicates and deduplication is another aspect. It's several edits being sent back to back. Not papers
[14:45:45] MainStash by default is a SQL object cache table with a random deferred update to garbage collect a few expired rows
[14:46:11] It's for strong persistence
[14:46:40] I have been thinking that even WAN cache would work too
[14:46:45] Yes
[14:46:46] depends on how reliable it is
[14:47:11] and I assume we can tolerate some loss (0.1%?)
[14:47:23] it's not canonical data
[14:48:21] * addshore is in a call right now but will catch up in a bit
[14:48:45] MichaelG_WMDE is also on the hike
[14:49:42] Krinkle: did I answer your questions wrt the different aspects and use cases? I'm more than happy to jump on a call
[14:50:23] Amir1: What's the ideal flow for your idea? Eg what goes into WAN? How does the WAN data help reduce the root (Rev,Q,aspect) job?
[14:51:36] the job gets deduplicated by q-id, an edit puts the revid and aspects changed into WAN and queues the job
[14:52:21] one of the jobs triggered by the consecutive edits picks up the Q, reads WAN, and triggers the jobs down below
[14:52:40] (jobs down below need to know what aspects have changed and what the rev ids are)
[14:52:56] So the second edit would lock the memc key and modify it to add more revs and aspects?
[14:53:04] yup
[14:53:17] that's the idea I have
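[Note: a rough sketch of the flow described here, assuming a generic shared store as a stand-in for WAN cache / main stash and an in-memory queue as a stand-in for the job queue; none of the names below are MediaWiki's actual JobQueue or cache APIs. As the next messages point out, this is lossy if the store can evict entries.]

    from collections import defaultdict, deque

    pending_changes = defaultdict(list)  # stand-in for one WAN/stash key per entity
    queue = deque()                      # stand-in for the job queue
    queued_entities = set()              # stand-in for the queue's dedupe-by-params logic

    def on_edit(entity_id, revid, aspects):
        """Called for every edit: record the change, queue at most one dispatch job per entity."""
        pending_changes[entity_id].append({"revid": revid, "aspects": aspects})
        if entity_id not in queued_entities:  # the job is deduplicated by entity id only
            queued_entities.add(entity_id)
            queue.append(entity_id)

    def run_dispatch_job(entity_id):
        """The deduplicated job: drain everything accumulated for this entity and fan out."""
        queued_entities.discard(entity_id)
        changes = pending_changes.pop(entity_id, [])
        if not changes:
            return
        revids = [c["revid"] for c in changes]
        aspects = sorted({a for c in changes for a in c["aspects"]})
        # Here the real job would look up the client wikis subscribed to this entity
        # and queue a job on each, passing the aggregated revids and aspects along.
        print(f"dispatch {entity_id}: revids={revids} aspects={aspects}")

    on_edit("Q123", 1234, ["C.P123"])
    on_edit("Q123", 1235, ["L.en"])  # back-to-back edit: no second job is queued
    while queue:
        run_dispatch_job(queue.popleft())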
[14:53:19] And then when the memc key is pushed out, the job is broken
[14:53:27] Critical data can't be in memc
[14:53:47] yeah, that's why I was thinking of main stash
[14:53:53] The dedupe system uses memc as well, but it gracefully degrades to having dupes
[14:54:06] Okay
[14:54:17] I think the table would work better for this since it's short lived
[14:54:37] Main stash DB will not work well unless you also delete each key explicitly yourself
[14:54:51] Otherwise it'll build up a huge GC queue
[14:55:57] The job can then remove the rows it has successfully processed, right?
[14:56:01] I think it's possible to do so when the job is done, but also what if it gets changed in the meantime? It can redo it again
[14:56:28] Append only, not modify
[14:56:59] oh that simplifies things a bit :D
[14:57:05] that sounds quite nice
[14:57:19] Does the table have an int primary key?
[14:57:54] Or is it currently unique on the local content structure itself?
[14:58:46] wb_changes? it has an int PK
[14:59:08] Okay, so you can reliably select by Q, and then remove those rows by AI key
[14:59:59] It might still be simpler though to go for a totally generic job, basically doing what the cron does now, completely deduped and centralised
[15:00:09] Has smaller risk of leaving rows behind
[15:00:34] hmm, sure
[15:00:48] I don't know how perfect the dedupe tracking is, eg races around adding rows then queuing the job at the same time as the job queue popping the job
[15:01:04] I mean worst case, if for whatever reason it doesn't work, we can swap that part, but the main part is the job
[15:02:25] If the pending dedupe marker is removed before the job request even starts then it should be fine, but yeah, might be simpler to keep more of what you have at first. Can always refactor more, right?
[15:03:00] Just don't execute the shell script as the job :P
[15:03:09] But almost that, basically
[15:03:26] Does that sound terrible?
[15:03:46] What comes to mind as suboptimal
[15:04:12] haha sure I promise :D
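[Note: a self-contained sketch of the append-only table pattern discussed above, using sqlite3 and made-up table/column names in place of the real wb_changes schema. Each edit appends a row; the dispatch job selects the rows for its entity, fans out, and deletes only those rows by primary key, so rows appended in the meantime are never lost.]

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE pending_change ("
        " pc_id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " pc_entity TEXT NOT NULL,"
        " pc_revid INTEGER NOT NULL,"
        " pc_aspects TEXT NOT NULL)"
    )

    def record_edit(entity_id, revid, aspects):
        # Append-only: every edit adds a row, nothing is updated in place.
        db.execute(
            "INSERT INTO pending_change (pc_entity, pc_revid, pc_aspects) VALUES (?, ?, ?)",
            (entity_id, revid, ",".join(aspects)),
        )

    def dispatch_for_entity(entity_id):
        # Select by entity, then remove exactly the rows that were processed, by PK.
        rows = db.execute(
            "SELECT pc_id, pc_revid, pc_aspects FROM pending_change WHERE pc_entity = ?",
            (entity_id,),
        ).fetchall()
        if not rows:
            return
        revids = [r[1] for r in rows]
        aspects = sorted({a for r in rows for a in r[2].split(",")})
        print(f"dispatch {entity_id}: revids={revids} aspects={aspects}")
        db.executemany("DELETE FROM pending_change WHERE pc_id = ?", [(r[0],) for r in rows])

    record_edit("Q123", 1234, ["C.P123"])
    record_edit("Q123", 1235, ["L.en"])
    dispatch_for_entity("Q123")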
[16:40:25] bd808: I +1ed your toolhub changes, let's see how it goes
[16:46:18] effie: *excitement builds* Thanks for the review. Splitting that mcrouter patch made me realize that I was not setting up the tests to cover it properly, which then helped me find and fix a couple of problems in the templates.
[16:49:33] effie: would you have time to help me get a secret added for this template? l.egoktm set up several, but I did not have the real WIKIMEDIA_OAUTH2_SECRET value then. I have it now in mwmaint1002.eqiad.wmnet:/home/bd808/toolhub_secrets for a root to copy into the right magic puppet/private place.
[16:49:39] I once ran into an issue where I updated the chart and values, and I hit some limits while k8s was trying to bring up the new pods, and it created a sweet mess
[16:49:45] so I learnt my lesson
[16:50:28] I can't promise it for today, I will be on and off my laptop
[16:50:47] no worries. I can nerd snipe k.unal later ;)
[16:59:22] \o/ exciting times
[16:59:28] I can do it now
[16:59:55] thanks legoktm
[17:03:14] bd808: {{done}}, you should be able to check that deploy1002:/etc/helmfile-defaults/private/toolhub/*.yaml has the correct value now
[18:17:25] legoktm: is there anything special you know of that needs to be done after merging a new helmfile.d/service entry? https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/716235/8#message-393290fca553cfafdb3c5bf4e2f69c88695e7b4f
[18:19:41] I think the problem is because CI runs against the published chart, so for your helmfile.d it was running against a chart that didn't have your mcrouter changes. But now it's running against the published chart with those changes, and something is incompatible with the new chart + your helmfile.d
[18:21:13] That kind of makes sense. It is failing for things that are in the v0.0.4 chart. But that also makes all of this even more scary if the tests are not testing the soon-to-be reality but instead some past that will never change. :/
[18:22:42] agreed, it took me a while to figure out that I needed to split chart and helmfile.d changes, waiting in between for the chart to be published (which really limits the usefulness of CI)
[18:23:22] Ok. What I can do right now is roll back my helmfile.d bits to unblock everyone else.
[18:23:43] +1, I'm also trying to figure out the failure
[18:25:54] The chart expects data from /etc/helmfile-defaults/mediawiki/mcrouter_pools.yaml and that's not present in a fixture. I think that's the whole mess.
[18:29:36] but... how is the mwdebug helmfile.d passing?
[18:31:02] legoktm: I think because charts/mediawiki/values.yaml has test data that my chart does not have
[18:31:37] I think the fix for all of this is adding some stub data to the toolhub chart's values.yaml
[18:32:05] Did you try moving your .fixtures file into helmfile.d/ instead of having it in the chart?
[18:32:58] > In addition to this, all service deployments under helmfile.d/services are checked as well. Given some of those would need private data that is not available in testing/development, you can provide a special file called .fixtures/private_stub.yaml to simulate populating such data in deployments.
[18:33:01] helmfile.d has different fixtures, but yeah, I could probably also fix this particular test with stub data there
[18:34:22] > Since you might want to test various features in your charts, helm template will be run both with the default values in the chart and with values provided by any YAML file in the .fixtures/ directory.
[18:34:30] so it does sound like you need working defaults regardless
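[Note: a rough Python approximation of the check described in the quoted docs: render the chart once with its default values and once per fixture file. The chart path is a placeholder, it assumes helm is on $PATH, and the real deployment-charts CI wrapper differs in the details.]

    import glob
    import subprocess

    chart_dir = "charts/toolhub"  # placeholder path

    def render(extra_values=None):
        cmd = ["helm", "template", chart_dir]
        if extra_values:
            cmd += ["-f", extra_values]
        # Raises CalledProcessError if the chart fails to render with these values.
        subprocess.run(cmd, check=True, capture_output=True)

    render()  # the chart's defaults must render on their own
    for fixture in sorted(glob.glob(f"{chart_dir}/.fixtures/*.yaml")):
        render(fixture)  # and with each fixture's values layered on top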
[18:34:41] legoktm: I'm going to eat lunch and figure out a fix when I'm back, but I think adding some test data to helmfile.d/services/toolhub/.fixtures.yaml is the thing this test needs
[18:35:11] sounds good :)
[19:44:46] legoktm: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/716521 is passing! A quick look and +1 would be appreciated when you get time.
[20:07:02] looking!
[20:57:10] bd808: are you going to try to deploy it to staging?
[20:57:55] legoktm: I would like to yes. I fell into a shallow hole of trying to update the README.md for the repo because it is way, way out of date. :)
[21:08:01] readme patch pushed to gerrit for review
[21:08:44] legoktm: is there anything other than https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Deploying_with_helmfile I should be reading before trying to deploy into the staging cluster?
[21:08:59] nope, that's it
[21:09:13] helmfile -e staging -i apply
[21:09:26] ok. well let's see what happens then :)
[21:21:15] legoktm: things happened! and now I need to debug a thing that didn't happen as expected.
[21:21:34] :) :(
[21:21:52] I don't see any new pods in the toolhub namespace
[21:21:55] I sort of assumed the first attempt would find something I missed
[21:22:10] yeah it all rolled back. Error: release main failed: Deployment.apps "toolhub-main" is invalid: spec.template.spec.containers[2].volumeMounts[0].name: Not found: "toolhub-main-mcrouter-config"
[21:22:28] shouldn't be too hard to figure out
[21:29:03] hmmmm... maybe harder than I expected. The config map that it is complaining about is in the diff of things to be applied.
[21:30:01] I wonder if it's an order of operations problem? https://kubernetes.io/docs/concepts/storage/volumes/#configmap says "You must create a ConfigMap before you can use it.".
[21:30:33] How can I tell what order helmfile + helm is provisioning the resources in?
[21:34:50] The interwebs tell me that helm should take care of ordering in a deterministic way to avoid this. So there is something else funky here I guess.
[22:16:37] legoktm: I'm stuck for the moment trying to figure out T290283.
[22:16:37] T290283: `helmfile -e staging -i apply` fails for Toolhub due to missing ConfigMap - https://phabricator.wikimedia.org/T290283
[22:18:22] I'm afk right now, I'll take a look when I'm back
[23:31:29] * legoktm is looking now
[23:42:19] bd808: I posted a theory on the task based on what I have in Shellbox