[07:36:12] morning
[07:36:33] morning
[07:37:06] mmm... ceph woke up troubled, mons cloudcephmon1001,cloudcephmon1002,cloudcephmon1003 are using a lot of disk space, looking
[09:17:07] dcaro: I'm moving the main cloud vps novaproxy floating IP to use a keepalived-backed port for https://phabricator.wikimedia.org/T316982. I don't think it will page, but fyi regardless
[09:17:34] thanks
[09:19:31] {{done}}
[09:20:55] everything looks ok so far
[10:22:30] dcaro: taavi: hi, I have pinged you on a Slack thread filed by Kara from WMDE
[10:22:52] they have lost access to a toolforge group and could use some developers to be added to it
[10:23:10] I don't know anything about the process :D
[10:23:20] * taavi looks
[10:23:58] most probably Kara, as a WMDE engineering manager, can be added as a maintainer of the tool and from there manage the group membership
[10:24:53] hashar: this is the admin channel, please direct folks to #wikimedia-cloud and not here :/
[10:27:53] OH NO
[10:28:20] I have amended the message
[10:28:34] <3
[10:28:49] and dcaro added them to the group \o/
[10:28:50] success!
[10:28:59] thanks for the quick reply/acting
[10:29:05] split brain! xd
[10:29:25] I am off for lunch &
[10:29:59] yeah I don't like how that was handled in private without any public traces at all
[10:31:02] well you can ask them to file a task to record the decision publicly
[10:31:22] anyway I gotta cook! :D
[10:33:05] just requested to create such a task
[10:37:52] T348968 done
[10:37:53] T348968: Adoption request for Item Quality Evaluator - https://phabricator.wikimedia.org/T348968
[10:38:00] * dcaro lunch
[12:40:12] doing some archeology in Gerrit, I found a series of changes from 2017 for the Puppet module `gridengine`, which no longer exists.
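Stale changes like these can be abandoned from the Gerrit SSH CLI rather than one by one in the web UI; a minimal sketch, assuming SSH access to gerrit.wikimedia.org, with `USER` and `CHANGE` as placeholders for a real Gerrit username and change number:

```shell
# Sketch only: abandon one stale Gerrit change over the SSH API.
# USER and CHANGE,1 (change number, patchset) are placeholders taken
# from the topic:gridengine query results.
ssh -p 29418 USER@gerrit.wikimedia.org \
  gerrit review --abandon --message '"gridengine module was removed"' CHANGE,1
```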
[12:40:19] I guess they can be abandoned https://gerrit.wikimedia.org/r/q/topic:gridengine+is:open
[12:40:55] they are attached to T162955 which is marked resolved
[12:40:56] T162955: rebuild tools-grid-master as a large instance - https://phabricator.wikimedia.org/T162955
[13:08:04] hey I think I messed up something trying to spin up a standalone puppet server on horizon, I was following https://w.wiki/7oWD and ended up with an error https://usercontent.irccloud-cdn.com/file/gMuRrpxI/image.png, have I missed some documentation or messed something up?
[13:13:30] arnaudb: yeah it is bugged
[13:14:42] what I do is I keep the local puppet master pointed at the global WMCS Puppet master
[13:17:47] hashar: you mean you rsync from the WMCS puppet master to your standalone instance?
[13:18:24] * hashar fights with Horizon's 2FA
[13:19:06] ah
[13:19:17] so for the CI agents running in the `integration` project, I have a local Puppet master
[13:19:31] but its puppet agent runs from the global WMCS master, not from the local puppet master
[13:19:37] classes: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/refs/heads/master/integration/integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud.roles
[13:19:43] hiera values: https://gerrit.wikimedia.org/r/plugins/gitiles/cloud/instance-puppet/+/refs/heads/master/integration/integration-puppetmaster-02.integration.eqiad1.wikimedia.cloud.yaml
[13:20:07] so the instance has `role::puppetmaster::standalone` to make it a puppetmaster
[13:20:20] `role::puppetmaster::standalone::autosign: true` to autosign certificate requests from agents attaching to it
[13:20:50] AND `puppetmaster: puppet` which instructs the agent running on that puppet master to use `puppet` as the master, and hence rely on the global WMCS Puppet master instead of itself/localhost
[13:21:12] but it might be fine to attach
[13:21:46] the reason for the error message is, I think, that the puppet agent first ran against the global WMCS master and stored its cert
[13:21:57] then once Puppet ran, that switched it to attach to your local puppet master
[13:22:03] and you end up with a certificate mismatch
[13:22:09] looks like it indeed
[13:22:13] hashar: thanks for the info, will try to debug myself out!
[13:22:23] so you can run the puppet cert clean command which will cause the Puppet master to forget that agent
[13:23:19] and the other find command is to delete the certificate remembered by the agent
[13:23:29] `server = puppetmaster.cloudinfra.wmflabs.org` yep, it seems that I was still attached to that master indeed
[13:27:01] if you clean the certs, I think that will be fine
[13:27:20] and the agent on your puppetmaster instance will run against the puppet master running on it
[13:28:02] I'll try after our weekly
[13:28:06] thanks!
[13:37:01] ;-]
[13:48:46] if the jobs framework supported it, what would you think about making the `webservice` command run the webservices as a job instead of talking to k8s directly?
[13:52:25] In my mind we will move to an 'app' concept that will merge both jobs and webservice, in the sense that you have an app, and you can run it once, run it periodically or continuously (current jobs), and when run continuously, you can open ports internally (not explicitly supported now), and expose one as http to the world (that's the current webservice)
[13:53:06] point being that I think that a continuous job and a webservice should only be different in the port they expose, the latter being http and accessible from https://.toolforge.org
[13:54:25] I'm trying to articulate a proposal to create that "app" service and move the webservice functionality there, and then the jobs functionality (or at least parts of it)
[13:56:25] dhinus: did you start working on fixing the tests for wmcs-cookbooks?
[13:56:47] can I help?
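The certificate cleanup hashar describes above usually amounts to two steps: make the master forget the agent, then make the agent forget its cached cert. A hedged sketch (the FQDN is a placeholder, the SSL path is the Debian default, and the exact commands vary by Puppet version):

```shell
# On the puppet master that should forget the agent
# (placeholder FQDN -- substitute the real instance name):
sudo puppet cert clean myinstance.myproject.eqiad1.wikimedia.cloud

# On the agent, delete its locally cached certificates so a fresh
# one is requested on the next run (adjust the path if needed):
sudo find /var/lib/puppet/ssl -name "$(hostname -f)*" -delete
sudo puppet agent --test
```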
[13:57:49] not yet, I went down a rabbit hole about T348668 :)
[13:57:49] T348668: Trove instances not being created or restarted with configuration group applied - https://phabricator.wikimedia.org/T348668
[13:58:13] xd, okok, I might give it some time then
[13:58:25] sounds good, feel free to claim the task as well
[13:59:34] dcaro: right. that's the basic idea I had as well, just calling things with different names. that's helpful, thanks
[14:00:14] I'm not married to any naming yep, just trying to express that it's not what we currently call a job, not a webservice
[14:00:20] *nor
[14:01:12] what do you mean by "not what we currently call a job"? just that the current jobs framework can't do any networking (k8s service) stuff?
[14:02:17] not only, currently jobs are kinda specific, in my mind an app keeps some info bundled together, like the repository the code is in, the build configuration, etc.
[14:02:29] things that a current job does not care about
[14:59:48] can I get a +1 on this quota increase request? https://phabricator.wikimedia.org/T348441
[15:00:18] done
[15:02:44] thanks
[16:01:46] dhinus: there's some pretty obvious breakage in codfw1dev with designate-sink when cleaning up DNS records for deleted VMs. Want me to fix it and cc you, or save it for you to fix?
[16:05:10] please go ahead and cc me
[16:05:27] anyone want me to mention something about WMCS in the SRE meeting that's just started?
[16:06:17] maybe mention the ceph breakage of hard drives and the pending incident review for the dhcp client removal
[16:06:52] good one. I'll also mention Object Storage if andrewbogott agrees
[16:07:19] +1
[16:07:22] dhinus: yep, you can use the doc link in the announce email.
[16:07:37] dhinus: https://wikitech.wikimedia.org/wiki/Help:Object_storage_user_guide
[16:09:37] thanks
[16:10:11] dcaro: what's the plan for the incident review? is a date already set or is it pending discussion?
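To make the "app" idea discussed above concrete, here is a purely hypothetical CLI sketch: none of these commands or flags exist today, the names are invented only to illustrate the merged jobs/webservice model (run once, on a schedule, or continuously, with optionally exposed ports):

```shell
# Hypothetical only -- one "app" abstraction covering the cases
# currently split between the jobs framework and webservice.
toolforge app run mytool --once                    # one-off run (current job)
toolforge app run mytool --schedule "0 * * * *"    # periodic run (current cron job)
toolforge app run mytool --continuous --port 8000  # continuous, internal port only
toolforge app run mytool --continuous --port 8000 --public-http  # current webservice
```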
[16:10:56] no date yet that I know of, they should come back to us
[16:11:52] wait, actually, the date should be 2023-10-23, though they should reach out to coordinate
[16:12:53] do you have the phab link handy for the disk issues?
[16:13:21] yep, one sec
[16:13:28] https://phabricator.wikimedia.org/T348643
[16:13:41] thanks
[18:39:40] taavi: is the proxy endpoint in codfw1dev expected to be down? (just asking before I dive in)
[18:40:09] andrewbogott: no
[18:40:20] ok, then I will investigate
[18:41:13] hm, that's a new one: 'AttributeError: '_FakeStack' object has no attribute '__ident_func__''
[18:41:28] I assume it's actually that it can't contact the db
[18:43:39] oh, it just needs a package upgrade
[18:52:33] topranks: my connection to codfw1dev VMs keeps flapping. It seems to be specific to that network, my connection to prod servers is stable. Is that something you know about?
[18:57:53] Hm, and now there's schema disagreement. taavi I'm guessing you updated the schema in eqiad1 but not in codfw1dev?
[18:58:45] andrewbogott: oh, right, definitely possible, I did deploy the project name to id migration patch quite recently
[18:59:30] I got slightly lost with the rename you made but I can figure it out if you're not working at the moment.
[18:59:59] this is what you need to run on codfw's equivalent of cloudinfra-db03: https://phabricator.wikimedia.org/P52977
[19:02:32] thx
[19:03:03] ...if only the network stays up for long enough...
[19:10:30] seems not to want to rename indexes
[20:02:50] andrewbogott: yeah in tests I'm seeing some packet loss
[20:03:42] ultimately it looks similar to last week, we really need to follow up and fix T348140
[20:03:43] T348140: Change cloud-instance-transport vlan subnets from /30 to /29 - https://phabricator.wikimedia.org/T348140
[20:17:15] topranks: when you say 'we' -- are there action items for wmcs folks still?
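The intermittent packet loss described above can be quantified with standard tools; a sketch, using the codfw1dev bastion floating IP that appears elsewhere in this log (any reachable VM in the affected network would do):

```shell
# Sketch: measure loss toward a codfw1dev VM.
# 185.15.57.2 is the bastion floating IP mentioned in this log.
ping -c 100 -i 0.2 185.15.57.2 | tail -2

# Or per-hop loss, to see whether drops start at cloudgw/cloudnet:
mtr --report --report-cycles 100 185.15.57.2
```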
[20:17:45] There are action items for whoever is in charge of deploying and maintaining the OpenStack Neutron component
[20:18:13] oh dang, that's probably me now :(
[20:18:34] I meant "we" as in "us" though :)
[20:18:57] it's clear enough there aren't docs on the setup, and nobody is 100% sure of how it's done, despite the required changes being fairly clear
[20:19:07] I am in the middle of another thing right now but may bug you about that tomorrow.
[20:19:35] we can pick it up tomorrow and see about doing it, from what we discussed last week it's hopefully just deleting the "port" and attached subnet, then re-creating both
[20:20:16] that seems possible :)
[20:20:17] FWIW I left a ping running on cloudgw in a screen session to try and keep the arp cache current on it, and performance seems better than it was then
[20:20:51] although I see this kind of thing:
[20:20:58] whatever I was seeing just now was unusably bad but that might turn out to be a different issue
[20:21:03] https://www.irccloud.com/pastebin/0fbXsQV7/
[20:21:18] nah there is some issue, I wonder if something is happening on the cloudnet side
[20:24:04] The ping was to 185.15.57.2, which is bastion.bastioninfra-codfw1dev.codfw1dev.wmcloud.org
[20:25:05] The "destination unreachables" I see coming back are from cloudnet2005-dev, which should NAT traffic for that IP to 172.16.128.19
[20:32:37] sry, .11 in the above is the cloudgw side, so ignore that; cloudnet<->bastion vm seems ok, the issue still appears to be the arp thing
[21:03:56] sorry topranks, various life things are happening and I have to ghost for now. Might or might not be available to work tomorrow. Thanks for the info, I'll save the backscroll :)
[21:43:35] andrewbogott: ok, hope all is ok. I'll ping david and ta.avi tomorrow and see if we can work it out
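The "delete the port and attached subnet, then re-create both" plan for T348140 might look roughly like this with the OpenStack CLI. This is a sketch only: the resource names and network are placeholders, and the actual /29 range and fixed-IP details must be confirmed against the live Neutron config before touching anything:

```shell
# Sketch only -- placeholder names; verify against the live config first.
openstack port delete cloudinstances-transport-port
openstack subnet delete cloudinstances-transport-v4

openstack subnet create cloudinstances-transport-v4 \
  --network wan-transport-codfw \
  --subnet-range A.B.C.D/29 \
  --no-dhcp
# A.B.C.D/29 is a placeholder for the agreed replacement range.

openstack port create cloudinstances-transport-port \
  --network wan-transport-codfw \
  --fixed-ip subnet=cloudinstances-transport-v4
```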