[04:14:28] Krinkle: yes, they are replicating from the current hosts, in 20 days we'll promote them to active on db-eqiad.php
[04:14:33] (same with codfw)
[06:31:59] Hello folks, I see multiple transport links down
[06:32:22] - Telia cr1 eqiad -> codfw down, seems due to maintenance
[06:32:34] - Lumen codfw -> ulsfo
[06:33:09] - Lumen eqiad -> esams
[06:35:12] From my very ignorant point of view the situation looks stable redundancy-wise, but usually we don't have so many links down
[06:37:53] mmmm also Lumen's maintenance seems to have been cancelled (at least from gcal), checking emails
[06:39:19] no ok I found some Lumen correspondence for maintenance postponed to today for both links
[07:01:13] topranks: --^
[07:01:42] Speaking of topranks - awesome analysis for the ML team: https://phabricator.wikimedia.org/T287238#7237293 (two comments in the task, worth reading)
[07:18:32] (one link seems to have recovered in the meantime)
[08:10:46] Just logged in - checking the transports now.
[08:14:35] I see you beat me to it - yep Lumen wasn't updated in the calendar. Thanks for that!
[08:21:42] <_joe_> elukey: damn yeah we were aware of that issue with iptables but somehow we didn't remember about it when you went with buster
[08:23:00] _joe_ learned a lot in the process, so I am happy about it :)
[08:23:29] I was wondering if wmcs folks had the same issue with Buster, I found kubeadm::calico_workaround in puppet
[08:34:37] dcaro: if by any chance you're around I have a question for you wrt some cookbooks
[08:34:56] volans: I'm around
[08:35:39] as we've migrated all usage of Icinga to IcingaHosts, I was planning to drop it from Spicerack
[08:36:03] nice
[08:36:09] and noticed it's still imported/used by 2 wmcs cookbooks in the master branch. Is it correct to assume the wmcs/ path in the master branch is "stale/outdated"?
[08:36:44] not all of it
[08:36:51] only the parts changed in the wmcs branch
[08:37:23] let me rephrase, is anyone in WMCS running the master version of those cookbooks?
[08:37:32] or are the ones in the wmcs branch the authoritative ones?
[08:38:29] I think that some of the cookbooks are still run from cumin, those use the master branch (not sure if they have been run lately though)
[08:38:59] I've seen add_wiki being run, that's for sure
[08:40:27] to reduce confusion should we remove from the master branch those that are outdated / will not work anyway if run in prod?
[08:46:31] what's the plan on merging the branch?
[08:46:52] (if it's going to take long, then we should plan on an alternative I guess)
[08:50:53] I think that's part of broader discussions that are ongoing, right now I'd like to address the current status of the branch and the fact that some cookbooks live in both branches with two different versions, of which I guess only one is correct
[08:51:19] for example https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/master/cookbooks/wmcs/openstack/cloudvirt/unset_maintenance.py vs https://gerrit.wikimedia.org/r/plugins/gitiles/operations/cookbooks/+/refs/heads/wmcs/cookbooks/wmcs/openstack/cloudvirt/unset_maintenance.py
[08:52:00] sure, for any duplication, prefer the wmcs branch
[08:52:20] should we drop those from master then?
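
For context on the Icinga -> IcingaHosts migration volans mentions above, here is a minimal sketch of what a cookbook using the newer interface might look like. It is illustrative only: the accessor and method names (icinga_hosts, downtimed, admin_reason) are assumed from the Spicerack API of that era and are not taken from this log.

```python
# Minimal sketch of a Spicerack cookbook using IcingaHosts rather than the
# older Icinga accessor. Assumption: spicerack.icinga_hosts() and the
# downtimed() context manager exist with roughly these signatures.
from datetime import timedelta


def run(args, spicerack):
    """Downtime the target hosts on Icinga while doing some maintenance."""
    remote_hosts = spicerack.remote().query(args.query)
    reason = spicerack.admin_reason('example maintenance')  # logged to SAL
    icinga_hosts = spicerack.icinga_hosts(remote_hosts.hosts)

    # The downtime is removed automatically when the block exits.
    with icinga_hosts.downtimed(reason, duration=timedelta(minutes=30)):
        remote_hosts.run_sync('true')  # placeholder for the real maintenance
```
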
[08:53:01] so that those remaining in master are the ones to be used from master
[08:54:31] sure
[08:54:55] make sure to rebase the wmcs branch on top (and add a 'restore cookbooks' patch at the base of it)
[08:56:11] but I don't know which ones in wmcs/ are supposed to be run from master, apart from add_wiki
[08:56:24] any that has a newer version on wmcs
[08:56:57] wait, the other way around xd, the ones that run from master are any that does not have a newer version on the wmcs branch
[08:58:49] I'm not sure that's always true, you might have developed one before the branch, supposed to run locally, and never edited it :) but surely that's where I would start
[09:03:18] my internet flapped... not sure my last message was sent, in any case, let me know if you want any more input (not sure if there was a question in your last message)
[09:03:37] last I got from you was 'wait, the other way around xd...'
[09:03:40] to which I replied
[09:03:47] I'm not sure that's always true, you might have developed one before the branch, supposed to run locally, and never edited it :) but surely that's where I would start
[09:04:41] I was looking right now and for example some toolforge ones are identical in master and wmcs but I suspect they would work only locally
[09:06:21] anything that tries to connect to any VM will only work from the laptop, as the issue is the intentional lack of access from the cumin hosts to the VMs
[09:07:07] so I'd say most of the toolforge and vps modules
[09:07:45] can I assume all of them? or should I check them one by one? or can you tell me which ones to keep?
[09:07:49] but the others have been improved also
[09:08:05] (or most of the others)
[09:08:06] *by "all of them" I meant toolforge/ and vps/
[09:09:18] the thing is, that we will make any changes to our cookbooks only on the wmcs branch
[09:10:12] so any cookbook you leave behind will eventually be changed in the wmcs branch, especially if it takes time to merge it (which it seems it will)
[09:10:30] are you suggesting to drop the whole wmcs/ directory from master?
[09:11:48] yep, and allowing the wmcs branch to be on the cumin hosts too, for the wmcs cookbooks
[09:12:31] there is only one that works AFAICT, wikireplicas/add_wiki.py
[09:12:55] any related to ceph/cloudvirt should work also
[09:13:08] (have not tested them)
[09:15:03] none of those exist in master, and I bet they depend on "modules" added to the wmcs branch
[09:15:29] * volans has a hard-stop in 5 minutes FYI, sorry in advance
[09:15:45] they should work on the cumin hosts though
[09:16:08] but having the wmcs branch on the cumin hosts would be equivalent to merging everything back into master as-is
[09:17:08] or not, depends on how you deploy it (most people could still be using the master branch, only wmcs would use the wmcs branch for specific cookbooks)
[09:20:42] de facto circumventing the reasons for which we have the branch in the first place though ;)
[09:21:59] I think we have different goals for that branch xd
[09:22:47] if the branch was meant to prevent wmcs from running their cookbooks, well, that did not work out (I can reach any host from my laptop)
[09:23:39] if the goal was to prevent other teams from using wmcs code, then having a different path on the cumin hosts for the wmcs branch should suffice
[09:26:06] the goal was to temporarily unblock wmcs development, deferring the proper implementation of the needed modules to be included into spicerack to a bit later
[09:26:53] sorry, really gotta go right now, I'll be back in ~1h
[09:27:22] okok, then deploying that branch on cumin does not seem conflicting with that goal
[09:27:34] I'll send a patch for the cookbooks based on what I find in git and we can take it from there
[10:43:08] volans: I'd recommend having a plan first, before removing any cookbooks, feel free to ping me/set up a meeting to discuss
[10:58:42] dcaro: I think that the plan is simple, avoid duplicated stale cookbooks in master that are developed in the wmcs branch. So keeping in master just those that are supposed to be run from the cumin hosts and for which the copy in master is the authoritative one. To the best of my knowledge, looking at SAL and spicerack's logs, that's only wikireplicas/add_wiki.py.
[11:04:35] <_joe_> can I ask for someone to +1 https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/708255 ?
[11:04:53] <_joe_> it will allow for mw on k8s to be reachable from the wikimedia-debug extension
[11:10:16] looks good to me :)
[11:10:49] at least from the caching layer perspective that's the X-Wikimedia-Debug value that ats expects :)
[11:17:10] <_joe_> vgutierrez: uhm it doesn't seem to work, I'm perplexed
[11:23:02] uh
[11:30:50] <_joe_> vgutierrez: https://gerrit.wikimedia.org/r/c/operations/puppet/+/708274
[11:41:25] <_joe_> and it now works
[11:42:35] <_joe_> you can now browse the wikis with a slightly outdated version of production code using the wikimedia-debug browser extension and selecting 'k8s-experimental' from the dropdown menu
[11:42:54] <_joe_> just don't expect everything to work correctly, but if you find some glaring issues, let me know
[12:21:36] _joe_: i can browse most stuff fine. My global.js page gives a runtime error though (so does my funky backend performance script, but I expected that).
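
As a side note on the debug routing being tested above: the WikimediaDebug browser extension sets the X-Wikimedia-Debug request header, which the caching layer (ATS) inspects to route the request to the selected backend instead of the normal appserver pool. A rough sketch of exercising that by hand follows; the exact header value ATS expects for the k8s backend is not shown in the log, so the 'backend=k8s-experimental' value below is only an assumption based on the dropdown name.

```python
# Rough sketch: send a request with an X-Wikimedia-Debug header so ATS routes
# it to a debug backend instead of the normal appservers. The header value
# format is an assumption; check the WikimediaDebug documentation for the
# value ATS actually expects.
import requests

resp = requests.get(
    'https://en.wikipedia.org/wiki/Special:BlankPage',
    headers={'X-Wikimedia-Debug': 'backend=k8s-experimental'},  # assumed value
    timeout=10,
)
print(resp.status_code)
# Inspecting the response headers usually reveals which backend actually
# served the request, which is a quick way to confirm the routing worked.
```
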
[12:21:46] anyway i need to do actual work
[12:22:32] <_joe_> RhinosF1: as i said, some things are probably out of sync
[12:24:32] Yeah I'm sure it filled up the logs for you if that was of use
[12:24:49] Browsing pages seemed fine
[12:32:53] <_joe_> heh logs are something I still have to work on, to get them into a separate log file
[12:36:22] volans: Can you please sync that plan with jobo and balloons? thanks.
[12:38:45] dcaro: sure
[12:44:01] * balloons reads backscroll
[12:49:02] I'll open a task, should be quicker balloons ;)
[13:10:58] dcaro, balloons, jobo: I've created T287465 for the above.
[13:10:58] T287465: Cookbooks repository: avoid stale code in master branch - https://phabricator.wikimedia.org/T287465
[13:17:15] Reading through the thread, I'll check on the task and comment. Thanks both.
[14:29:18] Whenever we make changes or send patches like the above move from Icinga to IcingaHosts, we need to do so for all cookbooks. We should definitely avoid having to update dead or obsolete code along the way, so the proposed path forward https://phabricator.wikimedia.org/T287465 sounds like the most reasonable way to go to me.
[14:50:17] the code is not dead, it's frozen, and still changed in the wmcs branch, so it has to be changed there too
[14:52:09] (as in, that Icinga change will have to be propagated to the other branch in any case)
[14:55:05] Quick reminder I will be adjusting the buffer config on switches in eqiad row B in 5 minutes
[14:57:22] thanks topranks! The WMCS team decided to white-knuckle it rather than shutting things down, so we'll all be watching and holding our breath :)
[14:57:50] seat of the pants stuff :D
[14:58:47] yeah, any preventative measures we could think of seemed more dangerous than doing nothing
[15:00:25] yeah probably the best approach.
[15:00:30] Ok I will log on and prep the config
[15:01:39] Executing change to egress buffers...
[15:02:30] Executing change to ingress buffers...
[15:03:07] Complete.
[15:03:18] looking good on my end so far, thanks
[15:03:34] same
[15:03:36] great yep, one of the hosts I was pinging was a ceph server, no ping loss.
[15:03:54] I'm re-enabling Puppet
[15:04:57] \o/
[15:04:59] cool \o/
[15:05:03] nice job again!
[15:05:55] puppet is enabled again
[15:07:22] <_joe_> topranks: you're doing it all wrong, you're spoiling people. Now when a maintenance does cause issues, everyone will be disappointed
[15:07:42] <_joe_> I would shut down one of the switches for a few minutes just to tamper the enthusiasm
[15:07:57] haha... that will probably be Thursday, like you say the JunOS demons are just lulling us into a false sense of security :)
[15:07:59] <_joe_> /tamp/temp/
[15:08:03] haha
[15:09:49] All seems good as far as I can tell, nothing in the logs looks bad, LibreNMS has polled and the switch is healthy.
[15:10:03] No issues with endpoint connectivity, mac tables etc are the same as before.
[15:11:20] topranks: thanks, I am going to revert the proxy changes then (as they are needed for row A maintenance)
[15:11:41] marostegui: great yes please do :)
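
The "watching and holding our breath" above can be as simple as continuously pinging a few hosts behind the switch being touched and flagging any loss. A minimal sketch, assuming nothing about the actual setup; the hostnames are placeholders, not the hosts anyone was really pinging.

```python
# Minimal sketch of watching a handful of hosts for packet loss during a
# switch maintenance window. Hostnames are placeholders; stop with Ctrl-C.
import subprocess
import time

HOSTS = ['host1.example.wmnet', 'host2.example.wmnet']  # placeholder names

while True:
    for host in HOSTS:
        # One ping with a 1-second timeout; a non-zero exit code means loss.
        result = subprocess.run(
            ['ping', '-c', '1', '-W', '1', host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        status = 'ok' if result.returncode == 0 else 'LOSS'
        print(f'{time.strftime("%H:%M:%S")} {host} {status}')
    time.sleep(1)
```
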