[09:32:20] errand+lunch
[12:13:02] dcausse: Following up on https://wikidata.beta.wmflabs.org/w/api.php?action=cirrus-settings-dump - will beta wikidata have to be re-indexed for the changes to take effect?
[12:13:40] searching for mul labels still doesn't return any results: https://wikidata.beta.wmflabs.org/w/rest.php/wikibase/v0/search/items?language=en&q=mul-label
[12:14:46] sorry, wrong URL - I wanted to link to the gerrit patch: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/1143780
[12:28:44] Silvan_WMDE: yes, it has to be reindexed, lemme run this
[12:32:44] re-indexing, will take ~10mins to complete; the mul field is now there from what I can see in the newly created index
[12:44:54] Silvan_WMDE: done, it's working now
[12:51:24] cool, thank you!
[13:09:33] FYI kafka-stretch2001.codfw.wmnet has had puppet disabled for 10 days. Hosts should never have puppet disabled for longer periods. The linked task ( T393177 ) seems unrelated. cc inflatador
[13:09:33] T393177: Ensure stat hosts' SSH is responsive regardless of user-generated load - https://phabricator.wikimedia.org/T393177
[13:13:55] volans that's a test server, I've been using it for the above task. Is there a way to disable puppet checks for a server? I can turn it back on, but it'll undo some of the work I've done. Let me know what you prefer
[13:17:49] after a few more days the host will disappear from puppetdb, monitoring and everything, and will be reported as a ghost in a netbox report
[13:18:36] ACK, so it's easier for y'all if I just turn it back on?
[13:18:59] just re-enabled
[13:21:21] inflatador: as referenced also in https://wikitech.wikimedia.org/wiki/Puppet#Maintenance (the warning box at the end of the paragraph), a prod host is supposed to have puppet running all the time. One additional reason is that when people do refactoring, they often do a patch to absent old resources, merge it, wait a bit, and then merge the removal of the code.
If a host has puppet disabled, the
[13:21:27] cleanup will never happen.
[13:23:35] this host is insetup, but I guess it'll still generate alerts for y'all?
[13:29:13] stevemunene: I see puppet disabled on an-worker1177 and an-worker1156 for T390170 but the task is resolved.
[13:29:14] T390170: Upgrade an-worker hard drives from 4TB to 8TB (group 3 - rack F5) - https://phabricator.wikimedia.org/T390170
[13:30:54] inflatador: it's not so much about alerts; it's more that, given how the infra is set up, all hosts in production should have puppet running except for brief periods. That's the general expectation of the tooling and alerting.
[13:31:30] practical example: a new cumin host can't reach it now because it was provisioned after puppet was disabled, and hence the ferm rules were not updated to allow the new host too
[13:32:02] what's the need for having puppet disabled?
[13:32:04] ACK, will do my best to keep it running on all hosts. If it's an insetup host and it's been off for a while, feel free to re-enable it
[13:32:38] I'm experimenting with sysctl settings and I/O schedulers
[13:33:06] There are probably ways to do that without disrupting Puppet though
[13:34:47] anyway, if this is a problem in the future feel free to re-enable Puppet on any insetup host that's pinging. I can always redo the changes
[13:41:16] sure, but better to find a more suitable workflow ;)
[13:42:07] Indeed, that's the goal ;P
[13:46:26] quick errand
[13:55:18] \o
[14:02:06] .o/
[14:05:23] o/
[14:09:45] They fixed the LVS stuff yesterday so we can start reimaging hosts again
[14:50:07] quick errand v2
[15:41:25] inflatador: hey, just wondering, did the lvs/vlan fix clear up the issues you were having?
[15:42:03] topranks yeah, thanks for checking back in!
Scripts are all working and I just reimaged a couple of hosts w/ no issues
[15:42:34] ok cool, yeah, given those racks are fairly new I just wanted to make sure there was nothing else with the connectivity not working, or sub-optimal
[15:42:40] but everything looks ok on my side, from what I can see
[15:42:52] I have added a new checklist for when we add new rack vlans, to make sure the LVS get updated
[15:43:17] though I think the best way forward is to move to using IPIP where we can, and remove that L2 requirement
[15:44:29] Yeah, we were an early adopter of IPIP for Cloudelastic (non-prod env). I remember talking to Valentín about that and he said IPIP wouldn't actually fix the problem? Maybe I'm misremembering
[15:44:44] no, it will
[15:45:09] perhaps there was some other gotcha, but in terms of the L2 adjacency, IPIP gets rid of it
[15:45:24] we've ditched it in all the POPs (the vlan interfaces and the cables across racks), it's working well there
[15:45:56] I think he was saying maybe the switch itself still needs the VLAN, as opposed to the host itself? I dunno, I'm out of my depth, but we are definitely going to move to IPIP at some point in the future
[15:46:18] I'll start a task for that so we don't forget
[15:48:07] cool, thanks
[15:48:29] thanks again for helping out yesterday!
[15:49:18] well, apologies for the oversight that caused the issue!
[15:49:20] but np :)
[16:02:18] workout, back in ~40
[16:19:38] * dcausse is starting to regret pushing an MR per dag I change...
[16:19:52] there are quite a few dags :)
[16:20:34] i dunno, maybe we need a middle granularity? Other teams have sub-dirs that further group dags. I suppose you're kinda doing a middle granularity with the tags as well
[16:21:56] true, the main distinction so far is search vs query_service, but moving search dags to search/search feels odd...
perhaps we need more granularity inside search itself
[16:22:47] lol, yea, search/search would be odd
[17:23:19] Lunch, back in ~40
[17:34:39] dinner
[18:28:47] a meh proposal for sankey edges: https://phabricator.wikimedia.org/F59944900
[18:48:24] sorry, been back
[20:11:51] ebernhardson: I like your sample Sankey diagram! Everyone should take a look and think about what it shows and what it means!
[20:17:03] i hope it captures the interaction points and progression through them, but i have only started trying to get these edges from sessions, may find this is missing things
[20:23:11] Ahh.. it looks good enough that I thought it was already based on real data.
[20:24:03] oh, sadly no. Those are numbers i invented :P
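(Editor's sketch, appended for context.) "Getting these edges from sessions" for a Sankey diagram usually means counting transitions between consecutive interaction points, with each step labeled by its depth so the diagram stays acyclic. This is a guess at the shape of that computation, not the actual pipeline; the function name and the event names are invented for illustration:

```python
from collections import Counter

def sankey_edges(sessions):
    """Count (source, target) transitions between consecutive
    interaction points across all sessions."""
    edges = Counter()
    for steps in sessions:
        # Label each step with its position so a repeated interaction
        # point becomes a distinct node per Sankey column (keeps the
        # graph acyclic even when users revisit the same page type).
        labeled = [f"{step}:{i}" for i, step in enumerate(steps)]
        for src, dst in zip(labeled, labeled[1:]):
            edges[(src, dst)] += 1
    return edges

# Hypothetical session data: ordered interaction points per session.
sessions = [
    ["serp", "click", "serp"],
    ["serp", "click"],
    ["serp", "serp", "click"],
]
edges = sankey_edges(sessions)
# edges[("serp:0", "click:1")] == 2 -- two sessions clicked from the first SERP
```

The resulting counter maps directly onto a Sankey plotting library's source/target/value arrays; the real version would presumably derive `sessions` from event logs grouped by session id.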