[02:36:21] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (RLazarus) p:Triage→High
[03:16:09] serviceops, Performance-Team: Investigate performance degradation at high concurrencies in php-fpm - https://phabricator.wikimedia.org/T293630 (aaron) Using https://gist.github.com/AaronSchulz/28a2cc7701a33adca1479b5ff6530b2c and ab, apcu performance degradation was tested in a number of scenarios on a d...
[06:37:51] serviceops, Release-Engineering-Team, Scap, User-brennen: Deploy Scap version 4.7.0 - https://phabricator.wikimedia.org/T306827 (JMeybohm) a:JMeybohm
[06:55:50] serviceops, Release-Engineering-Team, Scap, User-brennen: Deploy Scap version 4.7.0 - https://phabricator.wikimedia.org/T306827 (JMeybohm) Rolled out to canaries + deploy1002, still super slow as introduced with T305949
[07:13:53] serviceops, MW-on-K8s, Kubernetes, Patch-For-Review, Release-Engineering-Team (Radar): Kubernetes credentials on deployment servers should be available to deployers, not all users - https://phabricator.wikimedia.org/T305729 (JMeybohm) >>! In T305729#7856071, @dancy wrote: > * `WARNING: Kubern...
[08:10:33] serviceops, Patch-For-Review: Rebuild wikimedia-stretch docker image for repository updates - https://phabricator.wikimedia.org/T257327 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff Closing this task, the remaining bits will be cleaned out when Stretch is removed completely.
[09:15:37] serviceops, DC-Ops, SRE, ops-eqiad: Decommission mw13[07-48] - https://phabricator.wikimedia.org/T306162 (fgiunchedi) p:Triage→Medium
[09:44:19] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Jgiannelos)
[09:45:39] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Jgiannelos) I think that the mwscript supports this kind of operation but ad...
[12:01:14] Hi, do you have any input on ^ ? I added some comments to describe the current situation.
[13:04:22] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Tsevener) a:Tsevener
[13:18:05] akosiaris: o
[13:18:07] o/
[13:19:34] I'd like to test https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/786264/ to see if it applies to the new row E/F switch configs, and I am wondering if you have any inputs (like "this is totally wrong!" :D)
[13:24:05] <_joe_> elukey: akosiaris is out of office
[13:24:18] <_joe_> he's back later this week IIRC
[13:24:30] ah lovely, do you want to check a BGP change? :D Otherwise I'll merge and test in my ml cluster
[13:24:47] (Cathal reviewed it, it should do what it is supposed to)
[13:25:26] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Joe) >>! In T301600#7880080, @Jgiannelos wrote: > I think that the mwscript...
[13:30:46] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Jgiannelos) I am not sure which caches it invalidates but articles were fixe...
[13:32:26] it seems that it worked
[13:32:32] I'll update the k8s docs
[13:32:45] <_joe_> nemo-yiannis: do you have an URL that is still broken?
[13:33:03] <_joe_> so that I can figure out which cache you need to clean up and what we should do
[13:37:32] This article still doesn't load on the android app: https://ka.wikipedia.org/wiki/%E1%83%9B%E1%83%98%E1%83%AE%E1%83%94%E1%83%98%E1%83%9A_%E1%83%AF%E1%83%90%E1%83%95%E1%83%90%E1%83%AE%E1%83%98%E1%83%A8%E1%83%95%E1%83%98%E1%83%9A%E1%83%98
[13:41:39] There should be more, i just tested the top articles feed that show up on the app homepage
[13:42:59] <_joe_> so the restbase url that fails is what?
[13:43:45] elukey: if I may suggest something :) AIUI selectors are supported in BGPPeer objects as well. So you could do something like "nodeSelector: failure-domain.beta.kubernetes.io/region == 'eqiad' && failure-domain.beta.kubernetes.io/zone in { 'row-e', 'row-f' }" to not have to list every manually
[13:44:20] assuming you have failure-domain labels set, ofc.
[13:44:21] jayme: hey! I thought you were out so I didn't ping you
[13:44:39] let's say I'm 50-70% back
[13:44:52] one of the biggest mistakes that I made was not to add the failure domain label to the kubelet when restoring the cluster, so I'll have to do it manually
[13:45:11] but anyway, I thought about them but the BGP peer ip changes with every ToR
[13:45:43] oh
[13:45:47] yeah...hmm
[13:45:49] so I basically took the default route for every new node and added the new peer
[13:45:52] yeah :(
[13:46:27] <_joe_> nemo-yiannis: that url returns 200 from restbase
[13:46:50] <_joe_> curl -L "https://ka.wikipedia.org/api/rest_v1/page/html/%E1%83%9B%E1%83%98%E1%83%AE%E1%83%94%E1%83%98%E1%83%9A_%E1%83%AF%E1%83%90%E1%83%95%E1%83%90%E1%83%AE%E1%83%98%E1%83%A8%E1%83%95%E1%83%98%E1%83%9A%E1%83%98" -I
[13:47:02] jayme: if you are only 50% back please rest, we can discuss a refactoring when you are 100% :)
[13:47:03] Yeah, this is what i am checking now. The app shows a 404 but the HTTP request manually works
[13:47:58] elukey: it's fine. Just wanted to say that my availability is still a bit limited. But while I'm working, I'm 100% here :-D
[13:48:41] jayme: ack :)
[13:48:52] so let's talk about k8s 1.23 and calico!
[13:48:55] * elukey is joking
[13:49:42] have you thought about adding yet another label for the rack #?
[13:50:31] something custom I mean (not in failure-domain.*)
[13:51:30] I mean it's not that of an issue right now, but otoh it's one more thing that needs to be done manually when adding nodes
[13:51:42] serviceops, Data-Catalog, Data-Engineering, SRE, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis)
[13:51:45] it could be a way to go, my initial thought was to test the node: selector in the calico config, and then refine as others see the problem..
[13:51:49] <_joe_> yeah I am definitely against adding more manual stuff for adding a node
[13:51:54] <_joe_> there's already enough
[13:52:32] the alternative is to add per-rack configs in our calico peers, and assign special labels to nodes
[13:52:45] so that calico pods, in theory, should just pick up the right one when starting
[13:52:45] looking at profile::netbox::host, it could maybe be automated
[13:53:00] <_joe_> we need per-rack rules?
[13:53:17] <_joe_> I thought it would be per-row
[13:53:32] <_joe_> which is ok I think
[13:53:44] IIUC the new ToRs are layer3 devices, each one of them capable of BGP-peering
[13:54:06] I was hoping for automation from netbox, like taavi said
[13:54:15] so if we add a node in row E/F, like the new ml ones, they will need to peer with their ToR
[13:54:44] <_joe_> let'
[13:54:55] <_joe_> s not focus on the details of implementation
[13:55:08] <_joe_> but rather on what we want this to look like
[13:55:27] <_joe_> if we need to add one manual config per row, I would be ok with that
[13:55:35] <_joe_> I mean in puppet itself
[13:56:31] <_joe_> is that enough?
[13:56:36] serviceops, Data-Catalog, Data-Engineering, SRE, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (BTullis) Should we call this done, or should we leave it open pending an outcome on {T305358}? Many thanks again for all your support with this request @JMeybohm.
[13:56:39] but it would have to be one per rack (in deployment-charts)
[13:56:41] it is one BGPPeer config for each rack
[13:57:00] and we can tell to calico to select pods on nodes with certain labels
[13:57:32] <_joe_> elukey: wait, aren't you mixing two different issues?
[13:57:48] serviceops, Data-Catalog, Data-Engineering, SRE, and 2 others: New Service Request: DataHub - https://phabricator.wikimedia.org/T303049 (JMeybohm) Please keep this open as it is absolutely in a hacky state currently (DNS + service::catalog wise)
[13:58:02] <_joe_> let me take a look at the proposed change
[13:58:40] elukey: that last one I don't get. Why do we want to do that?
[13:59:38] jayme: do you have an alternative? IIUC each TOR in row E/F would need to be listed in the deployment-charts' calico config
[14:00:26] I mean it is what I have done in the code review, but with the node selector (and it works)
[14:00:52] <_joe_> elukey: so basically right now any time we add a node to a new rack we need a patch like that?
[14:01:11] yes, right. But that's a one time thing only. What I wanted to avoid is changing that over and over for new nodes in those rows
[14:01:28] ^ that was for lucas line
[14:02:13] <_joe_> I'm not sure where we can get the peer ip info from
[14:02:14] jayme: yep yep, what I was proposing (from what you wrote above, but I have missed something probably) is to add label selectors instead of the node selector, so that we'll just have to add the proper rack label to a new node for calico to work
[14:02:20] <_joe_> is that in the netbox api?
[14:03:15] elukey: ah, okay. Maybe misunderstanding. The field in calico BGPPeer spec is called "nodeSelector" and is of type "selector" which means you can select nodes based on labels with it
[14:03:20] _joe_ if we don't add new special labels to nodes (like this node is in rack EX etc..) yes, because I've used the node selector (that was only as test to see if all the new BGP configs were fine etc..)
[14:03:37] jayme: exactly yes, sorry I didn't explain myself correctly
[14:04:03] so we're potentially talking about the same thing all the time. I was just confused by you saying "select pods on nodes with certain labels" above
[14:04:15] <_joe_> yeah me too
[14:04:31] <_joe_> so if we just need one config per rack, I think it's ok
[14:04:47] <_joe_> it would be great to autogenerate
[14:05:06] and we need to add the node labels to every new node though, we cannot do it via the kubelet's args at the moment
[14:05:06] (https://phabricator.wikimedia.org/T229397 <- this is the ticket to data from netbox in puppet btw)
[14:05:12] <_joe_> but if we have added proper labels for rack/row to the kubelets
[14:05:26] <_joe_> then it's standable, yes
[14:05:36] <_joe_> elukey: why can't we?
[14:06:00] elukey: I don't think that's correct. It will work via kubelet args as long as the args are there already when the node registers itself
[14:06:10] <_joe_> ^^
[14:06:17] <_joe_> that's also my understanding
[14:06:19] jayme: ack yes yes it makes sense
[14:07:20] <_joe_> I would propose to use something like "wikimedia.org/node-location" as a label.
[14:07:28] <_joe_> Please note this won't work with VM nodes
[14:07:34] <_joe_> like the ones we use for sessionstore
[14:07:46] <_joe_> those can be migrated across racks and IIRC rows
[14:08:13] <_joe_> so I'm not sure how that would work in the new rows in eqiad, if we ever expand ganeti
[14:08:25] ok I'll report this discussion in the task that Cathal created, people should already be subscribed, then we can work on it
[14:08:43] thanks!
[14:08:51] thanks for the brainbounce :)
[14:09:34] fwiw, in case it helps, we're planning to refactor a bit the way ganeti VMs are represented in netbox so to have the ganeti groups too (basically what so far have been rows and are racks in the new network scheme)
[14:10:04] I think for the future we need to re-think calico bgp anyways because this full-mesh system is going to not be ideal anyways when the # of nodes grow
[14:10:32] <_joe_> jayme: you mean in 3 months when we start moving the mw traffic?
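(A minimal sketch of the "one BGPPeer config per rack, selected by node label" idea from the discussion above, which _joe_ said would be great to autogenerate. The label key is the one _joe_ proposed; the rack names, peer IPs and AS number are made-up placeholders, not values from the Gerrit patch or from netbox. The manifest fields `peerIP`, `asNumber` and `nodeSelector` are the ones in Calico's BGPPeer resource.)

```python
import json

# Label key proposed in the discussion; it is an assumption that nodes carry it.
LOCATION_LABEL = "wikimedia.org/node-location"

# Hypothetical rack -> ToR peer IP mapping (documentation-range addresses).
RACK_PEERS = {
    "eqiad-e1": "192.0.2.1",
    "eqiad-f1": "192.0.2.65",
}


def bgp_peer(rack, peer_ip, as_number=64512):
    """Build one Calico BGPPeer manifest applying to nodes labelled with `rack`."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "BGPPeer",
        "metadata": {"name": f"tor-{rack}"},
        "spec": {
            "peerIP": peer_ip,
            "asNumber": as_number,  # placeholder private ASN
            # Calico selector expression: match nodes whose label equals the rack.
            "nodeSelector": f"{LOCATION_LABEL} == '{rack}'",
        },
    }


def all_peers(rack_peers):
    """One manifest per rack, in a stable (sorted) order."""
    return [bgp_peer(rack, ip) for rack, ip in sorted(rack_peers.items())]


if __name__ == "__main__":
    print(json.dumps(all_peers(RACK_PEERS), indent=2))
```

(With this shape, adding a node to an already-known rack would only need the right label on the node, set via kubelet args before it registers; a new rack would still need one new mapping entry.)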
[14:10:34] <_joe_> :P
[14:10:41] yes :D
[14:21:22] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (cscott) Restbase content doesn't have a TTL, so waiting for the TTL expire w...
[14:30:39] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (cscott) Four-ish options, all not great: * Do a simple client-side script th...
[14:37:05] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Jgiannelos) Something interesting that I found while debugging the issue is t...
[14:50:15] serviceops, Infrastructure-Foundations, Scap, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jnuche)
[15:34:42] serviceops, Infrastructure-Foundations, Scap, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jnuche)
[15:36:18] serviceops, Math, RESTBase, Patch-For-Review: \land – Unclear why the page appears in an error-category - https://phabricator.wikimedia.org/T305613 (Physikerwelt) I guess the problem is somewhere near the following passage https://github.com/wikimedia/restbase/blob/ecef17bda6f4efc0d...
[15:39:03] serviceops, Math, RESTBase, Patch-For-Review: \land – Unclear why the page appears in an error-category - https://phabricator.wikimedia.org/T305613 (Physikerwelt) As a side note, I am currently looking into T302628 to simplify the entire setup.
[15:39:50] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Joe) >>! In T301600#7881133, @cscott wrote: > Restbase content doesn't have...
[15:55:37] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Joe) >>! In T301600#7881151, @cscott wrote: > Four-ish options, all not grea...
[16:06:48] serviceops, Release-Engineering-Team, Scap, User-brennen: Deploy Scap version 4.7.0 - https://phabricator.wikimedia.org/T306827 (dancy) >>! In T306827#7879622, @JMeybohm wrote: > Rolled out to canaries + deploy1002, still super slow as introduced with T305949 I created T306915 for further conver...
[16:15:45] Hello, I'm investigating Knative Eventing as part of https://phabricator.wikimedia.org/T306800. Main use case ATM is hydration of MW state events (e.g. new events streams: html wikitext, diffs, etc.), but also ease of use for other stuff in the future.
[16:15:56] Anyone here have thoughts/opinions/experience?
[16:16:22] (I'm a total knative n00b)
[16:19:49] ottomata: https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Knative -- the new ML stack uses it apparently
[16:21:08] elukey: you the right person to talk to ^ about?
[16:21:13] (ty bd808 )
[16:27:34] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (cooltey) p:Unbreak!→Medium
[16:29:14] * elukey hides
[16:29:57] ottomata: yep for sure :) So we currently use only knative-serving, and the version that we have is old (0.18.1) since upstream broke support with our k8s version
[16:30:25] we are planning to upgrade k8s to a version like 1.23 in a near-ish future, after that we'll be able to upgrade to knative 1.x etc..
[16:30:36] that fully supports eventing with a ton of bug fixes
[16:31:10] I am interested in it too since for ML it will surely be helpful in the future
[16:31:33] (say pulling revision create events from kafka and automatically trigger a score, that then publishes a revision score event)
[16:46:06] right, which is the same 'hydration' pattern we are looking at for mw events
[16:46:59] elukey: where is the source for these serving images?
[16:48:20] https://wikitech.wikimedia.org/wiki/Machine_Learning/LiftWing#Images
[17:33:13] serviceops, Page Content Service, Product-Infrastructure-Team-Backlog, RESTBase, and 3 others: REST endpoints cannot handle requests from ka.wikipedia.org with Georgian titles - https://phabricator.wikimedia.org/T301600 (Jgiannelos) It looks like the issue comes from the page/summary API. For exa...
[20:01:08] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Tsevener) @Dzahn I think I'm just about ready for the key handoff, but I had...
[20:12:19] serviceops, SRE, WMF-JobQueue, Sustainability (Incident Followup): Videoscalers fail health checks while CPU is maxed - https://phabricator.wikimedia.org/T306860 (jhathaway) Another option would be to use cpu pinning via taskset(1), where ffmpeg is assigned to cpus 1-N and cpu 0 is left free to s...
[20:23:18] I added a new jobrunner in codfw to production. mw2419. I gave it weight=25. other servers in jobrunner-codfw have weights 10, 20 and 25. But when I look at number of CPUs and amount of RAM with cumin like: cumin 'A:mw-jobrunner-codfw' 'grep processor /proc/cpuinfo | wc -l && grep MemTotal /proc/meminfo' that doesn't necessarily match the existing weights
[20:23:35] but either way the new server has more memory than any other before
[20:24:25] it's the new "Dell PowerEdge R440 - ConfigC 202107 (1U)" type that we have not had before
[20:24:48] MemTotal: 131638240 kB
[20:25:55] 48 processors like previous servers but more RAM. setting to "active" in netbox
[20:26:36] next are app/api servers from https://gerrit.wikimedia.org/r/c/operations/puppet/+/785918
[20:30:36] no, it's actually 40 processors on the latest model
[20:44:49] serviceops, SRE, Patch-For-Review: eventgate chart should use common_templates - https://phabricator.wikimedia.org/T303543 (Aklapper) @Ottomata: A #good_first_task is a self-contained, non-controversial task with a clear approach. It should be well-described with pointers to help a completely new con...
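(A rough sketch of the CPU pinning idea jhathaway raised on T306860 above, i.e. restricting heavy work to every CPU except the first so that one core stays free. This is only an illustration under assumptions: it uses Python's Linux-only `os.sched_getaffinity` / `os.sched_setaffinity` rather than taskset(1) itself, and the helper names are invented here. It is not how the videoscalers are configured.)

```python
import os


def cpus_leaving_first_free(available):
    """Return the CPUs to pin heavy work to, keeping the lowest-numbered one free."""
    cpus = sorted(available)
    # With a single CPU there is nothing to reserve, so keep what we have.
    return set(cpus[1:]) if len(cpus) > 1 else set(cpus)


def pin_current_process():
    """Restrict this process to all currently allowed CPUs except the first.

    Linux-only. Child processes started afterwards inherit the affinity mask,
    so a transcoder launched from here would be confined to the same CPUs.
    """
    allowed = os.sched_getaffinity(0)  # 0 means "the calling process"
    target = cpus_leaving_first_free(allowed)
    os.sched_setaffinity(0, target)
    return target
```

(The selection logic is kept separate from the syscall so it can be checked without touching the affinity of a real process.)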
[21:04:58] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Dzahn) Hello @Tsevener received and replied :)
[21:12:05] serviceops, SRE, Sustainability (Incident Followup): Set API server weights - https://phabricator.wikimedia.org/T304800 (Dzahn) There is a new type of servers now: group D - mw2416, mw2417 and mw2418 - R440 - Xeon Silver 4210R 2.4G - (**40 processors, 128GB RAM**), that's only 40 processors vs 48 bu...
[21:18:55] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Tsevener) @Dzahn thanks - replied again! @Jgiannelos do you know if the servi...
[21:20:32] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Dzahn) I think the safest way to do this is if we add the new key with a new n...
[21:24:29] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Dzahn) I have received the encrypted key and was able to decrypt it.
[21:25:37] ^ adding a new key for the iOS push notification service, but using a new name so they can decide when to switch between keys (or revert) whenever they want
[21:25:45] it was sent to me encrypted
[21:26:52] well.. if I find it in private repo that is
[21:43:58] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Dzahn) I have found the place in the private repository that has the old key....
[21:54:35] serviceops, Parsoid (Tracking), Patch-For-Review, User-jijiki: Remove parsoidJS leftovers from production - https://phabricator.wikimedia.org/T279059 (Dzahn)
[21:55:57] serviceops, Parsoid (Tracking), Patch-For-Review, User-jijiki: Remove parsoidJS leftovers from production - https://phabricator.wikimedia.org/T279059 (Dzahn) I saw the open checkbox "remove puppet module" and then noticed the module was almost gone except that one template used by parsoid test ho...
[21:56:08] ^ deleted the parsoid(-js) puppet module for real
[21:57:19] serviceops, Parsoid (Tracking), Patch-For-Review, User-jijiki: Remove parsoidJS leftovers from production - https://phabricator.wikimedia.org/T279059 (Dzahn) Open→Resolved a:Dzahn boldly claiming it's resolved - correct me if I'm wrong please
[22:06:56] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Tsevener) Thanks @Dzahn! I just sent you a new one, this time including the ke...
[22:14:04] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Dzahn) Hello all. so... after looking at the existing setup I would like to po...
[22:48:27] serviceops, Generated Data Platform, Image-Suggestions, SRE, and 2 others: Blubber setup for Image Suggestions Service - https://phabricator.wikimedia.org/T305155 (Dzahn) for updates here also see T304891#7869885 It seems you have already requested the Gerrit repo.