[06:48:57] ottomata: the images are in the production-images repository, we take the upstream docker files and we rebuild them
[07:15:07] hello folks
[07:15:32] I added some info to https://phabricator.wikimedia.org/T306649, and the full mesh setup change that Janis was talking about yesterday may need to happen sooner rather than later :D
[07:16:07] let me know your thoughts
[07:19:26] <_joe_> elukey: I'm not sure I get where that would emerge
[07:19:35] <_joe_> (the full mesh setup change that Janis was talking about yesterday may need to happen sooner rather than later)
[07:20:19] _joe_ Cathal wrote that ml-serve1005 tries to establish BGP sessions with the core routers at the moment, failing
[07:20:30] since it is configured by Calico with a global config
[07:20:38] <_joe_> ah I was reading your comments only lol
[08:49:58] elukey: I thought about that yesterday night as well (the global sessions). Apart from excluding the nodes in "special" rows from the global sessions I kind of remember there are some rules about the order in which calico applies which sessions - but I can't seem to find that
[09:01:51] o/
[09:01:54] oh hi
[09:02:17] I see there is backlog to read
[09:03:10] o/
[09:03:15] a lot of calico joy
[10:43:14] serviceops, Release-Engineering-Team, Scap, User-brennen: Deploy Scap version 4.7.0 - https://phabricator.wikimedia.org/T306827 (JMeybohm) Open→Resolved >>! In T306827#7881534, @dancy wrote: >>>! In T306827#7879622, @JMeybohm wrote: >> Rolled out to canaries + deploy1002, still super slow...
[10:56:28] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (JMeybohm) >>! In T288546#7882754, @Dzahn wrote: > CCing @JMeybohm based on inv...
[11:52:03] serviceops, Math, RESTBase, Patch-For-Review: \land – Unclear why the page appears in an error-category - https://phabricator.wikimedia.org/T305613 (WDoranWMF) @Physikerwelt What is the severity/urgency for this issue?
[12:38:44] serviceops, Math, RESTBase, Patch-For-Review: \land – Unclear why the page appears in an error-category - https://phabricator.wikimedia.org/T305613 (Wurgl) Not really severe. I was just confused, why a tiny small article is in some category of errorness articles, when there is no err...
[13:56:37] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Tsevener) @Dzahn @JMeybohm I'm happy to generate another key for another envir...
[14:44:37] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (Jdforrester-WMF)
[14:45:03] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[14:49:34] serviceops, SRE: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (Jdforrester-WMF)
[14:50:33] serviceops, Release-Engineering-Team, Scap, User-brennen: Deploy Scap version 4.7.0 - https://phabricator.wikimedia.org/T306827 (dancy) Thanks!
[14:50:46] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[14:57:29] serviceops, Release-Engineering-Team, Scap: Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 (dancy)
[15:03:27] serviceops, Release-Engineering-Team, Scap: Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 (elukey) This is due to a new feature for `git-lfs` on Buster, so probably only relevant to the ORES nodes (we are going to upgrade them to Buster). I can take care of the new package, an...
[15:18:09] serviceops, SRE: Migrate node-based services in production to node14 - https://phabricator.wikimedia.org/T306995 (Jdforrester-WMF)
[15:45:45] serviceops, SRE: Provide node14 images for running production node-based services - https://phabricator.wikimedia.org/T306996 (bd808) From https://nodejs.org/en/about/releases/: |Release |Status |Initial Release |Active LTS Start |Maintenance LTS Start |End-of-life | | --- | --- | --- | --- | --- | --- |...
[16:05:13] serviceops, Infrastructure-Foundations, Scap, Patch-For-Review, Release-Engineering-Team (Priority Backlog 📥): Use scap to deploy itself to scap targets - https://phabricator.wikimedia.org/T303559 (jnuche)
[16:16:29] I was away when the meeting relating to this happened but I'm getting started on writing the chart for this service - could someone explain what the idea was behind a generic chart for cassandra-http-gateway-based apps and how it might be implemented (something in common_templates?)? https://phabricator.wikimedia.org/T304891#7823946
[16:19:05] hnowlan: AIUI the idea was basically to have a generic chart for "this kind of things" (cassandra-http-gateway-things)
[16:20:16] so nothing special in case of chart creation - just don't create a "image-suggestion" chart but rather a "cassandra-http-gateway" chart
[16:20:23] ahhh ok
[16:20:25] makes sense
[16:31:15] serviceops, Scap, Release-Engineering-Team (Radar): Deploy Scap version 4.7.1 - https://phabricator.wikimedia.org/T306998 (thcipriani)
[16:38:21] <_joe_> hnowlan: a la shellbox
[16:38:36] <_joe_> what will change between deployments will be the docker image I guess
[16:57:12] sgtm
[17:37:44] serviceops, Product-Infrastructure-Team-Backlog, Wikipedia-iOS-App-Backlog, iOS-app-v6.9-Carp-On-A-Zamboni: Rotate APNS key before deploying Push Notifications to Production - https://phabricator.wikimedia.org/T288546 (Dmantena) > There are entirely separate settings for the "staging cluster" and...
[18:32:24] serviceops, Math, RESTBase, Patch-For-Review: \land – Unclear why the page appears in an error-category - https://phabricator.wikimedia.org/T305613 (WDoranWMF) @Wurgl Thanks, that's super helpful to know! To be transparent, RESTBase is in a deprecated state and it is going to be diff...
[18:41:24] so the new gitlab-* hardware has arrived in codfw as well
[18:41:52] now being asked which partman recipe to use.. that might not be obvious yet since we only had VMs so far. I will look at the options though
[18:42:24] checking if any was used in eqiad already or not
[20:44:04] serviceops, DC-Ops, SRE, ops-eqiad, and 2 others: Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Dzahn) You should be unblocked to install OS. partman recipe set to raid1-dev.
[21:27:43] serviceops, SRE: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn)
[21:27:51] serviceops, SRE: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn) Resolved→Open
[21:28:17] serviceops, SRE: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn) a:Papaul→Dzahn
[21:28:52] serviceops, SRE: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Dzahn) after https://gerrit.wikimedia.org/r/c/operations/puppet/+/785918 the conftool-data change does not appear on https://config-master.wikimedia.org/pybal/codfw/ ?
[21:31:01] serviceops, SRE: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul) @Dzahn i think it is best to create another task for this issue and not reopen the rack/setup task. Thanks
[21:42:32] serviceops, DC-Ops, SRE, ops-eqiad, GitLab (Infrastructure): Q3:(Need By: TBD) rack/setup/install gitlab100[3|4] and gitlab-runner100[2|3|4] - https://phabricator.wikimedia.org/T301177 (Dzahn) confirming that the "gitlab" hosts should use a public IP and the "gitlab-runner" hosts should use a...
[21:55:42] serviceops, Release-Engineering-Team: helm-linter started failing on operations/deployment-charts today - https://phabricator.wikimedia.org/T307043 (dancy)
[22:03:17] serviceops, Release-Engineering-Team: helm-linter started failing on operations/deployment-charts today - https://phabricator.wikimedia.org/T307043 (dancy)
[22:05:17] serviceops, Release-Engineering-Team: helm-linter started failing on operations/deployment-charts today - https://phabricator.wikimedia.org/T307043 (dancy) p:Triage→High
[22:15:39] hey, so I want to create a new namespace for image-suggestion and follow the docs and would be fine.. except in the diff I see also "developer-portal" namespace
[22:16:10] looks like https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/773267/ has not been deployed yet
[22:16:35] now I am sitting at the y/n prompt whether to do both
[22:17:16] I guess I'll do it but only for both staging clusters, not prod
[22:33:23] mutante: I don't think there will be any harm if you activate my developer-portal namespace. It will same someone else from doing so later.
[22:33:28] *save
[22:33:45] but if your long past that step no worries :)
[22:34:11] serviceops, Wikimedia-Developer-Portal, Goal, Patch-For-Review, Service-deployment-requests: New Service Request: developer-portal - https://phabricator.wikimedia.org/T297140 (Dzahn) Hi @akosiaris @JMeybohm Today I merged https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/77596...
[22:35:13] bd808: I added both our namespaces. But only on staging-codfw.
[22:35:40] :thumbs up:
[22:36:58] when literally following the docs how to check if everything is ok I am running into a permission issue. I remember having that before and someone told me what I was doing wrong. But I want to fix the docs and need a reminder. So I asked on that ticket
[22:37:12] will continue soon
[22:44:44] mutante: `kubectl auth can-i --list` will show you what rights the current account has. For what it is worth that `kubectl get ns` command does not work for me on a known working service account (toolhub)
[22:45:13] it also doesn't work for me in the Toolforge k8s cluster which makes me think it's a bogus get command
[22:46:30] you might try `kubectl get all` which actually doesn't list all the things, but should list deployments, pods, replicasets, cron tasks, and a few other resources if they are present in the namespace
[22:46:42] bd808: thanks for that command. useful. I remember running into this before, months ago, and there was an explanation for it. I just forget what it was and I want to fix those docs. Because I am literally following them and it won't be the last time.
[22:47:43] it's specifically for the "what are you supposed to test after adding a namespace"
[22:47:59] *nod* I see some other outdated things there like "kubectl get pods should show a tiller pod."
[22:47:59] appreciate it but need to go afk and get back to this later
[22:55:17] I figured out that I can do `kubectl get ns --as-group=system:masters --as=admin` with my super privileged credentials and that doing so lists all of the namespaces in the cluster. With that info in hand I realized that I can do `kubectl get ns $YOUR-SERVICE-NAME` with the service user creds.
[22:57:38] (The `--as-group=system:masters --as=admin` stuff was in Toolforge, not any of the prod clusters)
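
The post-namespace checks mentioned in the conversation could be collected into a few shell helpers. This is only a sketch, not the documented procedure: the function names and the `KUBECTL` override variable are hypothetical, and the helpers use nothing beyond the `kubectl` subcommands quoted in the log. Whether the single-namespace lookup succeeds depends on the rights of the credentials in use; the log only reports it working with service user creds.

```shell
# Hypothetical helpers wrapping the kubectl commands quoted in the log.
# KUBECTL is an assumed override hook (not from the log) so the helpers can
# be dry-run, e.g. with KUBECTL=echo, without touching a real cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Show which rights the current credentials have.
list_rights() {
    "$KUBECTL" auth can-i --list
}

# Look up one namespace by name instead of listing every namespace,
# which the log reports failing for a non-admin service account.
check_namespace() {
    "$KUBECTL" get ns "$1"
}

# List common workload resources in the current namespace; despite the
# name, "all" does not cover every resource type.
list_workloads() {
    "$KUBECTL" get all
}
```

With a service account's credentials loaded, `check_namespace image-suggestion` would run `kubectl get ns image-suggestion`. Setting `KUBECTL=echo` beforehand prints the arguments instead of contacting a cluster.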