[09:34:05] o/ I'm about to push the flink chart to staging
[09:40:10] failed with: Error: release main failed: roles.rbac.authorization.k8s.io is forbidden: User "system:serviceaccount:rdf-streaming-updater:tiller" cannot create resource "roles" in API group "rbac.authorization.k8s.io" in the namespace "rdf-streaming-updater"
[09:47:09] dcausse: o/
[09:47:19] I'll take a look in a bit
[09:47:56] thanks!
[09:54:36] oops
[10:04:03] dcausse: please try again
[10:06:08] sure
[10:09:42] jayme: still running, is it waiting for the pods to be up and running?
[10:10:31] it will wait for them, yes. If they fail to come up, the release will be rolled back (e.g. removed in the case of a first deploy)
[10:12:59] http://localhost:6022/auth/v1.0 connection refused, it's swift, checking if the port matches
[10:15:03] hieradata/common/profile/services_proxy/envoy.yaml seems to agree, 6022 is thanos-swift
[10:18:24] jayme: is it ok to try again for debugging or is there a way to access the logs of dead pods?
[10:19:28] logstash should provide you with the dead containers' logs
[10:19:44] I can help with finding them in a second
[10:19:57] but ofc. it's fine to try again as well
[10:23:09] hm.. looks like logstash did not pick up the logs
[10:25:13] np, I'll try again this afternoon if that's not a problem
[10:26:22] not at all
[10:56:01] dcausse: btw, is the swift container created?
[10:56:11] does the app create it if it does not exist?
[12:30:41] effie: I think swift containers are automatically created, yes; also it seems that flink is not even able to establish the connection, will investigate
[12:34:02] looks like the envoy sidecar is not starting, that would explain the failure
[12:34:09] caused by field: "port_specifier", reason: is required
[12:39:36] I see in clusters.load_assignment.endpoints.lb_endpoints.address a socket_address only specifying address: "127.0.0.1" without any port
[12:40:38] cluster_name is "local_service"
[12:45:38] sorry, I was away
[12:45:48] it wants .Values.main_app.port
[12:46:10] ah that must be the service I want to expose, hm..
[12:51:32] let's add a port 8080
[12:51:37] and go on
[12:51:59] dcausse: should I do it or you?
[12:52:57] effie: doing
[12:53:36] effie: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/704781/
[12:57:26] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[12:58:48] serviceops, SRE, Patch-For-Review: bring 43 new mediawiki appserver in eqiad into production - https://phabricator.wikimedia.org/T279309 (Dzahn)
[13:13:07] whoo flink-session-cluster-main-jobmanager-57cf8b6b8d-r6r9b 2/2 Running 1 34s
[13:13:17] now checking that it's actually running :)
[13:14:12] woohooo!
[13:14:27] spoke too soon: http://localhost:6022/auth/v1.0 failed, status code: 503, status line: HTTP/1.1 503 Service Unavailable GET http://localhost:6022/auth/v1.0 => 503 : upstream connect error or disconnect/reset before headers. reset reason: connection failure
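For context on the port_specifier failure above: the sidecar's "local_service" cluster appears to be rendered from .Values.main_app.port, so leaving it unset produces a socket_address with no port. The fix in 704781 is presumably along these lines; this is only a sketch of the values shape, assuming the chart's standard main_app block, not the actual patch content:

    main_app:
      # hypothetical excerpt of the chart's values file; the real file has many more fields
      port: 8080  # port the jobmanager container listens on locally; envoy's local_service cluster uses it as the port_specifier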
[13:25:10] where is it trying to connect to?
[13:25:22] thanos swift
[13:25:38] ok let me look again at the firewalling stuff
[13:26:23] envoy says: [2021-07-15T13:19:29.319Z] "GET /auth/v1.0 HTTP/1.1" 503 UF 0 91 250 - "-" "Apache Hadoop Swift Client 2.8.1 from XYZ by vinodkv source checksum XYZ" "UUID" "localhost:6022" "10.2.1.54:443"
[13:27:01] not sure how to interpret that, is it envoy returning 503 or swift (10.2.1.54:443)?
[13:29:00] looking at https://www.envoyproxy.io/docs/envoy/latest/configuration/observability/access_log/usage, UF means upstream connection failure, so most probably envoy fails to contact swift
[13:30:44] my question is whether this is at the networking layer or not
[13:31:05] firewalling is most likely not the issue
[13:32:56] godog: FYI, might be related to the thanos-fe2001 reimage?
[13:33:00] ^^^
[13:33:51] possible, but thanos-fe2001 is depooled
[13:34:17] 10.2.1.54 is thanos-swift.svc.codfw.wmnet.
[13:34:37] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1300.eqiad.wmnet` - m...
[13:34:47] dcausse: is the error still on?
[13:34:55] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[13:35:06] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn) p:Medium→High
[13:35:51] godog: I'm deploying a new service and not sure thanos-swift is to blame yet
[13:36:05] dcausse: ah ok, LMK in case
[13:36:34] thanks! it's probably not, since this service is running fine from the hadoop cluster
[13:40:32] dcausse: does seem like firewalls
[13:41:14] effie: thanks for digging into this!
[13:41:24] np, ping me if I can help
[13:41:33] effie: ping :P
[13:42:18] I'm not sure how to fix firewall issues
[13:43:08] ah sorry, I meant it is NOT
[13:43:09] hahahaha
[13:43:12] ah!
[13:43:16] sorry for that :p
[13:43:21] np :)
[13:52:00] after re-reading https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/699380 for knative-serving I have some doubts about admin_ng vs services
[13:52:45] knative-serving feels like something different from a regular service, but I am wondering where the line is between a service and an admin_ng config
[13:53:14] root permissions?
[13:54:17] I don't see an egress rule allowing traffic to 10.2.1.54:443 from the pod (kubectl get networkpolicy flink-session-cluster-main-jobmanager -o yaml), could that be the cause?
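If the rendered policy is default-deny for egress, a missing rule like that would indeed explain the UF (upstream connection failure) in the envoy access log. A minimal sketch of the kind of rule the generated NetworkPolicy would need; in the real charts this comes from values and shared templates rather than being written by hand, and the selector labels below are assumptions:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: flink-session-cluster-main-jobmanager
      namespace: rdf-streaming-updater
    spec:
      podSelector:
        matchLabels:
          app: flink-session-cluster   # assumed label; the chart defines the real selector
      policyTypes:
        - Egress
      egress:
        - to:
            - ipBlock:
                cidr: 10.2.1.54/32     # thanos-swift.svc.codfw.wmnet
          ports:
            - protocol: TCP
              port: 443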
[13:58:45] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto)
[13:59:11] dcausse: looking
[13:59:16] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Jelto)
[13:59:19] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (Dzahn)
[14:11:04] serviceops, SRE, decommission-hardware, Patch-For-Review: decom 44 eqiad appservers purchased on 2016-04-12/13 (mw1261 through mw1301) - https://phabricator.wikimedia.org/T280203 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1001 for hosts: `mw1301.eqiad.wmnet` - m...
[14:12:00] serviceops, GitLab, Infrastructure-Foundations: request service IP / DNS name for gitlab-failover, apply puppet role on gitlab2001 - https://phabricator.wikimedia.org/T285870 (Volans) p:Triage→Medium @Jelto and I went over this together and created the [[ https://netbox.wikimedia.org/search/?...
[14:14:55] dcausse: I will have a look with janis, I am probably missing something
[14:14:59] serviceops, GitLab, Infrastructure-Foundations: request service IP / DNS name for gitlab-failover, apply puppet role on gitlab2001 - https://phabricator.wikimedia.org/T285870 (Dzahn) host gitlab-replica.wikimedia.org gitlab-replica.wikimedia.org has address 208.80.153.105 gitlab-replica.wikimedia.org...
[14:15:20] effie: thanks!
[14:15:23] serviceops, GitLab, Infrastructure-Foundations: request service IP / DNS name for gitlab-failover, apply puppet role on gitlab2001 - https://phabricator.wikimedia.org/T285870 (Jelto) a:Jelto
[14:19:41] I will ping you when we have something
[15:41:50] serviceops, Machine-Learning-Team, SRE, Kubernetes, Patch-For-Review: Add the possibility to deploy calico on kubernetes master nodes - https://phabricator.wikimedia.org/T285927 (elukey) Open→Resolved a:elukey istio bootstrapped, everything worked nicely, thanks a lot to all that...
[16:55:02] dcausse: ah, I see we're doing the same thing here :D
[16:55:30] jayme: yes, I should have seen your patch earlier! :)
[16:55:45] I abandoned mine, it was still missing stuff after reading yours :)
[16:55:53] I was first? Great :D
[16:56:00] by far :P
[16:56:47] still fighting CI, but I think it's happy now
[16:57:23] I wanted not to force people to define kafka.allowed_clusters in every chart
[16:57:36] yes, makes sense
[17:01:29] dcausse: are you fine with waiting for e.ffie's review and deploying tomorrow?
[17:01:50] jayme: sure, I was about to go offline anyways
[17:02:21] thanks a ton to you both for the help!
[17:02:50] yw!
[17:04:22] dcausse: oh, I did not include the actual "allowed_clusters" in the helmfile.d values files. You might want to keep that part of your change
[17:04:45] sure, I'll push a new patch tomorrow, np
[18:22:38] serviceops, MW-on-K8s, SRE, Release-Engineering-Team (Radar): The restricted/mediawiki-webserver image should include skins and resources - https://phabricator.wikimedia.org/T285232 (dduvall) >>! In T285232#7199870, @Joe wrote: > So after some more scavenging, We need the following directories to...
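On the allowed_clusters point above: the idea being discussed is that a chart no longer hard-codes its kafka egress, and each release instead lists the clusters it needs in its helmfile.d values, which the shared templates turn into egress rules. A purely hypothetical sketch of the shape such a values entry could take; the actual key layout is whatever the patches under review define:

    # hypothetical helmfile.d values excerpt for the rdf-streaming-updater release
    kafka:
      allowed_clusters:
        - main-eqiad
        - main-codfw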
[18:43:31] * maxbinder Hey, folks! Someone in Product without IRC was hoping for Service Ops attention on this task: https://phabricator.wikimedia.org/T285219 The Phab task is tagged with "#serviceops", and I came here via https://www.mediawiki.org/wiki/Wikimedia_Site_Reliability_Engineering#Service_Operations
[18:43:41] Am I in the right place? :)
[18:45:43] maxbinder: you're in the right place but I'm not sure who's the right person from the team to take a look :) let me see if I can figure anything out
[18:48:12] (I see j.ayme already commented but it's late in his evening now)
[18:49:31] * maxbinder thanks!
[19:01:16] maxbinder: okay, caught up
[19:02:10] my understanding is that the error originates in the MW API, and the proxy is just passing it along to cxserver, so we should treat this the same as any other error from the API
[19:02:43] the new thing is that the added proxy in the middle makes it harder to trace requests and troubleshoot what's going on at the API layer
[19:02:48] have I understood right so far? :)
[19:07:04] looks like the circle has closed :D
[19:09:23] rzl: what is not clear to me is whether the tls-proxy is generating the 503 response (shown in the task) for a failed request, or whether it is forwarding it verbatim from the api servers
[19:10:21] Nikerabbit: off the top of my head, it may or may not be changing the status code (turning a 500 into a 503 or something) but the *body* looks very much like something the api server would generate
[19:11:44] I'm also somewhat suspicious of the rate of these errors for simple queries that should not fail... if it was a general problem outside tls-proxy I think people would be complaining... and my other guess is that the proxy hits rate limiting
[19:11:53] that response body looks like that response body is https://gerrit.wikimedia.org/g/operations/mediawiki-config/+/master/errorpages/503.html to me
[19:12:05] uh, pretend I didn't mangle the beginning of that :)
[19:12:22] and just guessing from the file name, it looks like it was a 503 from MW originally, so the proxy is just proxying
[19:16:05] assuming it's not tls-proxy, who would be able to help figure out the source of these errors? Because I tried to search for them in logstash and could not associate them with any other log entries
[19:16:13] some kind of rate limiting is possible, but it's not the first thing I'd guess -- the requests should look the same to MW regardless of whether they came through a sidecar proxy or not, so I wouldn't expect the rate-limiting situation to look any different with or without a proxy
[19:16:42] yeah I think that's the right question -- definitely not trying to dismiss this, just trying to help narrow it down :)
[19:17:12] unfortunately I don't have a ton of experience logdiving on these -- j.ayme is one of the first people I'd ask, and he commented earlier on the task, so he may be able to shed some more light once it's working hours for him again
[19:18:09] fingers crossed he has some more ideas :)
[19:19:08] failing that, I think it should be plausible to generate the same traffic directly to the apiservers and collect more debugging data that way -- I really do expect the same requests to fail at the same rate, and if that turns out *not* to happen we'd learn something else valuable
[19:19:11] thanks both! glad the right connections are happening :)
[19:21:16] but for the record the api servers are doing about 5K requests/sec -- so something generating 195 errors over 24 hours would be a tiny drop in the bucket, in terms of our overall error rate
[19:21:36] that doesn't mean it's not worth fixing, but it does mean it's possible it's been happening all along, and it just didn't show up in any of the aggregate metrics
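For scale, the arithmetic behind that estimate, assuming the ~5K req/s figure holds roughly steady across the day:

    5,000 req/s * 86,400 s/day ~= 432,000,000 API requests/day
    195 errors / 432,000,000 requests ~= 0.000045% of requests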