[04:30:02] serviceops, Platform Engineering, SRE, ops-eqiad: Degraded RAID on sessionstore1003 - https://phabricator.wikimedia.org/T291738 (Marostegui) p:Triage→Medium
[06:51:58] serviceops, SRE, Patch-For-Review: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (jijiki) Open→Resolved a:jijiki
[08:13:14] serviceops, SRE, Patch-For-Review: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (Joe) Just as a side note, the number of timeouts on parsoid went from ~ 5k/day before the change to ~ 3k/day or less afterwards. That's a 40% decrease in the n...
[09:01:46] serviceops, Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (akosiaris) >>! In T291707#7377028, @Legoktm wrote: > Open questions: > * Is there a better way to fix this / identify the problematic requests? Zotero logs are nigh on useless (actively harmf...
[09:04:52] serviceops, Citoid: zotero paging / serving 5xxes after CPU spikes - https://phabricator.wikimedia.org/T291707 (Joe) I think some form of readinessProbe is indeed necessary, either by adding a GET endpoint or adding curl to the zotero container and a script that allows checking POST requests work.
[11:41:50] https://phabricator.wikimedia.org/T290750 fyi
[11:42:40] I doubt service ops has an incentive right now to take point on this and push it forward, but it seems service owners do care now. thankfully we aren't blocking them, and we already have images for them
[11:45:12] serviceops, SRE: Migrate node-based services in production to node12 - https://phabricator.wikimedia.org/T290750 (akosiaris) From the #SRE side, we've built and support * https://docker-registry.wikimedia.org/nodejs12-devel/tags/ * https://docker-registry.wikimedia.org/nodejs12-slim/tags/ They are b...
[12:55:53] serviceops, Analytics, Platform Engineering, Wikibase change dispatching scripts to jobs: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (Ottomata) > If I can do beeline in stat1005 and look at the data This would be possible, but you'd have to e...
[13:27:03] serviceops, Analytics, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) I think we should just proceed with eventgate, I'll do staging in each first. Will have to delete in staging...
[13:27:51] hiya, I'd like to push https://phabricator.wikimedia.org/T291504 forward this week.
[13:28:05] i'll have to do some repooling and deletion of deployments
[13:28:16] i'll do staging first, but i wanted to check with you all to make sure i do the right thing.
[13:29:34] for staging, just helmfile -e staging destroy, followed by helmfile -e staging apply?
[13:31:36] for codfw and eqiad, i don't yet know what to do...looking for some docs on how to temporarily depool and repool a service in a DC
[13:31:45] expecting to find some confctl stuff...
[13:35:07] ah ha
[13:35:07] https://wikitech.wikimedia.org/wiki/DNS/Discovery#How_to_manage_a_DNS_Discovery_service
[13:35:11] right?
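For context, the procedure on that wikitech page boils down to flipping the `pooled` flag on the service's DNS Discovery object with confctl, run from a puppetmaster. A rough sketch for the eventgate-logging-external case discussed here; the `set/ttl` step is an assumption based on _joe_'s TTL advice below, so check the object with `get` and the wikitech page before trusting it:

```
# Inspect the discovery object's current state (pooled, ttl) first.
confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' get

# Optionally lower the TTL so resolvers drop the old answer quickly
# (assumed syntax -- verify before running).
confctl --object-type discovery select 'dnsdisc=eventgate-logging-external' set/ttl=10

# Depool codfw: discovery DNS starts answering with the eqiad service IP.
confctl --object-type discovery select 'dnsdisc=eventgate-logging-external,name=codfw' \
    set/pooled=false

# ... helmfile destroy / apply in codfw happens here ...

# Repool codfw once the redeployed service is healthy.
confctl --object-type discovery select 'dnsdisc=eventgate-logging-external,name=codfw' \
    set/pooled=true
```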
[13:39:44] serviceops, Analytics, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Plan for staging: ` helmfile -e staging destroy # wait and make sure all is gone. helmfile -e staging apply `...
[13:40:06] _joe_: does ^ look right to you? for depooling and deploying a 'new' k8s deployment with the same name?
[13:43:24] <_joe_> yes
[13:43:25] serviceops, Analytics, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Joe) Don't forget to wait for the DNS TTL and/or lower the TTL before every depool/repool operation. so you might want...
[13:43:43] <_joe_> yes
[13:43:54] <_joe_> see my comment above for an additional suggestion
[13:44:09] great thank you.
[13:46:59] serviceops, Analytics, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Thanks, added this step into my comment above.
[13:48:18] proceeding with staging
[13:49:40] hmmm _joe_ Error: release production failed: networkpolicies.networking.k8s.io "eventgate-production" already exists
[13:50:11] <_joe_> ottomata: uhm that's strange
[13:50:17] <_joe_> this is staging, right?
[13:50:18] i'm surprised one with that name already exists; perhaps my attempt at applying this last week created it alongside the previously named one
[13:50:19] yes
[13:50:27] i did try this last week without destroying.
[13:50:32] it failed because of that port conflict
[13:50:33] <_joe_> ok I have meetings in 10
[13:50:41] <_joe_> so I can't really assist you
[13:50:41] so maybe the network policy was added with that name...?
[13:50:42] ok
[13:50:43] <_joe_> akosiaris: ?
[13:50:49] <_joe_> or effie
[13:51:18] what did I do now?
[13:51:20] <_joe_> sorry, which one are you trying to destroy and recreate?
[13:51:28] eventgate-logging-external staging
[13:51:29] * effie reads
[13:51:31] <_joe_> effie: help otto figure out what to do with his deployment
[13:51:52] sure sure
[13:52:16] I hope he knows my regular fee
[13:52:34] gratitude and emojis?
[13:52:49] chocolate bars and beer
[13:53:04] chocolate beers and bars, got it
[13:53:11] :D
[13:53:15] <_joe_> ottomata: helmfile -e staging apply worked
[13:53:20] you just did it?
[13:53:28] <_joe_> I think "create" does something different
[13:53:36] i did apply too
[13:53:39] <_joe_> sorry I just had this spark in my mind and wanted to confirm
[13:53:48] i must not have the magic touch :)
[13:53:54] <_joe_> possibly
[13:54:01] <_joe_> or you waited too little time before trying
[13:54:10] hmm could be...
[13:54:22] iirc, last week we said that ottomata would fail over to the inactive dc, and destroy and recreate in eqiad?
[13:54:26] <_joe_> also you can just, you know, delete the resource manually if it happens again
[13:54:29] is that correct?
[13:55:07] effie: yes
[13:55:07] https://phabricator.wikimedia.org/T291504#7380252
[13:55:22] i was just doing staging without any repooling
[13:55:31] and it failed, but it looks like it was temporary!
[13:56:42] ok looks good, i'm going to proceed with eventgate-logging-external codfw then
[13:56:45] this will be repooling
[13:56:51] alright then
[14:02:08] effie: just verifying, with a TTL=10, if i wait a few minutes after depooling, i should be good to go, yes?
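One empirical way to answer that question (and essentially what happens next in the log) is to watch the discovery record from hosts in the depooled DC and let at least one full TTL pass after the answer flips. A minimal sketch using plain dig:

```
# +noall +answer prints the record plus its remaining TTL.
dig +noall +answer eventgate-logging-external.discovery.wmnet

# From a host in the depooled DC, poll until the answer becomes the other
# DC's service IP; with TTL=10, waiting a couple of minutes beyond the flip
# is comfortably more than one full TTL.
watch -n 5 'dig +short eventgate-logging-external.discovery.wmnet'
```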
[14:02:28] i've resolved eventgate-logging-external.discovery.wmnet from a few codfw boxes
[14:02:37] alright
[14:02:41] and i did see it switch from 10.2.1.50 to 10.2.2.50
[14:02:56] so, i should be safe to destroy in codfw?
[14:03:58] effie: just double checking...don't want to cause an outage
[14:05:00] is everything as expected traffic wise?
[14:05:57] huh, yes...but it looks like codfw doesn't really get much traffic in this eventgate anyway
[14:06:23] ok, proceeding, i'll check traffic again when we do the opposite and fail over to codfw
[14:06:39] AH sorry no.
[14:06:40] yes.
[14:06:43] it looks expected
[14:06:48] (was looking at the wrong chart for a minute)
[14:06:57] yes, traffic looks gone in codfw.
[14:07:06] and increased in eqiad
[14:07:12] ok, let's see what happens then
[14:07:15] ok, proceeding
[14:08:12] heh, helmfile destroy is not auto-logged in #wikimedia-operations
[14:08:26] ok, going to wait a couple of mins
[14:11:29] applying
[14:13:53] looks good!
[14:13:57] ok...repooling
[14:16:14] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (akosiaris) A couple more points: * https://grafana.wikimedia.org/goto/RY75JPHnz points out that wikifeeds envoy did indeed see the errors. In both downstream (the l...
[14:19:04] ottomata: how does it look?
[14:20:44] hmmm, not sure, maybe not good, still looking
[14:20:50] traffic dropped in eqiad but not in codfw...
[14:20:55] but did not come back in codfw
[14:21:20] anything in your logs?
[14:21:26] trying to look now
[14:24:54] there is some kafka error
[14:24:56] and then nothing
[14:25:03] is it possible something went wrong with the networking policy/
[14:25:04] ?
[14:25:18] the pod shouldn't be ready though
[14:25:25] the readiness probe does a test POST and produce
[14:25:31] still investigating
[14:31:02] i think things look like they are working, just traffic is not back...not 100% yet
[14:31:14] i just realized i set pooled=true on deploy1002 instead of the puppetmaster
[14:31:26] the result looks fine though...but i reran on the puppetmaster just now just in case
[14:34:01] ya, the service is working fine in codfw
[14:34:16] effie: not sure why traffic isn't back
[14:34:17] https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-logging-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=codfw%20prometheus%2Fk8s&from=1632749649531&to=1632753249531
[14:34:27] or uh...
[14:34:32] where it went when it left eqiad?
[14:34:32] https://grafana-rw.wikimedia.org/d/ZB39Izmnz/eventgate?orgId=1&refresh=1m&var-service=eventgate-logging-external&var-stream=All&var-kafka_broker=All&var-kafka_producer_type=All&var-dc=eqiad%20prometheus%2Fk8s&from=1632749663126&to=1632753263126
[14:35:41] hmm, is it possible something up front still thinks it is down?
[14:35:43] there were some alerts
[14:36:09] e.g.
[14:36:09] https://icinga.wikimedia.org/icinga/
[14:36:12] oops
[14:36:19] https://icinga.wikimedia.org/cgi-bin/icinga/extinfo.cgi?type=2&host=eventgate-logging-external.svc.codfw.wmnet&service=eventgate-logging-external+LVS+codfw
[14:37:08] yeah...
[14:37:09] curl: (7) Failed to connect to eventgate-logging-external.svc.codfw.wmnet port 4392: Connection refused
[14:37:27] curl -v -k https://eventgate-logging-external.svc.codfw.wmnet:4392/v1/_test/event
[14:37:40] i'm going to depool again...dunno what's wrong
[14:37:42] refused?
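In hindsight, a quick first check for a "connection refused" like this is whether the TLS Service still selects any pods at all: a Service with an empty endpoint list gives exactly this symptom even while the pods themselves are Running and Ready. A sketch, reusing the kube_env helper that appears just below:

```
# Does the Service actually select any pods? An empty ENDPOINTS column means
# kube-proxy has nothing to forward to, and connections get refused.
kube_env eventgate-logging-external codfw
kubectl get svc,endpoints

# Compare the Service's selector with the labels the pods actually carry.
kubectl get pods -o wide --show-labels
```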
[14:37:51] now that is odd
[14:37:52] yes
[14:40:19] ok, traffic coming back in eqiad
[14:42:57] effie: is pybal or something depooling the thing or not routing traffic because it was down for a little while?
[14:44:27] can you curl kubernetes2017.codfw.wmnet:4392 ?
[14:44:54] no, i can't telnet either
[14:45:19] however, i can in eqiad
[14:45:46] hmmm, maybe we have something wrong in the new version of the chart with the networking or Service
[14:45:47] so pybal is not the problem
[14:46:05] eventgate-production-tls-service NodePort 10.192.72.160 4392:4392/TCP 34m app=eventgate,routed_via=production
[14:46:13] vs eventgate-logging-external-production-tls-service NodePort 10.64.72.17 4392:4392/TCP 188d app=eventgate-logging-external,chart=eventgate,routing_tag=eventgate-logging-external
[14:47:01] ah ha
[14:47:01] yes
[14:47:19] kube_env eventgate-logging-external codfw; kubectl get pods -o wide --show-labels
[14:47:24] app=eventgate-logging-external,chart=eventgate,pod-template-hash=975b5c946,release=canary,routed_via=production
[14:47:43] so the pod didn't get the correct new labels
[14:47:44] looking
[14:50:36] right, because _tls_helpers still has something different.
[14:50:41] app: {{ template "wmf.chartname" . }}
[14:50:41] chart: {{ template "wmf.chartid" . }}
[14:51:07] effie: related but later...when you have time, i would love your thoughts on https://phabricator.wikimedia.org/T282148#7373078
[14:51:52] I think we need a separate task
[14:52:01] ok
[14:52:05] tx tx
[14:52:07] i'll make one and copy over the comment
[14:52:14] great
[14:53:11] sigh effie ...should i just revert this change and for now make tls_helpers etc. just do what petr wants?
[14:53:17] that is, enable some envoy logging?
[14:53:30] maybe we should have that discussion about labels etc. before we do this?
[14:53:56] depends how long we want to drag this out
[14:54:18] can you help me understand the issue we have currently?
[14:59:32] aside from wanting to conform to the standard
[14:59:46] the reason for doing this was so that we could get some error logs from envoy to investigate https://phabricator.wikimedia.org/T215001
[15:00:00] logging is configured better in the common_templates than what eventgate had
[15:00:18] eventgate used its own copy of those templates because we did canary releases before they were available commonly
[15:00:37] now that canary etc. is in common, we should conform to common
[15:00:49] ah, meetings...
[15:01:33] ottomata: I meant, what is our issue now that we are getting a connection refused?
[15:04:31] effie: the issue is mismatched label selectors on the pod deployment vs the tls-proxy Service
[15:04:41] the Service and pods are all up
[15:04:59] but the selectors the Service uses aren't targeting the pods
[15:19:24] oh... the 'app' portion of the selector in the chart needs to be removed, I guess..
[15:19:32] ouch
[15:20:21] no, should be app: {{ template "wmf.chartname" . }} and no chart:
[15:21:23] the fact that routed_via is not respected in the _tls_helpers selector doesn't matter, the Service is not created in canary, right?
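The rule at play in this exchange: a Service's spec.selector must be a subset of the pod's labels, with every key it names matching exactly, so extra pod labels are harmless but any stale or wrong selector key breaks matching entirely. A sketch for putting the two side by side (jsonpath output shape may differ slightly across kubectl versions):

```
# Print the Service selector next to the pod labels; every key/value in the
# selector must appear verbatim among the labels for endpoints to exist.
kubectl get svc eventgate-production-tls-service -o jsonpath='{.spec.selector}{"\n"}'
kubectl get pods -o jsonpath='{range .items[*]}{.metadata.labels}{"\n"}{end}'

# In the incident above: the new Service selected app=eventgate while the
# pods carried app=eventgate-logging-external, so nothing matched.
```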
[15:21:31] no, it is not created in canary
[15:21:44] (meeting)
[15:31:15] serviceops, Wikidata-Query-Service, Discovery-Search (Current work): Additional capacity on the k8s Flink cluster for WCQS updater - https://phabricator.wikimedia.org/T280485 (Gehel) a:Zbyszko→Gehel
[15:31:43] serviceops, Kubernetes: Clarify helm common label and service conventions - https://phabricator.wikimedia.org/T291848 (Ottomata)
[15:33:00] serviceops, Kubernetes: Clarify helm common label and service conventions - https://phabricator.wikimedia.org/T291848 (Ottomata) Note: more, older, and verbose context in {T242861}
[15:36:39] serviceops, Kubernetes: Clarify common k8s label and service conventions in our helm charts - https://phabricator.wikimedia.org/T291848 (Ottomata)
[15:54:52] ottomata, Pchelolo: can I help you somehow?
[15:56:17] i'll be back on this after meetings and lunch in maybe 1h
[16:00:11] actually, before lunch
[16:00:22] i'm going to revert for now and redeploy to codfw
[16:00:37] this is breaking some very minor stuff in analytics
[16:00:45] better to get it all fixed and then try again
[16:05:08] alright
[16:08:29] serviceops, Analytics, Event-Platform: eventgate helm chart should use common_templates _tls_helpers.tpl instead of its own custom copy - https://phabricator.wikimedia.org/T291504 (Ottomata) Ah, there were some mistakes in our patches: the tls Service wasn't using the same label selectors that the po...
[16:12:52] serviceops, Analytics, Data-Engineering, Platform Engineering, Wikibase change dispatching scripts to jobs: Better observability/visualization for MediaWiki jobs - https://phabricator.wikimedia.org/T291620 (odimitrijevic)
[16:16:18] ok, codfw service looks fine now
[16:16:19] repooling
[16:17:48] resetting the TTL for now too
[16:18:47] traffic going back up in codfw
[16:18:57] ok, am around but making lunch, back on to figure out envoy logging etc. after
[16:24:53] serviceops, SRE, Wikifeeds: wikifeeds in codfw seems failing health checks intermittently - https://phabricator.wikimedia.org/T290445 (akosiaris) I've gone ahead and created https://grafana-rw.wikimedia.org/d/Y1UyyEH7z/t290445?orgId=1 to depict the findings in grafana a bit more. It's not the entire...
[16:29:25] serviceops, SRE, Patch-For-Review: Deploy PHP patch for DOM replaceChild/removeChild performance - https://phabricator.wikimedia.org/T291052 (ssastry) This bug fix is definitely one of the reasons, but we also rolled out a number of other perf tweaks / improvements that Tim made in Parsoid itself in...
[17:27:47] serviceops, Analytics, Data-Engineering, Event-Platform: Enable envoy tls proxy logging from eventgate - https://phabricator.wikimedia.org/T291856 (Ottomata)
[17:30:31] effie: Pchelolo: https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/724137
[17:59:09] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-9), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (Daimona)
[18:48:24] anyone know how i can verify/find envoy http error logs from the envoy sidecar container?
[18:48:29] oh wait
[18:48:35] is it just kubectl logs....
[18:48:36] i bet it is
[18:49:29] nice, ok.
[19:05:40] looking fine, proceeding with deploying updated envoy configs
[19:16:07] legoktm: anything I can do to prep for setting up the LVS for Toolhub?
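To close the loop on the sidecar-logs question above: yes, it is just kubectl logs, with -c naming the container inside the pod. A sketch; the proxy container's exact name depends on the chart, so it is listed first rather than assumed:

```
kube_env eventgate-logging-external eqiad

# Find a pod and list its containers; one of them is the envoy TLS proxy.
POD=$(kubectl get pods -o jsonpath='{.items[0].metadata.name}')
kubectl get pod "$POD" -o jsonpath='{.spec.containers[*].name}{"\n"}'

# Tail that container's logs (substitute the real container name).
kubectl logs "$POD" -c <tls-proxy-container> --since=1h -f
```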
[19:36:26] ok, deployed that for all eventgates
[19:36:29] Pchelolo: should have envoy logs now
[19:36:35] wooohooo
[19:36:38] thank you ot
[19:36:40] ottomata:
[19:36:49] not totally sure how to see them all other than kubectl logs per pod
[19:36:53] are they in logstash somehow?
[19:37:15] ya... there's a dashboard somewhere
[19:38:53] ottomata: https://logstash.wikimedia.org/app/dashboards#/view/7f883390-fe76-11ea-b848-090a7444f26c?_g=h@c823129&_a=h@4506114
[19:40:34] boom, there's our first log
[19:42:14] i think i need a share url
[19:44:28] https://logstash.wikimedia.org/goto/1ef5e4c1749e58225ef29b8077e4ae20
[19:45:05] nice!
[19:45:07] ok, but why!?
[19:50:51] reading more..
[19:51:04] all of them have the UC flag, which is "UC: Upstream connection termination", in addition to the 503 response code.
[20:13:19] ottomata: I know the reason
[20:13:30] like, for sure. gerrit coming
[20:34:02] serviceops, SRE, wikidiff2, Community-Tech (CommTech-Sprint-10), Platform Team Workboards (Platform Engineering Reliability): Deploy wikidiff2 1.12.0 - https://phabricator.wikimedia.org/T285857 (ldelench_wmf)
[20:39:32] Pchelolo: ! interesting.
[20:39:39] like for sure.
[20:39:52] seems like it's a common problem a lot of people on the internet are experiencing
[20:39:55] ok, i have to leave in approx 50 minutes, should i roll this out now or wait til tomorrow? :)
[20:40:08] up to you
[20:40:13] I'm in no rush :)
[20:40:20] k, would prefer tomorrow just in case
[20:40:23] will do it my morning
[20:40:39] i guess it's rather nice that we aren't symlinking _tls_helpers, eh? we can experiment :p
[20:40:40] cool. we'd also have more 503 logs to compare the rate
[20:40:51] aye
[20:41:33] Pchelolo: ...would you want to make it more than the nodejs timeout?
[20:41:43] you want nodejs doing the timeout, right, not envoy if possible?
[20:41:43] no, needs to be less
[20:42:06] this is the keepAlive timeout, not a real timeout
[20:42:18] oh
[20:42:20] so is this the same thing as https://phabricator.wikimedia.org/T287288#7265748 ?
[20:42:22] oh
[20:42:45] yeah, same thing. but this time it's a local connection
[20:43:00] less likely cause no network latency, but still possible
[20:43:09] hm
[20:51:28] bd808: not that I can think of, will try to get it done tonight
[20:51:56] * bd808 quivers with excitement
[21:14:57] serviceops, Analytics, Analytics-Kanban, Data-Engineering, Event-Platform: Enable envoy tls proxy logging from eventgate - https://phabricator.wikimedia.org/T291856 (Ottomata)
[23:22:09] serviceops, DC-Ops, SRE, ops-codfw: Q1:(Need By: TBD) rack/setup/install mw241[2-9].codfw.wmnet - https://phabricator.wikimedia.org/T290192 (Papaul)
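A footnote on the UC-flagged 503s above: with HTTP keep-alive between envoy and the local service, whichever side has the shorter idle timeout closes the socket first, and if the upstream (node's server.keepAliveTimeout defaults to 5s) wins that race while envoy is reusing the connection, the request dies with 503/UC. That is why the proxy-side value "needs to be less": envoy should give up on an idle connection before the upstream does. A rough sketch for comparing the 503/UC rate before and after a change; it assumes the access-log lines carry the status code and response flags, so the grep pattern may need adjusting to the real format:

```
# Count UC-flagged 503s per pod over the last day, to compare rates across
# a config change. Container name is a placeholder; see kubectl logs above.
kube_env eventgate-logging-external codfw
for POD in $(kubectl get pods -o jsonpath='{.items[*].metadata.name}'); do
    printf '%s ' "$POD"
    kubectl logs "$POD" -c <tls-proxy-container> --since=24h | grep -c '503.*UC'
done
```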