[05:09:24] <_joe_> ottomata: mmkubernetes
[05:09:30] <_joe_> the rsyslog module
[05:09:45] <_joe_> ah sorry prometheus
[05:09:58] <_joe_> then the prometheus kubernetes scraper
[07:08:46] good morning folks
[07:09:03] I checked change-prop's metrics after yesterday's deployment and I see
[07:09:06] https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=codfw%20prometheus%2Fk8s&from=now-24h&to=now&viewPanel=27
[07:09:18] the start time matches more or less with my deployment
[07:10:42] ah wait no, I see https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&refresh=1m&var-dc=eqiad%20prometheus%2Fk8s&from=now-24h&to=now
[07:11:01] the larger picture shows that metrics started to be re-published again after the pod restarts
[07:11:35] https://phabricator.wikimedia.org/T328683
[07:11:36] okok
[08:01:14] 10serviceops, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10Marostegui)
[08:13:43] <_joe_> elukey: do you know if changeprop uses prometheus-statsd-exporter or just exports prom metrics natively?
[08:14:09] _joe_ no idea
[08:14:10] <_joe_> what I think happens in the latter case is that the worker holding the prom metrics crashes because of memory occupation or something
[08:14:24] <_joe_> it's the only thing that runs for so long in production
[08:21:12] makes sense yes!
[08:30:40] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Optimize k8s same row traffic flows - https://phabricator.wikimedia.org/T328523 (10ayounsi) >> However using Calico's numAllowedLocalASNumbers config knob will be needed, as all the nodes from a given cluster use the same AS#. > > You could als...
[09:02:43] o/ is there any maintenance going on wikikube@codfw? I have a pod in a pending state: "0/20 nodes are available: 13 Insufficient cpu, 3 node(s) were unschedulable,
[09:02:45] 4 node(s) had taints that the pod didn't tolerate."
[09:10:07] dcausse: in theory no, what namespace?
[09:10:24] elukey: rdf-streaming-updater
[09:10:31] ack, checking
[09:12:51] I see 3 nodes with scheduling disabled, 2009/2010/2020
[09:12:55] mmmmm
[09:13:40] I don't see anything for 2009 in SAL and phab atm
[09:15:16] do others know why 3 nodes are (IIUC) cordoned?
[09:16:21] ahh I think it was done for https://phabricator.wikimedia.org/T327001
[09:17:03] _joe_ do you think that we could uncordon them?
[09:17:19] or akosiaris
[09:18:45] sigh, just lost power here, now on a poor 4g connection
[09:20:39] dcausse: there are 3 nodes without pods (either for an experiment or due to an old reason) but it is weird that there is no space on the others though
[09:21:51] <_joe_> yes
[09:22:37] I request a full CPU, perhaps I can lower that if it helps? I doubt that in normal conditions it's what we need
[09:23:01] spot-checking the other nodes with kubectl describe, it seems that they are busy
[09:24:11] I'll wait for Alex to see if he is running an experiment (saw some chats on the k8s chan about it)
[09:25:24] elukey: I am running an experiment on these nodes
[09:25:35] gimme a few more minutes and I'll fully pool them back in action
[09:25:58] ack!
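For reference, the spot checks in the exchange above map onto standard kubectl commands. This is a minimal sketch only; the pod placeholder and the exact node name are illustrative, not taken from the log:

    # Why is the pod Pending? The Events section lists the scheduler's reasons,
    # e.g. "Insufficient cpu" or "node(s) were unschedulable".
    kubectl describe pod -n rdf-streaming-updater <pending-pod>

    # Which nodes are cordoned? They report SchedulingDisabled.
    kubectl get nodes | grep SchedulingDisabled

    # How much CPU is already requested on a given node?
    kubectl describe node kubernetes2009.codfw.wmnet | grep -A 5 'Allocated resources'

    # Put a cordoned node back into scheduling rotation once the experiment is done.
    kubectl uncordon kubernetes2009.codfw.wmnet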
[09:26:01] but I did not expect that we would have such an issue
[09:26:29] maybe we need to revisit a bit the charts' affinities and tolerations
[09:33:53] dcausse: elukey: nodes uncordoned, you should be good to go
[09:34:06] akosiaris: thanks
[09:34:27] it's unblocked
[09:35:06] super
[09:35:10] I see all pods running now
[09:36:40] elukey: thanks!
[09:57:22] so, it's not affinity or tolerations that caused the issue apparently. Just that we did not have enough capacity around?
[09:57:51] hmm that's not good. Losing 3 nodes gets us into a state where we can't deploy something ...
[10:00:50] <_joe_> akosiaris: we have two more nodes to install
[10:02:29] yup
[10:02:34] we need to get on it apparently
[10:56:13] 10serviceops, 10DBA, 10Data-Engineering-Planning, 10Data-Persistence, and 11 others: codfw row A switches upgrade - https://phabricator.wikimedia.org/T327925 (10MoritzMuehlenhoff)
[11:14:45] _joe_: changeprop uses the exporter
[11:15:49] elukey: those (seeming) upticks aren't real increases in time, just busted metrics
[11:27:11] <_joe_> hnowlan: uh, then I have no good explanation for this
[11:41:49] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi)
[11:44:18] _joe_: it is almost certainly something to do with how the changeprop workers are spawned/respawned
[11:44:48] gonna spend some time trying to get useful metrics/logs out of it today to start to unravel this
[12:03:38] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10cmooney) > Unfortunately, as mentioned in https://blog.ipspace.net/2021/09/graceful-restart.html "BGP Graceful Restart (RFC 4724) looks like it’s been designed by cowboys" as there is...
[12:16:28] 10serviceops, 10MW-on-K8s, 10SRE, 10SRE Observability: Ingest php-slowlog in logstash - https://phabricator.wikimedia.org/T326794 (10Clement_Goubert) We are now correctly sending, ingesting and storing slowlogs in ECS format. Next step, dashboards.
[12:18:48] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF)
[12:20:33] 10serviceops, 10SRE, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10Trizek-WMF) Thank you Clément. @RZamora-WMF is my backup for this task. She will review all steps when I made the...
[12:34:07] 10serviceops, 10Datacenter-Switchover: switchdc services cookbook should allow pooling services in both DCs (active/active) - https://phabricator.wikimedia.org/T290919 (10Clement_Goubert) Directly related work by @Joe that could use a couple eyeballs https://gerrit.wikimedia.org/r/c/operations/cookbooks/+/886038
[12:39:21] 10serviceops, 10MW-on-K8s, 10SRE: Use encrypted rsync for releases - https://phabricator.wikimedia.org/T289858 (10Clement_Goubert) I don't think this should be considered a blocker for {T327920}. However, we should address it for mw-on-k8s and releases.
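Context for the exporter discussion above: with prometheus-statsd-exporter, counter state lives in the exporter process rather than in the app worker, so a worker respawn alone should not reset the series. A minimal sketch of that path, assuming the upstream statsd_exporter defaults (StatsD ingest on UDP 9125, metrics served on port 9102); the metric name is made up:

    # Emit a StatsD counter the way an app worker would.
    echo "changeprop.test_events:1|c" | nc -u -w1 127.0.0.1 9125

    # The exporter accumulates the value and exposes it for Prometheus to scrape,
    # independently of app worker restarts (dots become underscores by default).
    curl -s http://127.0.0.1:9102/metrics | grep changeprop_test_events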
[12:59:04] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Clement_Goubert)
[13:00:31] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Clement_Goubert) p:05Triage→03Medium
[13:01:33] 10serviceops, 10SRE, 10Patch-For-Review: Use encrypted rsync for deployment::rsync - https://phabricator.wikimedia.org/T289857 (10Clement_Goubert) I don't think this should be considered a blocker for {T327920}.
[13:09:16] 10serviceops, 10DBA, 10Data-Persistence, 10SRE, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10jcrespo) CC dbas & cloud- This worries me- while labswiki won't have a lot of queries- there is no way to migrate the user to the other datacenter, like i...
[13:10:03] 10serviceops, 10DBA, 10Data-Persistence, 10SRE, and 4 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10taavi)
[13:11:49] 10serviceops, 10Prod-Kubernetes, 10PyBal, 10SRE, 10Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10akosiaris) 05Open→03Declined I am gonna tentatively set this as `declined`. The Service IPs announcement path led to nowh...
[13:15:11] 10serviceops, 10Infrastructure-Foundations, 10SRE, 10netops: Calico and BFD - https://phabricator.wikimedia.org/T328338 (10ayounsi) 05Open→03Resolved a:03ayounsi After a discussion with @akosiaris the initial BFD need was for an Anycast experiment and as explained in T238909#8585199 this is not in sc...
[13:15:21] 10serviceops, 10Prod-Kubernetes, 10PyBal, 10SRE, 10Traffic: Proposal: simplify set up of a new load-balanced service on kubernetes - https://phabricator.wikimedia.org/T238909 (10ayounsi)
[13:19:19] 10serviceops, 10DBA, 10Data-Persistence, 10SRE, and 4 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Marostegui) Unfortunately we can't do anything with the DB. So we might have cross DC queries unless #cloud-services-team can set up a labweb host in codf...
[13:29:12] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Performance-Team (Radar): March 2023 Datacenter Switchover Blockers - https://phabricator.wikimedia.org/T328770 (10Clement_Goubert)
[13:49:28] 10serviceops, 10ChangeProp, 10Content-Transform-Team-WIP, 10Page Content Service, and 3 others: Parsoid cache invalidation for mobile-sections seems not reliable - https://phabricator.wikimedia.org/T226931 (10akosiaris) Adding @hnowlan and @Eevans in case they are able to shed some more light on this one....
[14:25:40] 10serviceops, 10Data-Persistence, 10SRE, 10cloud-services-team, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10Marostegui) Leaving #data-persistence tag instead of #DBA. We can support #cloud-services-team as much as needed, but we can't really do a...
[14:58:02] _joe_: can you give https://gerrit.wikimedia.org/r/c/operations/puppet/+/886362 a quick pass?
it's not the best solution but should be good enough; I want to make sure there is nothing I'm missing (a better solution, or why this is a really bad idea)
[14:58:38] 10serviceops, 10Data-Persistence, 10SRE, 10cloud-services-team, and 3 others: Check wikitech switchover from labweb eqiad - https://phabricator.wikimedia.org/T328768 (10jcrespo) >>! In T328768#8585356, @Marostegui wrote: > Leaving #data-persistence tag instead of #DBA. We can support #cloud-services-team a...
[14:58:50] <_joe_> jbond: sure
[14:58:58] thx
[15:08:17] 10serviceops, 10Data-Persistence, 10SRE, 10Datacenter-Switchover, 10Patch-For-Review: switchdc should automatically downtime "Read only" checks on DB masters being switched - https://phabricator.wikimedia.org/T285803 (10Clement_Goubert) I've rebased and implemented one of @Volans' recommendations on the CR...
[15:17:18] _joe_: I can change that file to a symlink to avoid the latency issues?
[15:17:43] <_joe_> that might create issues with apache
[15:17:52] <_joe_> so no, it's ok this way
[15:18:04] It's still less latency than a manual change that might get forgotten
[15:18:13] <_joe_> I mean
[15:18:18] <_joe_> is the change even needed?
[15:19:22] idk anymore, I've been spending all my limited project management resources on wrangling the datacenter-switchover backlog into things that need to be done and things that don't :')
[15:20:13] _joe_: ack, that was also my thinking (I originally had it as a link but changed it to a copy)
[15:21:18] as to whether the change is needed at all (manually or automated): arguably it's just data which can be ignored
[15:22:18] but I'm guessing someone mentioned it to me as an issue at the time of the last one, possibly kor.mat as they are the only other person on the task
[15:22:54] Yep, as I said I'm trying to round up the backlog left over from after the last S/O
[15:23:27] either way I'm merging for now, can always revisit if it becomes more trouble than it's worth
[15:23:30] thanks
[15:23:44] <3
[15:27:49] hey _joe_, circling back about https://phabricator.wikimedia.org/T271184 - this also affected the summary endpoint. TL;DR: the template in question was receiving an edit war, and that created several events to purge the cache of the affected pages; some got stuck in cache. That happens for the mobile-html and summary endpoints.
[15:28:30] <_joe_> mbsantos: are we talking wikipedia pages or other projects?
[15:28:37] eswiki
[15:29:13] <_joe_> oh so it's a separate issue from the other one that was found
[15:29:40] <_joe_> sorry, I have 1.5 hours of meetings now, hopefully someone else can help in the meantime
[15:30:25] the question we have is - should a null_edit event to the template be enough to trigger a cache re-render for summary and mobile-html?
[15:30:35] <_joe_> in theory, yes
[15:30:42] <_joe_> I am curious about what broke there though
[15:30:54] <_joe_> any idea how widespread the issue is?
[15:31:46] If there was a significant backlog, some events could have expired before they were acted upon, I guess?
[15:32:27] the only way I can think of is to query cassandra for the affected pages and see the cache lifetime, if that is available
[15:32:38] but that's not something I've done before
[15:33:42] but sampling is inconsistent and ineffective; the only pages where I was able to see corrupted cache were the ones reported by the community
[15:34:17] I don't see any significant backlog buildup around the time in question
[15:34:47] is that your reference for the time?
https://es.wikipedia.org/w/index.php?title=Plantilla:Ficha_de_persona&diff=next&oldid=148895648
[15:34:49] Also I didn't find any ratelimit/errors in logstash for that window either
[15:35:12] that's the first vandalism edit
[15:36:11] oh wait, I see it now https://grafana.wikimedia.org/d/000300/change-propagation?orgId=1&from=1674867418000&to=1674909764000&var-dc=eqiad%20prometheus%2Fk8s&viewPanel=27
[15:36:45] there's a spike in processing time around the same time, unsurprisingly
[15:36:51] but it doesn't look like anything went wrong per se
[15:39:31] This isn't me absolving changeprop btw, it entirely could have been something going wrong that we haven't caught, but on the face of things it seems like it's doing what it's supposed to
[15:42:17] yeah, it looks like events were properly triggered and might even have finished properly. But I wonder if an edit war would create a race condition and some cache refresh actually got the vandalized version.
[15:50:02] wonder if there might have been some conflict with the concurrency settings maybe
[15:53:40] so, hnowlan, what's the right way of clearing up the cache for the affected page? If we consider that change-prop and kafka are working correctly and the high volume + concurrency are the problem, triggering a null_edit event for the template should be enough, right?
[16:00:13] <_joe_> mbsantos: the "wrong" version is still in restbase?
[16:00:25] yes
[16:08:49] <_joe_> mbsantos: I think so, yes
[16:09:00] <_joe_> (re: null edit)
[16:09:09] <_joe_> it will take some time though :)
[16:09:47] I can imagine. I've never done that before; can someone assist me? Who should I reach out to for that?
[16:10:02] can't hurt either way apart from time, yeah. If we're worried about volume + concurrency then it might recur, but honestly I am not entirely convinced this is our issue
[16:11:46] So, an Apps engineer posted on a Slack thread that: "In the Lucy Parsons case, it looks like the page/summary endpoint still appears vandalized. (This is what the app uses for the lead image)"
[16:11:58] It made me wonder if the problem was always about the summary endpoint, not mobile-html
[16:12:13] is that useful for the investigation?
[16:16:57] <_joe_> mbsantos: not really for us, maybe for y'all? Aren't we changing how the summary is rendered/served?
[16:17:03] <_joe_> sorry, still in meetings
[16:18:31] <_joe_> mbsantos: uhh https://es.wikipedia.org/api/rest_v1/page/mobile-html/Peste seems correct to me?
[16:18:44] no worries
[16:19:16] <_joe_> did someone purge that url specifically?
[16:19:24] the phab description is outdated, I believe
[16:19:46] further down, when it was re-opened - the problem is related to any page with the infobox template for personalities
[16:20:09] <_joe_> yeah
[16:20:18] <_joe_> can I have one page that's vandalized now?
[16:20:25] <_joe_> I see none in the phab task
[16:20:41] I am also trying to find a page
[16:20:46] <_joe_> I want specifically to check their edit history and see if there's any race condition
[16:21:16] <_joe_> I could also try to search for it in the purge topic
[16:21:17] we had 2 reported that we were able to see vandalized before purging half an hour ago, would that suffice?
[16:21:37] <_joe_> I would prefer one still vandalized
[16:21:46] <_joe_> but yes, that could help too
[16:21:52] https://es.wikipedia.org/wiki/Lucy_Parsons - was vandalized 30 minutes ago
[16:24:57] <_joe_> ok so
[16:25:15] <_joe_> that article hasn't been edited in forever, that already removes my first hypothesis
[16:25:37] <_joe_> mbsantos: if it's just page/summary, are we now serving that from mediawiki or PCS?
[16:25:48] <_joe_> still PCS, right?
[16:26:21] <_joe_>
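On the "null edit" route discussed above: from the command line, the closest standard equivalent is the MediaWiki Action API's purge module. A sketch only; whether a purge of the template propagates all the way to the RESTBase-stored summary/mobile-html depends on the change-prop rules, so treat this as the generic mechanism rather than a confirmed fix:

    # Purge a single affected article.
    curl -s -X POST "https://es.wikipedia.org/w/api.php" \
      --data-urlencode "action=purge" \
      --data-urlencode "titles=Lucy Parsons" \
      --data-urlencode "forcelinkupdate=1" \
      --data-urlencode "format=json"

    # Or purge the template and queue link updates for every page transcluding it.
    curl -s -X POST "https://es.wikipedia.org/w/api.php" \
      --data-urlencode "action=purge" \
      --data-urlencode "titles=Plantilla:Ficha de persona" \
      --data-urlencode "forcerecursivelinkupdate=1" \
      --data-urlencode "format=json"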