[08:38:01] 06serviceops, 10Deployments, 06Release-Engineering-Team: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867 (10hashar) 03NEW
[08:39:26] 06serviceops, 10Deployments, 06Release-Engineering-Team: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9656726 (10hashar) That got followed by a 503 which I haven't found the root cause for: ` 08:28:14 Executing check 'check_test...
[08:55:15] hi folks, I have re-deployed the prometheus patch to only fetch envoy in-use metrics, I've checked e.g. https://grafana.wikimedia.org/d/b1jttnFMz/envoy-telemetry-k8s for eqiad and things seem to be right, please double check
[08:56:05] the change being https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013515?usp=dashboard
[08:57:08] <_joe_> godog: LGTM right now
[08:57:39] ack, thank you _joe_, I'll reenable puppet in codfw too
[09:00:22] <_joe_> I did check eqiad did I?
[09:43:21] LGTM godog
[10:41:55] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9656987 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2336.codfw.wmnet with OS bullseye
[10:42:23] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9656990 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2337.codfw.wmnet with OS bullseye
[10:42:51] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9656991 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2386.codfw.wmnet with OS bullseye
[10:43:20] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9656995 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2387.codfw.wmnet with OS bullseye
[10:43:48] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9656996 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2388.codfw.wmnet with OS bullseye
[10:44:26] 06serviceops, 10Deployments, 06Release-Engineering-Team: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9656998 (10hashar) >>! In T360867#9656730, @Joe wrote: > I don't think httpbb tests should really break deployment, but rather...
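For context on the kind of double check asked for above, a minimal sketch of an instant query against the Prometheus HTTP API is given below; the base URL and the envoy metric name are illustrative assumptions, not taken from the patch itself.

```python
# Sketch only: spot-check that an envoy metric is still being scraped after the
# config change. Base URL and metric name are assumed, not taken from the patch.
import json
import urllib.parse
import urllib.request

PROM_BASE = "https://prometheus-eqiad.wikimedia.org/k8s"  # assumed instance path
PROMQL = "count(envoy_cluster_upstream_rq_total)"         # assumed in-use envoy metric

def instant_query(base_url: str, promql: str) -> list:
    """Run an instant query against the Prometheus HTTP API and return the result list."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]

if __name__ == "__main__":
    result = instant_query(PROM_BASE, PROMQL)
    # An empty result would mean the metric disappeared after the change.
    print(result[0]["value"][1] if result else "no series found")
```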
[10:44:27] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657000 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cgoubert@cumin1002 for host mw2389.codfw.wmnet with OS bullseye
[10:47:45] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657017 (10Clement_Goubert) 05Open→03In progress
[10:59:48] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657076 (10Jgiannelos)
[11:01:08] 06serviceops, 10Deployments, 06Release-Engineering-Team: httpbb appserver test breaks deployment of the week due to a timeout parsing page - https://phabricator.wikimedia.org/T360867#9657089 (10hashar) From the log server, the page routinely takes more than 10 seconds to parse :/ ` zgrep 'Parsing Barack Oba...
[11:02:06] to double check, is eqiad still depooled?
[11:08:08] <_joe_> Amir1: lol no?
[11:08:14] <_joe_> Amir1: you mean codfw?
[11:08:17] ah yeah
[11:08:21] <_joe_> we've moved to eqiad
[11:08:24] sorry, switchover confusion time
[11:08:29] <_joe_> so eqiad is very much not depooled
[11:08:34] <_joe_> codfw, OTOH, is
[11:08:38] <_joe_> for another day or two
[11:08:41] until when? Tuesday?
[11:08:56] good to know, I'll do some stuff today then
[11:08:59] <_joe_> Amir1: as long as you need, preferably no later than wednesday
[11:10:04] sure
[11:20:29] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657159 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2386.codfw.wmnet with OS bullseye completed: - mw23...
[11:21:56] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657162 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2388.codfw.wmnet with OS bullseye completed: - mw23...
[11:24:19] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657172 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2336.codfw.wmnet with OS bullseye completed: - mw23...
[11:25:50] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657178 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2387.codfw.wmnet with OS bullseye completed: - mw23...
[11:26:08] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657181 (10Jgiannelos) From logs I think there are 2 things to investigate: * What happened since ~10th March ? *...
[11:27:52] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657192 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2389.codfw.wmnet with OS bullseye completed: - mw23...
[11:30:31] 06serviceops, 10MW-on-K8s, 13Patch-For-Review: Move servers from the appserver/api cluster to kubernetes - https://phabricator.wikimedia.org/T351074#9657205 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cgoubert@cumin1002 for host mw2337.codfw.wmnet with OS bullseye completed: - mw23...
[11:36:48] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657229 (10hnowlan) Turnilo says that this is mostly being caused by clients using the mobile apps, various versi...
[11:40:38] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657241 (10hnowlan) >>! In T360597#9657181, @Jgiannelos wrote: > From logs I think there are 2 things to investig...
[11:45:24] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657257 (10Jgiannelos) I was trying to see if there is a correlation between this issue and switching over parsoi...
[12:14:55] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate changeprop to mw-api-int - https://phabricator.wikimedia.org/T360767#9657385 (10Clement_Goubert) `mw-api-int` is now receiving all calls to `mwapi_uri` from changeprop {F43323601} There are still calls coming from the `ChangePropagation/WM...
[12:18:13] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: 14Migrate changeprop to mw-api-int - 14https://phabricator.wikimedia.org/T360767#9657393 (10Clement_Goubert) 05In progress→03Resolved
[12:18:44] 06serviceops, 10MW-on-K8s, 10RESTBase, 06SRE, 13Patch-For-Review: Migrate restbase from mwapi-async to mw-api-int - https://phabricator.wikimedia.org/T358213#9657395 (10Clement_Goubert)
[12:23:56] 06serviceops, 10MW-on-K8s, 07Video: 14Create new flavour of shellbox for video transcoding - 14https://phabricator.wikimedia.org/T357296#9657406 (10kamila) 05Open→03Resolved a:03kamila 14Based on some quick tests the image seems to be working \o/
[12:25:07] 06serviceops, 10MW-on-K8s, 06SRE, 06Traffic, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9657411 (10Clement_Goubert)
[12:45:04] 06serviceops, 10Prod-Kubernetes, 10Data-Platform-SRE (2024.03.25 - 2024.04.14), 07Kubernetes, 13Patch-For-Review: 14Migrate an example chart to the Calico network policies template - 14https://phabricator.wikimedia.org/T359411#9657459 (10brouberol) 05Open→03Resolved 14Both `superset-staging` and...
[12:45:26] 06serviceops, 06Data-Platform-SRE, 10Prod-Kubernetes, 07Kubernetes: Migrate charts to Calico Network Policies - https://phabricator.wikimedia.org/T359423#9657463 (10brouberol)
[12:54:50] hnowlan: 👋 for T360597 i have this feeling that the problem is redirects again. All the failing `requests.url` from logstash are redirects to commons. Is there any way I can see internally (eg from the container) what this returns:
[12:54:59] curl http://localhost:6503/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg
[12:59:29] nemo-yiannis: https://phabricator.wikimedia.org/P58907
[13:00:20] yeah same problem: Wikifeed times out because the location is the public URL
[13:00:43] aha
[13:01:00] 06serviceops, 06Data Products: Service Ops Review of Metrics Platform Configuration Management UI - https://phabricator.wikimedia.org/T358577#9657504 (10MShilova_WMF) Thank you, @akosiaris. I've just added you as a subscriber to {T358115}. Let me know if it automatically granted you access.
[13:01:27] I dunno if this is the same as the issue causing the increased latency though
[13:01:52] the increased error rate/timeouts though are because of that
[13:02:59] i will update the ticket with the information and continue investigating what causes the increased latency
[13:05:10] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657528 (10Jgiannelos) From the URLs from logstash as @hnowlan pointed out it looks like the main cause of timeou...
[13:06:00] cool
[13:06:11] the timing of those restbase nodes being added is *very* fishy imo
[13:07:59] yeah i dont think this is the root cause for the increase on the 10th of march, but the fact we had a spike last weekend probably added to the latency and triggered the alerts
[13:08:24] (plus the switchover could have had an impact too)
[13:09:19] yeah switchover means that all errors are occurring in a single datacentre so they'll go above thresholds
[13:09:25] yeah
[13:15:13] hnowlan: I am not very familiar with the SAL logs for pooling/depooling nodes but FWIW scap targets were not updated so not sure what state of restbase the new nodes are serving
[13:15:30] i just did a git pull on restbase/deploy and the changes to targets were fetched
[13:16:22] me neither tbh - they're pooled and so will be receiving requests
[13:16:23] So this: https://gerrit.wikimedia.org/r/c/mediawiki/services/restbase/deploy/+/1009842 is not deployed
[13:16:42] aha
[13:16:46] (unless done in another manual way other than `scap deploy`)
[13:16:52] they'll have pulled whatever the master version was at the time of provisioning
[13:16:56] ok
[13:17:04] should we do a deploy?
[13:17:15] I can do a scap deploy
[13:17:29] but before that, can somebody check which hash of restbase we are running?
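A minimal sketch of the redirect check discussed above (the curl against the container's summary endpoint), using the same host, port, and path as the curl in the log; http.client never follows redirects, so a Location header pointing at a public URL shows up directly.

```python
# Sketch of the in-container check above: request the summary path and print the
# status plus Location header without following the redirect. Host and port are
# taken from the curl command in the log and are only reachable from the container.
import http.client

HOST, PORT = "localhost", 6503
PATH = "/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg"

conn = http.client.HTTPConnection(HOST, PORT, timeout=10)
conn.request("GET", PATH)
resp = conn.getresponse()
print(resp.status, resp.reason)
print("Location:", resp.getheader("Location"))  # a public URL here is the suspected problem
conn.close()
```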
[13:17:50] wary about that disabled puppet state on those hosts but it's been a few days
[13:18:16] on restbase1042 it's 7e5e72087d8331131669babfb8f40b269c024cd7
[13:19:08] either way scap overrides configurations so if scap is not running, config would be different
[13:19:24] *has not run
[13:22:45] ok i think thats even more interesting: https://phabricator.wikimedia.org/P58908
[13:22:58] check the differences between the nodes' redirect url
[13:23:32] We should definitely bring the nodes to the same state
[13:23:41] if you think its OK i can run a scap deploy on current master
[13:24:52] the responses are very problematic
[13:28:29] there's nothing in the diffs that should cause that kind of deviation in behaviour surely
[13:28:35] but I'd say go for it
[13:29:15] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657593 (10Jgiannelos) It looks like this path was not deployed using scap: https://gerrit.wikimedia.org/r/c/medi...
[13:30:39] should i try just one node of the failing ones ?
[13:31:01] to see if the response changes after the deployment?
[13:31:11] please do
[13:35:26] didn't help much
[13:36:06] hm :/ which host did you deploy to?
[13:36:29] 1034
[13:37:04] restbase1034.eqiad.wmnet
[13:39:38] Also with "cache-control: no-cache" restbase returns 404
[13:39:42] curl -v -o /dev/null "http://restbase1034.eqiad.wmnet:7233/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg" -H "Cache-control: no-cache"
[13:46:23] all hosts do the same it seems
[13:48:37] that 404 isn't coming from restbase though
[13:48:56] oh wait no, ignore
[13:50:34] yeah all of them are returning 404
[13:50:51] which means we are serving the latest good state in cassandra
[13:53:15] now we need to find why RB returns 404 and what 500 it hides :P
[13:57:50] wait so
[13:57:59] restbase1034 now returns the correct location header
[13:58:14] where it previously had an incorrect one?
[14:00:11] If that's correct then doing a scap deploy to get everyone on the same version will get things in better shape
[14:01:25] no it doesn't
[14:01:29] it returns the public URL
[14:01:40] (which I assume is the last known state for this node)
[14:01:56] because forcing purge returns 404
[14:05:14] (which i believe really is an RB error)
[14:45:41] claime: o/ I'd need to deploy https://gerrit.wikimedia.org/r/c/operations/puppet/+/1013541 to upgrade the eqiad docker registry nodes (the standby ones). Ok from your point of view or better to wait?
[14:48:48] <_joe_> elukey: I'd say go on, just make sure no mediawiki deployment is ongoing
[14:48:54] +1 thanks :)
[14:52:09] 06serviceops, 06Machine-Learning-Team, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9657834 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=78701a88-bd13-4896-9ad1-88076e82347e) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[14:52:31] 06serviceops, 06Machine-Learning-Team, 13Patch-For-Review: Bump memory for registry[12]00[34] VMs - https://phabricator.wikimedia.org/T360637#9657836 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=9cabb1e2-3230-40ba-8e89-bce14ddf9042) set by elukey@cumin1002 for 1:00:00 on 1 host(s) and...
[15:04:56] bumped both nodes, all good :)
[15:05:04] going to schedule the work for codfw then
[15:05:16] when would be the best time to do it?
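Returning to the restbase checks above: the per-host curl probes (with and without `Cache-control: no-cache`) can be repeated across several nodes to compare status codes and Location headers. A rough sketch follows; the host list is illustrative, not the set of nodes that were actually affected.

```python
# Rough sketch of the per-host comparison above: request the same summary path on
# a few restbase hosts, with and without "Cache-control: no-cache", and print the
# status code and Location header for each. The host list is illustrative only.
import http.client

HOSTS = ["restbase1034.eqiad.wmnet", "restbase1042.eqiad.wmnet"]  # assumed sample
PORT = 7233
PATH = "/fr.wikipedia.org/v1/page/summary/Fichier%3ACleopatra_poster.jpg"

def probe(host: str, no_cache: bool):
    """Return (status, location) for one request; http.client never follows redirects."""
    headers = {"Cache-control": "no-cache"} if no_cache else {}
    conn = http.client.HTTPConnection(host, PORT, timeout=10)
    try:
        conn.request("GET", PATH, headers=headers)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()

for host in HOSTS:
    for no_cache in (False, True):
        status, location = probe(host, no_cache)
        label = "no-cache" if no_cache else "cached"
        print(f"{host:<28} {label:<8} {status} {location}")
```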
[15:05:32] surely far from mw deployments or busy deploy schedules
[15:20:51] elukey: not in mw-deploy windows. I think you could claim an mw infrastructure window for that
[15:21:11] makes sense, yes
[15:27:37] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9657965 (10Eevans) >>! In T360597#9657227, @hnowlan wrote: > Turnilo says that this is mostly being caused by cli...
[15:36:13] jayme: the new prometheus job for istio seems to work as expected, and the labels are dropped
[15:42:21] the change added about 16k samples/s more to the scraping, is that expected ?
[15:42:57] I'm looking at this for example
[15:42:59] https://grafana.wikimedia.org/d/GWvEXWDZk/prometheus-server?orgId=1&refresh=1m&var-Prometheus=prometheus1005%3A9906&var-RuleGroup=All&var-datasource=thanos&var-name=k8s&from=1711370571232&to=1711381371232&viewPanel=1
[15:44:26] k8s-pods-istio (4848/10668 up) mmhh I don't think that's expected, so many targets?
[15:44:51] I'm looking at this https://prometheus-eqiad.wikimedia.org/k8s/targets?search=&scrapePool=k8s-pods-istio
[15:44:54] elukey: ^
[15:46:22] godog: argh I was about to ask in #observability if there were metrics to check
[15:46:47] in theory no, I was spot checking on the thanos ui for before/after of some metrics
[15:47:22] we should be matching __meta_kubernetes_pod_annotation_sidecar_istio_io_inject to be true I think? otherwise the k8s-pods-istio job on prometheus k8s is trying to fetch from all targets
[15:48:07] godog: either true or false, since "true" is set on sidecars and "false" on gateways, but at this point the regex that I added (.*) is wrong?
[15:48:15] yeah that should be .+
[15:48:22] * elukey cries in a corner
[15:48:27] of course sorry :(
[15:48:28] fixing
[15:48:38] probably in both places
[15:49:07] godog: or (true|false), wdyt?
[15:49:24] sure that works too elukey
[15:53:21] godog: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1014035
[15:53:25] sorry again :(
[15:53:36] after the rollout is there any cleanup that I can do?
[15:54:32] elukey: I think the regex needs changing in the original job for symmetry
[15:55:07] true true
[15:55:49] fixed :)
[15:57:29] elukey: no cleanup, that's fine
[15:57:33] patch LGTM
[15:57:41] super, rolling out in a bit
[16:00:38] started now
[16:12:31] k8s-pods-istio (176/176 up)
[16:12:33] that's more like it
[16:14:08] godog: where do I see that info?
[16:14:14] anyway thanks a ton for checking
[16:14:30] elukey: https://prometheus-eqiad.wikimedia.org/k8s/targets?search=&scrapePool=k8s-pods-istio
[16:14:44] * elukey bookmarked
[16:39:38] 06serviceops, 10Prod-Kubernetes: PodSecurityPolicies will be deprecated with Kubernetes 1.21 - https://phabricator.wikimedia.org/T273507#9658418 (10elukey) @JMeybohm thanks a lot for the great wikipage, it explains the problem very well. The only thing that worries me is the maintenance of those extra policies...
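The relabel regex discussed above (`.*` vs `.+` vs `(true|false)`) matters because a pod without the sidecar annotation yields an empty label value, and the anchored `.*` in a keep rule still matches that empty string, so the job kept every pod. A small sketch of that anchored-match behaviour (an illustration only, not the actual puppet/Prometheus config):

```python
# Sketch of why the ".*" keep-regex matched far too many targets: a pod without
# the sidecar annotation yields an empty label value, and an anchored ".*" still
# matches it, while ".+" or "(true|false)" does not. Illustration only.
import re

annotation_values = ["true", "false", ""]  # "" stands for a pod with no annotation

for pattern in (".*", ".+", "(true|false)"):
    kept = [repr(v) for v in annotation_values if re.fullmatch(pattern, v)]
    print(f"regex {pattern!r:>15} keeps {kept}")
```

This lines up with the target counts in the log: 4848/10668 up before the fix, 176/176 up afterwards.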
[16:56:08] 06serviceops, 10Data Products (Data Products Sprint 11): Service Ops Review of Metrics Platform Configuration Management UI - https://phabricator.wikimedia.org/T358577#9658508 (10VirginiaPoundstone) a:03phuedx
[16:56:11] 06serviceops, 10Data Products (Data Products Sprint 11): Service Ops Review of Metrics Platform Configuration Management UI - https://phabricator.wikimedia.org/T358577#9658502 (10VirginiaPoundstone)
[17:18:34] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9658606 (10Jgiannelos) It looks like errors/latency are stabilized after depooling some nodes: {F43348837} {F433...
[17:18:53] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9658608 (10Eevans) The restbase deployments are somewhat out of sync, with HEAD (eqiad) looking like: ` restbase...
[18:01:18] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9658800 (10Eevans) After experimenting with targeted deployments, both to hosts that correlate to the correct beh...
[18:55:21] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9659020 (10Eevans) From IRC: ` 1:13 PM   so apparently the 404 is expected behaviour for "cache-con...
[18:56:39] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9659023 (10Jgiannelos) So the root cause looks to be the following: * Apparently the 404 is expected behaviour...
[18:57:38] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9659025 (10Eevans) >>! In T360597#9659020, @Eevans wrote: > From IRC: > > ` > 1:13 PM   so apparent...
[20:01:11] deploying a change to the prometheus-apache-exporter that will make it work on all distros, including bookworm
[20:01:43] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: Increased latency, timeouts from wikifeeds since march 10th - https://phabricator.wikimedia.org/T360597#9659275 (10Eevans) What changed here was that some hosts have been deployed with ipv6 dns records: ` eevans@rest...
[21:00:42] 06serviceops, 06Content-Transform-Team, 06Content-Transform-Team-WIP, 13Patch-For-Review: 14Increased latency, timeouts from wikifeeds since march 10th - 14https://phabricator.wikimedia.org/T360597#9659525 (10Eevans) 05Open→03Resolved a:03Eevans 14The ipv6 dns records for all restbase hosts have...
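The root cause above was that only some restbase hosts had been given ipv6 dns records. A quick way to see which hosts resolve to IPv6 is sketched below, using the system resolver; the host names are illustrative examples, not the full affected set.

```python
# Sketch: list A and AAAA results for some restbase hosts via the system resolver,
# to see which ones have ipv6 dns records. Host names are illustrative examples.
import socket

HOSTS = ["restbase1034.eqiad.wmnet", "restbase1042.eqiad.wmnet"]  # assumed sample

for host in HOSTS:
    try:
        infos = socket.getaddrinfo(host, None)
    except socket.gaierror as err:
        print(f"{host}: resolution failed ({err})")
        continue
    v4 = sorted({info[4][0] for info in infos if info[0] == socket.AF_INET})
    v6 = sorted({info[4][0] for info in infos if info[0] == socket.AF_INET6})
    print(f"{host}: A={v4 or 'none'} AAAA={v6 or 'none'}")
```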