[05:23:56] 10serviceops, 10All-and-every-Wikisource, 10Thumbor, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Joe) >>! In T337649#8939812, @Ladsgroup wrote: > oh I actually might know what's ha...
[06:30:40] 10serviceops, 10SRE: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (10Joe) 05In progress→03Resolved
[06:30:45] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Joe)
[06:44:27] hi folks!
[06:44:36] the toolhub's eqiad pods are not feeling great
[06:44:36] NAME READY STATUS RESTARTS AGE
[06:44:40] toolhub-main-5557c9fc9c-7ftvl 3/4 CrashLoopBackOff 572 (5m8s ago) 38d
[06:44:43] toolhub-main-5557c9fc9c-d59s8 3/4 CrashLoopBackOff 858 (67s ago) 38d
[06:44:46] toolhub-main-crawler-28139400-7vj6v 1/2 Error 0 43m
[06:45:00] I didn't get a specific error, trying to delete one pod to see if anything changes
[06:45:06] the codfw ones are ok
[06:52:28] the crawler is now up, but the main pods fail the readiness probe
[06:55:03] ah also toolhub.wikimedia.org is down, lovely
[06:55:40] ahhh it is active/passive
[06:56:20] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10Marostegui) For what is worth, in this case, the issue was changing a dbproxy tha...
[06:58:52] from https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m&from=now-2d&to=now it seems that the issues started around the 1st at ~21:30/40 UTC
[06:59:53] I have no idea if an eqiad -> codfw failover is ok for toolhub, and since it has been down for the weekend it doesn't seem to be a huge priority, I'll wait for folks in serviceops to join this chan
[07:15:53] 10serviceops, 10MW-on-K8s: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe)
[07:27:24] 10serviceops, 10MW-on-K8s: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe) So, what I figured out from reading the apache documentation at https://httpd.apache.org/docs/current/mod/mod_log_config.html (emphasis mine) > Format Notes > For security reasons, starting...
[07:27:59] 10serviceops, 10MW-on-K8s: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe)
[07:38:53] so I see, in the crawler:
[07:38:54] MySQLdb._exceptions.OperationalError: (2002, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)")
[07:42:51] <_joe_> elukey: why would we be on the hook for toolhub?
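(The crawler traceback above shows MySQL error 2002: the TCP connection to m5-master.eqiad.wmnet never completed. A quick way to tell a silent drop, which is what a missing Kubernetes NetworkPolicy egress rule looks like and what is diagnosed just below, from an active refusal is a raw socket probe. This is only a hedged diagnostic sketch, not something that was run during the incident; the port 3306 and the 5-second timeout are assumptions.)

```python
import socket

HOST = "m5-master.eqiad.wmnet"  # host taken from the crawler traceback
PORT = 3306                     # assumed: standard MySQL port

def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a TCP connect and report how it failed (or succeeded)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected: host reachable, the problem is above TCP"
    except socket.timeout:
        # Packets silently dropped -- consistent with a firewall or a
        # missing NetworkPolicy egress rule rather than a down server.
        return "timed out: packets likely dropped (policy/firewall?)"
    except ConnectionRefusedError:
        return "refused: host reachable but nothing listening on that port"
    except OSError as exc:
        return f"other OS error: {exc}"

if __name__ == "__main__":
    print(probe(HOST, PORT))
```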
[07:43:12] <_joe_> I mean it's not like "runs on k8s" -> "serviceops has to fix it"
[07:43:47] _joe_ good morning to you :)
[07:44:04] I just asked if anybody had ideas, already pinged people on #cloud
[07:44:09] <_joe_> and to answer your question: no, toolhub is eqiad only
[07:45:00] elukey: I suspect it's the same thing as in T340780, data-persistence is putting new dbproxies in service and the new hosts are not included in the k8s network policies
[07:45:33] <_joe_> yep
[07:45:43] <_joe_> better coordination would be welcome
[07:46:09] taavi: ahhh nice, lemme check
[07:46:23] <_joe_> "nice" :P
[07:48:21] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/934866
[07:53:10] adding the new netpolicy, let's see
[07:55:45] worked, https://toolhub.wikimedia.org/ is up
[08:07:41] _joe_: We got tagged on https://phabricator.wikimedia.org/T336627 do you know what's the job for account autocreation? Is it CentralAuthCreateLocalAccountJob?
[08:08:16] <_joe_> I would guess so
[08:08:30] yes
[08:12:37] Ok then they're right, it took ~30 minutes at that time. It's since gotten a lot better, and we added quite a few servers to the jobrunners pool
[08:12:48] Maybe since that job is time critical it should have its own lane
[08:13:17] <_joe_> claime: and/or an alert
[08:13:45] I'll check that out
[08:59:13] I don't know how pertinent it is to give it its own lane, that's supposed to be for high-traffic jobs, and it's not high-traffic, we just don't want it to wait
[09:00:08] <_joe_> it's a slippery slope
[09:00:55] <_joe_> now I'd say that we'd need an SLO for jobs, maybe with 3 different levels, and someone should decide on what latency is acceptable
[09:01:18] <_joe_> more to my point: if we want something to happen "guaranteed quick", it shouldn't be a job
[09:01:32] <_joe_> because we might have to turn off the jobqueue temporarily for any reason
[09:05:20] I know, but in the current state it is one
[09:05:37] And half an hour for account creation propagation is a bit much :/
[09:10:03] do we have stats on how long the job execution takes? I'm wondering if it would be reasonable to do the auto-creations on the initial sign-up request
[09:10:32] or hm. it's cross-wiki, so I don't think I can do that very easily
[09:27:58] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[09:38:49] 10serviceops, 10All-and-every-Wikisource, 10Thumbor, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Ladsgroup) Yeah yeah, For the first 50 pages. The problem was that it was queuing a...
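(taavi's diagnosis above was that the new dbproxy hosts were not covered by the toolhub namespace's egress NetworkPolicies, and the fix was the deployment-charts patch linked at 07:48. A rough way to verify that kind of gap is to list a namespace's policies and check whether any egress ipBlock covers a given address. A speculative sketch using the kubernetes Python client; the "toolhub" namespace name and the example IP are assumptions, not values from the incident.)

```python
import ipaddress
from kubernetes import client, config

NAMESPACE = "toolhub"    # assumed namespace name
TARGET_IP = "10.64.0.1"  # hypothetical dbproxy IP, stand-in only

def egress_cidrs(namespace: str):
    """Yield (policy name, CIDR) for every egress ipBlock in the namespace's NetworkPolicies."""
    config.load_kube_config()
    netv1 = client.NetworkingV1Api()
    for pol in netv1.list_namespaced_network_policy(namespace).items:
        for rule in (pol.spec.egress or []):
            for peer in (rule.to or []):
                if peer.ip_block:
                    yield pol.metadata.name, peer.ip_block.cidr

def is_covered(namespace: str, ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    for name, cidr in egress_cidrs(namespace):
        if addr in ipaddress.ip_network(cidr):
            print(f"{ip} allowed by policy {name} ({cidr})")
            return True
    print(f"{ip} not covered by any egress rule -- traffic will be dropped")
    return False

if __name__ == "__main__":
    is_covered(NAMESPACE, TARGET_IP)
```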
[09:50:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (10JMeybohm)
[09:53:42] filed https://phabricator.wikimedia.org/T340950 as follow up for the toolhub incident
[09:58:43] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10fgiunchedi) + o11y-logging for visibility, the un-escaping I think will work here (not tested)
[10:08:54] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10elukey) Adding another datapoint: T340950 Toolhub uses `m5-master.eqiad.wmnet`,...
[10:11:14] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10Urbanecm_WMF) >>! In T340843#8984877, @elukey wrote: > Toolhub uses `m5-master.eq...
[10:22:25] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[10:23:06] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[10:55:27] 10serviceops, 10SRE, 10Traffic, 10envoy: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm)
[12:06:17] 10serviceops, 10Kubernetes: Integrate kube-metrics-server into our infrastructure - https://phabricator.wikimedia.org/T249929 (10BTullis)
[12:46:57] 10serviceops, 10Machine-Learning-Team, 10Platform Team Initiatives (API Gateway): Review LiftWing's usage of the API Gateway - https://phabricator.wikimedia.org/T340982 (10elukey)
[12:50:03] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10akosiaris) Could you please add some more information. e.g. I am not sure what exactly makes it invalid json. e.g. ` In [1]: import json In [2]: a=open('a.json', 'r')...
[13:42:38] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe) You must be doing something to the string when loading it from file, because `\\` is not valid json: ` astring = '{"a": "\\xe2"}' json.loads(astring) ... json.deco...
[13:50:15] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye
[14:13:04] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.0 - https://phabricator.wikimedia.org/T340087 (10TheresNoTime)
[14:13:15] 10serviceops, 10Beta-Cluster-Infrastructure, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Install wikidiff2 1.14.0 deb on deployment-prep & test - https://phabricator.wikimedia.org/T340542 (10TheresNoTime) 05Open→03In progress >>! In T340542#8973199, @dom_walden wrote: > Ther...
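(To make the T340935 exchange above concrete: Apache's mod_log_config escapes non-printable bytes as \xhh, and a bare \x escape is not valid JSON, which is what Joe's snippet in the 13:42 comment shows. The sketch below reproduces the failure and one possible workaround, roughly the un-escaping fgiunchedi mentions; the example log line is made up and this is not the fix that was actually applied on the task.)

```python
import json

# A log line the way apache would write it: literal backslash-x escapes.
raw = r'{"uri": "/wiki/Foo", "ua": "curl\xe2\x80\x93test"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    # "Invalid \escape": \x is not one of the escapes JSON allows.
    print(f"parse failed as expected: {exc}")

# Workaround: double every backslash, so the \xhh sequences survive
# as plain text instead of being rejected by the JSON parser.
parsed = json.loads(raw.replace("\\", "\\\\"))
print(parsed["ua"])   # -> curl\xe2\x80\x93test (escapes kept verbatim)
```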
[14:18:32] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10akosiaris) ah yeah, python will just DTRT and escape the backslashes properly
[14:25:58] 10serviceops, 10MediaWiki-extensions-CentralAuth, 10Stewards-and-global-tools, 10WMF-JobQueue, 10Patch-For-Review: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) - https://phabricator.wikimedia.org/T336627 (10Clement_Goubert) These issues seem to have occurred before we added c...
[14:26:07] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye completed: - kube...
[15:09:52] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[15:50:37] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) mw and restbase canaries as well as mathoid are running 1.23.10 since today. If nothing comes up I will roll the update out to the rest of the fleet tom...
[16:29:25] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[16:33:21] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10akosiaris) We 've discussed this in the #serviceops meeting today. We touched on...
[18:52:41] claime, effie, akosiaris: The Job Queue is backed up by over two hours... are you aware of this?
[18:53:32] There was a massive spike in parsoid cache warming jobs being inserted at 12:30 UTC.
[18:54:41] Huh ?
[18:55:21] Nope, didn't see that, but you're lucky I was near my computer, it's 9PM here
[18:55:48] claime: yes, here too :)
[18:56:01] I was just about to head for the klaxon
[18:56:28] I just realized that the wait time for individual jobs is different. So this only affects the parsoid jobs. Still not good.
[18:57:17] yeah and it's not concurrency, or at least I don't think so, even if the graphs are mostly wrong we're nowhere near max
[18:57:35] Hm.. the refreshlinks job (the equivalent for the old parser) is backlogged by 4 hours - but it's just *always* like that?
[18:57:54] jobrunners are fine
[18:58:33] I am looking at https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&viewPanel=5
[18:59:22] refreshlinks has been backed up for a few days apparently
[18:59:38] it's more spiky before 30/6
[19:00:25] right
[19:00:48] Should we have alerts for these metrics? I have never used grafana alerts...
[19:00:58] you know what, I'm gonna up the concurrency again
[19:01:15] We should fix the metrics first so they're aggregatable, then we can make alerts
[19:01:19] This is in progress
[19:01:44] I thought the metrics issue only affected the concurrency, not wait time?
[19:01:53] refreshLinks is partitioned by mediawiki_database btw, and has a total concurrency of 240
[19:02:16] duesen: wait time is also a summary and not a histogram
[19:02:43] hm, ok.
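(On the "summary and not a histogram" point just above: a summary exports pre-computed quantiles per instance, and those cannot be combined into a fleet-wide quantile, whereas histogram buckets can be summed and re-quantiled, which is why the wait-time panel is hard to aggregate and to alert on. A toy illustration with made-up latency numbers; nothing here comes from the real changeprop or jobqueue metrics.)

```python
import statistics

# Made-up per-runner job wait times, in seconds.
runner_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 60]   # one very slow job
runner_b = [1] * 10                           # all fast

def p90(xs):
    """90th percentile via statistics.quantiles (inclusive method)."""
    return statistics.quantiles(xs, n=10, method="inclusive")[-1]

# What a summary gives you: one pre-computed p90 per instance.
per_instance = [p90(runner_a), p90(runner_b)]
naive_global = statistics.mean(per_instance)   # averaging quantiles is not a quantile

# What histogram buckets would allow: recomputing over all observations.
true_global = p90(runner_a + runner_b)

print(f"mean of per-instance p90s: {naive_global:.1f}s")
print(f"p90 over all jobs:         {true_global:.1f}s")   # the two disagree
```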
[19:03:49] I'll up concurrency again, see if it does anything, if not we'll look into partitioning maybe
[19:04:10] but I don't think I can look into partitioning tonight, too much I don't know
[19:08:25] in CI
[19:11:46] duesen: deployed a bump to 150
[19:21:32] we're at least processing a bit faster, but the insertion rate keeps rising so idk
[19:24:17] duesen: https://grafana.wikimedia.org/goto/PEefK1rVz?orgId=1
[19:24:22] That's a lot of messaging
[19:25:27] wow
[19:25:43] what does "messaging" mean here? what does each message correspond to?
[19:26:44] I think that's the number of kafka messages representing a job
[19:26:58] so effectively how many messages there are in backlog
[19:28:26] back-of-the-envelope calculation says we should resolve it in around 2h
[19:29:00] 1.6M messages / ~250 ops/s
[19:29:02] So, what happened in codfw that the job insertion rate suddenly jumped from less than 100 to more than 250?
[19:29:15] we have been having an elevated error rate
[19:29:21] generally speaking
[19:29:28] https://www.wikimediastatus.net/
[19:30:35] 18 errors per second?
[19:30:40] What?
[19:30:57] which is also evident here https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=63
[19:31:04] ok something else is going on
[19:31:57] There's nothing in SAL around the time of that rise
[19:32:29] ok let's see NEL and logstash
[19:32:36] looking at logstash
[19:32:38] I mean, NEL
[19:32:43] looking at NEL
[19:33:03] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&var-site=All&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All&from=now-12h&to=now&viewPanel=63
[19:33:09] I don't see much here
[19:33:33] big exception spike around 1550
[19:33:38] There is an increase and it appears to coincide, but it's not 18 errors per second
[19:33:40] [{reqId}] {exception_url} JobQueueError: Could not enqueue jobs
[19:33:54] and then nothing much on mw backend errors
[19:35:39] still, a lot of jobqueue errors
[19:38:30] akosiaris: it could be a side effect rather than a coincidence
[19:43:20] I think what we are seeing may be a stampede effect triggered when a template changes that is used on pages that are viewed a lot.
[19:43:56] When a "stale" page is viewed, we put a job on the queue to re-render it. And while we wait for that to happen, we put more jobs on the queue.
[19:44:27] And the more jobs we put into the queue, the longer it takes for things to get rendered, so we end up putting more jobs on the queue...
[19:44:35] duesen: How long should a re-render job be considered "valid"?
[19:44:41] We can put an expiry date on jobs
[19:44:58] conceptually, there is no time limit
[19:45:11] operationally speaking?
[19:45:52] duesen: while this sounds plausible, I would love to know which change caused it, I know it is a longshot ofc
[19:46:05] duesen: Backlog is decreasing after the concurrency change btw
[19:46:11] Currently, we'd have to look at the time it usually takes from loading a page to clicking edit
[19:46:17] claime: good call
[19:46:33] claime: thank you
[19:46:47] So maybe like 5 minutes?
[19:47:11] It would probably cull some?
[19:47:16] Maybe. But... this isn't great forever. Also, more jobs will come.
[19:47:35] Every time someone views the stale page, a new job will be enqueued
[19:48:32] We could put a flag into memcached, and check it before queueing more jobs...
[20:09:24] effie: on enwiki, rc_id=1649965457 looks like it might fit the bill. It happened nearly exactly at 12:30, and it's used on 75220 pages.
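(On duesen's "flag into memcached" idea above: the sketch below shows the shape of it, using an atomic add with a short TTL so only the first view of a stale page enqueues a re-render and later views within the window are no-ops. This is Python with pymemcache purely for illustration; the real implementation would live in MediaWiki's PHP job code, and the memcached host, key format, TTL, and enqueue_parsoid_rerender() helper are all hypothetical.)

```python
from pymemcache.client.base import Client

cache = Client(("mc-host.example.wmnet", 11211))  # hypothetical memcached host

RERENDER_TTL = 300  # seconds; roughly the "maybe like 5 minutes" from the chat

def enqueue_parsoid_rerender(wiki: str, page_id: int) -> None:
    """Hypothetical stand-in for the code that pushes the job to Kafka."""
    print(f"enqueue re-render job for {wiki}:{page_id}")

def maybe_enqueue(wiki: str, page_id: int) -> bool:
    """Enqueue a re-render only if nobody queued one for this page recently.

    add() with noreply=False is atomic: it stores the key only if it does not
    already exist, so concurrent page views race safely instead of stampeding.
    """
    key = f"parsoid-rerender:{wiki}:{page_id}"  # hypothetical key format
    if cache.add(key, b"1", expire=RERENDER_TTL, noreply=False):
        enqueue_parsoid_rerender(wiki, page_id)
        return True
    return False  # a job was already queued within the TTL window

# Example: many views of the same stale page produce a single job.
for _ in range(5):
    maybe_enqueue("enwiki", 123456)
```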
[20:09:36] Though that should cause no more than 75221 jobs...
[20:10:10] Interestingly, the template wasn't edited. This is a wikidata change. I'm not even sure it would cascade to the pages that use the template
[20:10:22] Note that I only checked enwiki. Might be somewhere else.
[20:10:31] thank you!
[20:10:38] could be, could be not, sigh
[20:10:46] nothing on commons at the time...
[20:11:07] The template in question is https://en.wikipedia.org/wiki/Template%3ATick
[20:11:49] Damn ticks
[20:12:01] Ah, not these ticks
[20:16:41] Hm... interesting... there is also a much-used tick template on frwiki
[20:17:11] https://fr.wikipedia.org/wiki/Mod%C3%A8le:Fait
[20:17:18] Which also got a bump at the same time. From wikidata.
[20:17:30] Not a coincidence: https://www.wikidata.org/w/index.php?title=Q6154014&action=history
[20:18:23] But IIRC, a wikidata change that causes a template to be re-rendered doesn't cascade to pages that use that template... or maybe they do? I may even have written that code, eight years ago...
[20:18:47] But this shouldn't happen for interwiki changes.
[20:20:52] maybe it's just a red herring...
[20:26:40] duesen: let's pick this up tomorrow, the bandaid is working it looks like
[20:28:57] effie: thank you!
[20:30:59] cheers :)