[05:23:56] 10serviceops, 10All-and-every-Wikisource, 10Thumbor, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Joe) >>! In T337649#8939812, @Ladsgroup wrote: > oh I actually might know what's ha...
[06:30:40] 10serviceops, 10SRE: codfw (2) memcached host service implementation tracking - https://phabricator.wikimedia.org/T313968 (10Joe) 05In progress→03Resolved
[06:30:45] 10serviceops, 10DC-Ops, 10SRE, 10ops-codfw: Q1:rack/setup/install new codfw memcached hosts - https://phabricator.wikimedia.org/T313966 (10Joe)
[06:44:27] hi folks!
[06:44:36] the toolhub's eqiad pods are not feeling great
[06:44:36] NAME READY STATUS RESTARTS AGE
[06:44:40] toolhub-main-5557c9fc9c-7ftvl 3/4 CrashLoopBackOff 572 (5m8s ago) 38d
[06:44:43] toolhub-main-5557c9fc9c-d59s8 3/4 CrashLoopBackOff 858 (67s ago) 38d
[06:44:46] toolhub-main-crawler-28139400-7vj6v 1/2 Error 0 43m
[06:45:00] I didn't get a specific error, trying to delete one pod to see if anything changes
[06:45:06] the codfw ones are ok
[06:52:28] the crawler is now up, but the main pods fail the readiness probe
[06:55:03] ah also toolhub.wikimedia.org is down, lovely
[06:55:40] ahhh it is active/passive
[06:56:20] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10Marostegui) For what is worth, in this case, the issue was changing a dbproxy tha...
[06:58:52] from https://grafana.wikimedia.org/d/wJHvm8Ank/toolhub?orgId=1&refresh=1m&from=now-2d&to=now it seems that the issues started around the 1st at ~21:30/40 UTC
[06:59:53] I have no idea if an eqiad -> codfw failover is ok for toolhub, and since it has been down for the weekend it doesn't seem to be a huge priority, I'll wait for folks in serviceops to join this chan
[07:15:53] 10serviceops, 10MW-on-K8s: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe)
[07:27:24] 10serviceops, 10MW-on-K8s: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe) So, what I figured out from reading the apache documentation at https://httpd.apache.org/docs/current/mod/mod_log_config.html (emphasis mine) > Format Notes > For security reasons, starting...
[07:27:59] 10serviceops, 10MW-on-K8s: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe)
[07:38:53] so I see, in the crawler:
[07:38:54] MySQLdb._exceptions.OperationalError: (2002, "Can't connect to MySQL server on 'm5-master.eqiad.wmnet' (115)")
[07:42:51] <_joe_> elukey: why would we be on the hook for toolhub?
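(The crawler traceback above shows MySQL error 2002: the TCP connection to m5-master.eqiad.wmnet never completed. A quick way to tell a silent drop, which is what a missing Kubernetes NetworkPolicy egress rule looks like and what is diagnosed just below, from an active refusal is a raw socket probe. This is only a hedged diagnostic sketch, not something that was run during the incident; the port 3306 and the 5-second timeout are assumptions.)

```python
import socket

HOST = "m5-master.eqiad.wmnet"  # host taken from the crawler traceback
PORT = 3306                     # assumed: standard MySQL port

def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Attempt a TCP connect and report how it failed (or succeeded)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected: host reachable, the problem is above TCP"
    except socket.timeout:
        # Packets silently dropped -- consistent with a firewall or a
        # missing NetworkPolicy egress rule rather than a down server.
        return "timed out: packets likely dropped (policy/firewall?)"
    except ConnectionRefusedError:
        return "refused: host reachable but nothing listening on that port"
    except OSError as exc:
        return f"other OS error: {exc}"

if __name__ == "__main__":
    print(probe(HOST, PORT))
```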
[07:43:12] <_joe_> I mean it's not like "runs on k8s" -> "serviceops has to fix it"
[07:43:47] _joe_ good morning to you :)
[07:44:04] I just asked if anybody had ideas, already pinged people on #cloud
[07:44:09] <_joe_> and to answer your question: no, toolhub is eqiad only
[07:45:00] elukey: I suspect it's the same thing as in T340780, data-persistence is putting new dbproxies in service and the new hosts are not included in the k8s network policies
[07:45:33] <_joe_> yep
[07:45:43] <_joe_> better coordination would be welcome
[07:46:09] taavi: ahhh nice, lemme check
[07:46:23] <_joe_> "nice" :P
[07:48:21] created https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/934866
[07:53:10] adding the new netpolicy, let's see
[07:55:45] worked, https://toolhub.wikimedia.org/ is up
[08:07:41] _joe_: We got tagged on https://phabricator.wikimedia.org/T336627 do you know what's the job for account autocreation? Is it CentralAuthCreateLocalAccountJob?
[08:08:16] <_joe_> I would guess so
[08:08:30] yes
[08:12:37] Ok then they're right, it took ~30 minutes at that time. It's since gotten a lot better, and we added quite a few servers to the jobrunners pool
[08:12:48] Maybe since that job is time critical it should have its own lane
[08:13:17] <_joe_> claime: and/or an alert
[08:13:45] I'll check that out
[08:59:13] I don't know how pertinent it is to give it its own lane, that's supposed to be for high-traffic jobs, and it's not high-traffic, we just don't want it to wait
[09:00:08] <_joe_> it's a slippery slope
[09:00:55] <_joe_> now I'd say that we'd need an SLO for jobs, maybe with 3 different levels, and someone should decide on what latency is acceptable
[09:01:18] <_joe_> more to my point: if we want something to happen "guaranteed quick", it shouldn't be a job
[09:01:32] <_joe_> because we might have to turn off the jobqueue temporarily for any reason
[09:05:20] I know, but in the current state it is one
[09:05:37] And half an hour for account creation propagation is a bit much :/
[09:10:03] do we have stats on how long the job execution takes? I'm wondering if it would be reasonable to do the auto-creations on the initial sign-up request
[09:10:32] or hm. it's cross-wiki, so I don't think I can do that very easily
[09:27:58] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm)
[09:38:49] 10serviceops, 10All-and-every-Wikisource, 10Thumbor, 10MW-1.41-notes (1.41.0-wmf.13; 2023-06-13): Thumbor fails to render thumbnails of djvu/tiff/pdf files quite often in eqiad - https://phabricator.wikimedia.org/T337649 (10Ladsgroup) Yeah yeah, For the first 50 pages. The problem was that it was queuing a...
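(taavi's diagnosis above was that the new dbproxy hosts were not covered by the toolhub namespace's egress NetworkPolicies, and the fix was the deployment-charts patch linked at 07:48. A rough way to verify that kind of gap is to list a namespace's policies and check whether any egress ipBlock covers a given address. A speculative sketch using the kubernetes Python client; the "toolhub" namespace name and the example IP are assumptions, not values from the incident.)

```python
import ipaddress
from kubernetes import client, config

NAMESPACE = "toolhub"    # assumed namespace name
TARGET_IP = "10.64.0.1"  # hypothetical dbproxy IP, stand-in only

def egress_cidrs(namespace: str):
    """Yield (policy name, CIDR) for every egress ipBlock in the namespace's NetworkPolicies."""
    config.load_kube_config()
    netv1 = client.NetworkingV1Api()
    for pol in netv1.list_namespaced_network_policy(namespace).items:
        for rule in (pol.spec.egress or []):
            for peer in (rule.to or []):
                if peer.ip_block:
                    yield pol.metadata.name, peer.ip_block.cidr

def is_covered(namespace: str, ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    for name, cidr in egress_cidrs(namespace):
        if addr in ipaddress.ip_network(cidr):
            print(f"{ip} allowed by policy {name} ({cidr})")
            return True
    print(f"{ip} not covered by any egress rule -- traffic will be dropped")
    return False

if __name__ == "__main__":
    is_covered(NAMESPACE, TARGET_IP)
```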
[09:50:20] 10serviceops, 10Prod-Kubernetes, 10Kubernetes, 10Patch-For-Review: Remove the .Values.kubernetesApi hack - https://phabricator.wikimedia.org/T326729 (10JMeybohm)
[09:53:42] filed https://phabricator.wikimedia.org/T340950 as follow up for the toolhub incident
[09:58:43] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10fgiunchedi) + o11y-logging for visibility, the un-escaping I think will work here (not tested)
[10:08:54] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10elukey) Adding another datapoint: T340950 Toolhub uses `m5-master.eqiad.wmnet`,...
[10:11:14] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10Urbanecm_WMF) >>! In T340843#8984877, @elukey wrote: > Toolhub uses `m5-master.eq...
[10:22:25] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[10:23:06] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[10:55:27] 10serviceops, 10SRE, 10Traffic, 10envoy: Set a limit to the number of allowed active connections via runtime key overload.global_downstream_max_connections - https://phabricator.wikimedia.org/T340955 (10JMeybohm)
[12:06:17] 10serviceops, 10Kubernetes: Integrate kube-metrics-server into our infrastructure - https://phabricator.wikimedia.org/T249929 (10BTullis)
[12:46:57] 10serviceops, 10Machine-Learning-Team, 10Platform Team Initiatives (API Gateway): Review LiftWing's usage of the API Gateway - https://phabricator.wikimedia.org/T340982 (10elukey)
[12:50:03] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10akosiaris) Could you please add some more information. e.g. I am not sure what exactly makes it invalid json. e.g. ` In [1]: import json In [2]: a=open('a.json', 'r')...
[13:42:38] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10Joe) You must be doing something to the string when loading it from file, because `\\` is not valid json: ` astring = '{"a": "\\xe2"}' json.loads(astring) ... json.deco...
[13:50:15] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye
[14:13:04] 10serviceops, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Deploy wikidiff2 1.14.0 - https://phabricator.wikimedia.org/T340087 (10TheresNoTime)
[14:13:15] 10serviceops, 10Beta-Cluster-Infrastructure, 10wikidiff2, 10Better-Diffs-2023, 10Community-Tech (CommTech-Kanban): Install wikidiff2 1.14.0 deb on deployment-prep & test - https://phabricator.wikimedia.org/T340542 (10TheresNoTime) 05Open→03In progress >>! In T340542#8973199, @dom_walden wrote: > Ther...
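(To make the T340935 exchange above concrete: Apache's mod_log_config escapes non-printable bytes as \xhh, and a bare \x escape is not valid JSON, which is what Joe's snippet in the 13:42 comment shows. The sketch below reproduces the failure and one possible workaround, roughly the un-escaping fgiunchedi mentions; the example log line is made up and this is not the fix that was actually applied on the task.)

```python
import json

# A log line the way apache would write it: literal backslash-x escapes.
raw = r'{"uri": "/wiki/Foo", "ua": "curl\xe2\x80\x93test"}'

try:
    json.loads(raw)
except json.JSONDecodeError as exc:
    # "Invalid \escape": \x is not one of the escapes JSON allows.
    print(f"parse failed as expected: {exc}")

# Workaround: double every backslash, so the \xhh sequences survive
# as plain text instead of being rejected by the JSON parser.
parsed = json.loads(raw.replace("\\", "\\\\"))
print(parsed["ua"])   # -> curl\xe2\x80\x93test (escapes kept verbatim)
```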
[14:18:32] 10serviceops, 10MW-on-K8s, 10Observability-Logging: Some apache access logs are invalid json - https://phabricator.wikimedia.org/T340935 (10akosiaris) ah yeah, python will just DTRT and escape the backslashes properly
[14:25:58] 10serviceops, 10MediaWiki-extensions-CentralAuth, 10Stewards-and-global-tools, 10WMF-JobQueue, 10Patch-For-Review: Accounts taking 30+ minutes to autocreate on metawiki/loginwiki (2023-05) - https://phabricator.wikimedia.org/T336627 (10Clement_Goubert) These issues seem to have occurred before we added c...
[14:26:07] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jiji@cumin1001 for host kubestagemaster2002.codfw.wmnet with OS bullseye completed: - kube...
[15:09:52] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[15:50:37] 10serviceops, 10SRE, 10Traffic, 10envoy, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) mw and restbase canaries as well as mathoid are running 1.23.10 since today. If nothing comes up I will roll the update out to the rest of the fleet tom...
[16:29:25] 10serviceops, 10Kubernetes, 10Patch-For-Review: Add a second control-plane to wikikube staging clusters - https://phabricator.wikimedia.org/T329827 (10jijiki)
[16:33:21] 10serviceops, 10Data-Persistence: WikiKube: Investigate how to abstract misc Mariadb clusters host/ip information so that no deployment of apps is needed when a master is failed over - https://phabricator.wikimedia.org/T340843 (10akosiaris) We 've discussed this in the #serviceops meeting today. We touched on...
[18:52:41] claime, effie, akosiaris: The Job Queue is backed up by over two hours... are you aware of this?
[18:53:32] There was a massive spike in parsoid cache warming jobs being inserted at 12:30 UTC.
[18:54:41] Huh ?
[18:55:21] Nope, didn't see that, but you're lucky I was near my computer, it's 9PM here
[18:55:48] claime: yes, here too :)
[18:56:01] I was just about to head for the klaxon
[18:56:28] I just realized that the wait time for individual jobs is different. So this only affects the parsoid jobs. Still not good.
[18:57:17] yeah and it's not concurrency, or at least I don't think so, even if the graphs are mostly wrong we're nowhere near max
[18:57:35] Hm.. the refreshlinks job (the equivalent for the old parser) is backlogged by 4 hours - but it's just *always* like that?
[18:57:54] jobrunners are fine
[18:58:33] I am looking at https://grafana.wikimedia.org/d/LSeAShkGz/jobqueue?orgId=1&viewPanel=5
[18:59:22] refreshlinks has been backed up for a few days apparently
[18:59:38] it's more spiky before 30/6
[19:00:25] right
[19:00:48] Should we have alerts for these metrics? I have never used grafana alerts...
[19:00:58] you know what, I'm gonna up the concurrency again
[19:01:15] We should fix the metrics first so they're aggregatable, then we can make alerts
[19:01:19] This is in progress
[19:01:44] I thought the metrics issue only affected the concurrency, not wait time?
[19:01:53] refreshLinks is partitioned by mediawiki_database btw, and has a total concurrency of 240
[19:02:16] duesen: wait time is also a summary and not a histogram
[19:02:43] hm, ok.
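(On the "summary and not a histogram" point just above: a summary exports pre-computed quantiles per instance, and those cannot be combined into a fleet-wide quantile, whereas histogram buckets can be summed and re-quantiled, which is why the wait-time panel is hard to aggregate and to alert on. A toy illustration with made-up latency numbers; nothing here comes from the real changeprop or jobqueue metrics.)

```python
import statistics

# Made-up per-runner job wait times, in seconds.
runner_a = [1, 1, 1, 1, 1, 1, 1, 1, 1, 60]   # one very slow job
runner_b = [1] * 10                           # all fast

def p90(xs):
    """90th percentile via statistics.quantiles (inclusive method)."""
    return statistics.quantiles(xs, n=10, method="inclusive")[-1]

# What a summary gives you: one pre-computed p90 per instance.
per_instance = [p90(runner_a), p90(runner_b)]
naive_global = statistics.mean(per_instance)   # averaging quantiles is not a quantile

# What histogram buckets would allow: recomputing over all observations.
true_global = p90(runner_a + runner_b)

print(f"mean of per-instance p90s: {naive_global:.1f}s")
print(f"p90 over all jobs:         {true_global:.1f}s")   # the two disagree
```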
[19:03:49] I'll up concurrency again, see if it does anything, if not we'll look into partitioning maybe
[19:04:10] but I don't think I can look into partitioning tonight, too much I don't know
[19:08:25] in CI
[19:11:46] duesen: deployed a bump to 150
[19:21:32] we're at least processing a bit faster, but the insertion rate keeps rising so idk
[19:24:17] duesen: https://grafana.wikimedia.org/goto/PEefK1rVz?orgId=1
[19:24:22] That's a lot of messaging
[19:25:27] wow
[19:25:43] what does "messaging" mean here? what does each message correspond to?
[19:26:44] I think that's the number of kafka messages representing a job
[19:26:58] so effectively how many messages there are in backlog
[19:28:26] back-of-the-envelope calculation says we should resolve it in around 2h
[19:29:00] 1.6M messages / ~250 ops/s
[19:29:02] So, what happened in codfw that the job insertion rate suddenly jumped from less than 100 to more than 250?
[19:29:15] we have been having an elevated error rate
[19:29:21] generally speaking
[19:29:28] https://www.wikimediastatus.net/
[19:30:35] 18 errors per second?
[19:30:40] What?
[19:30:57] which is also evident here https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&viewPanel=63
[19:31:04] ok something else is going on
[19:31:57] There's nothing in SAL around the time of that rise
[19:32:29] ok let's see NEL and logstash
[19:32:36] looking at logstash
[19:32:38] I mean, NEL
[19:32:43] looking at NEL
[19:33:03] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&var-site=All&var-cluster=appserver&var-method=GET&var-code=200&var-php_version=All&from=now-12h&to=now&viewPanel=63
[19:33:09] I don't see much here
[19:33:33] big exception spike around 1550
[19:33:38] There is an increase and it appears to coincide, but it's not 18 errors per second
[19:33:40] [{reqId}] {exception_url} JobQueueError: Could not enqueue jobs
[19:33:54] and then nothing much on mw backend errors
[19:35:39] still, a lot of jobqueue errors
[19:38:30] akosiaris: it could be a side effect rather than a coincidence
[19:43:20] I think what we are seeing may be a stampede effect triggered when a template changes that is used on pages that are viewed a lot.
[19:43:56] When a "stale" page is viewed, we put a job on the queue to re-render it. And while we wait for that to happen, we put more jobs on the queue.
[19:44:27] And the more jobs we put into the queue, the longer it takes for things to get rendered, so we end up putting more jobs on the queue...
[19:44:35] duesen: How long should a re-render job be considered "valid"?
[19:44:41] We can put an expiry date on jobs
[19:44:58] conceptually, there is no time limit
[19:45:11] operationally speaking?
[19:45:52] duesen: while this sounds plausible, I would love to know which change caused it, I know it is a longshot ofc
[19:46:05] duesen: Backlog is decreasing after the concurrency change btw
[19:46:11] Currently, we'd have to look at the time it usually takes from loading a page to clicking edit
[19:46:17] claime: good call
[19:46:33] claime: thank you
[19:46:47] So maybe like 5 minutes?
[19:47:11] It would probably cull some?
[19:47:16] Maybe. But... this isn't great forever. Also, more jobs will come.
[19:47:35] Every time someone views the stale page, a new job will be enqueued
[19:48:32] We could put a flag into memcached, and check it before queueing more jobs...
[20:09:24] effie: on enwiki, rc_id=1649965457 looks like it might fit the bill. It happened nearly exactly at 12:30, and it's used on 75220 pages.
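(On duesen's "flag into memcached" idea above: the sketch below shows the shape of it, using an atomic add with a short TTL so only the first view of a stale page enqueues a re-render and later views within the window are no-ops. This is Python with pymemcache purely for illustration; the real implementation would live in MediaWiki's PHP job code, and the memcached host, key format, TTL, and enqueue_parsoid_rerender() helper are all hypothetical.)

```python
from pymemcache.client.base import Client

cache = Client(("mc-host.example.wmnet", 11211))  # hypothetical memcached host

RERENDER_TTL = 300  # seconds; roughly the "maybe like 5 minutes" from the chat

def enqueue_parsoid_rerender(wiki: str, page_id: int) -> None:
    """Hypothetical stand-in for the code that pushes the job to Kafka."""
    print(f"enqueue re-render job for {wiki}:{page_id}")

def maybe_enqueue(wiki: str, page_id: int) -> bool:
    """Enqueue a re-render only if nobody queued one for this page recently.

    add() with noreply=False is atomic: it stores the key only if it does not
    already exist, so concurrent page views race safely instead of stampeding.
    """
    key = f"parsoid-rerender:{wiki}:{page_id}"  # hypothetical key format
    if cache.add(key, b"1", expire=RERENDER_TTL, noreply=False):
        enqueue_parsoid_rerender(wiki, page_id)
        return True
    return False  # a job was already queued within the TTL window

# Example: many views of the same stale page produce a single job.
for _ in range(5):
    maybe_enqueue("enwiki", 123456)
```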
[20:09:36] Though that should cause no more than 75221 jobs...
[20:10:10] Interestingly, the template wasn't edited. This is a wikidata change. I'm not even sure it would cascade to the pages that use the template
[20:10:22] Note that I only checked enwiki. Might be somewhere else.
[20:10:31] thank you!
[20:10:38] could be, could be not, sigh
[20:10:46] nothing on commons at the time...
[20:11:07] The template in question is https://en.wikipedia.org/wiki/Template%3ATick
[20:11:49] Damn ticks
[20:12:01] Ah, not these ticks
[20:16:41] Hm... interesting... there is also a much-used tick template on frwiki
[20:17:11] https://fr.wikipedia.org/wiki/Mod%C3%A8le:Fait
[20:17:18] Which also got a bump at the same time. From wikidata.
[20:17:30] Not a coincidence: https://www.wikidata.org/w/index.php?title=Q6154014&action=history
[20:18:23] But IIRC, a wikidata change that causes a template to be re-rendered doesn't cascade to pages that use that template... or maybe they do? I may even have written that code, eight years ago...
[20:18:47] But this shouldn't happen for interwiki changes.
[20:20:52] maybe it's just a red herring...
[20:26:40] duesen: let's pick this up tomorrow, the bandaid is working it looks like
[20:28:57] effie: thank you!
[20:30:59] cheers :)