[00:39:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/923581
[00:39:25] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/923581 (owner: 10TrainBranchBot)
[00:55:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/923581 (owner: 10TrainBranchBot)
[02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:51] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 163 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:41:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:41:23] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 8 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:11:05] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:55:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:05:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:08:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 211 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:09:05] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 155 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:10:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:14:09] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:14:35] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:34:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ldap-ro_389: Servers ldap-replica1003.wikimedia.org are marked down but pooled: ldap-ro-ssl_636: Servers ldap-replica1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:12:07] !log Change innodb_fast_shutdown to 0 on db1154 before downgrading T337446
[06:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:12] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[06:20:50] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) 05Resolved→03Open https://upload.wikimedia.beta.wmfla...
[06:22:50] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz)
[06:29:01] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 943.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:29:51] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 993.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:35:04] ^ known
[06:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230528T0700)
[07:00:05] Deploy window No deploys all day (Per Deployments/Yearly_calendar)! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230528T0700)
[07:10:37] jouncebot: refresh
[07:10:38] I refreshed my knowledge about deployments.
[07:10:43] jouncebot: now
[07:10:43] For the next 23 hour(s) and 49 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230528T0700)
[07:10:48] jouncebot: next
[07:10:48] In 23 hour(s) and 49 minute(s): No deploys all day (Per Deployments/Yearly_calendar)! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230529T0700)
[07:10:56] Yes that's better
[07:47:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 99 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:48:15] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:50:29] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:50:45] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:54:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:23] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:05] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:59:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:07:21] I'm seeing a lot of 429 (too many requests) responses on Commons thumbnails today, with no obvious reason I should be being ratelimited.
[11:08:07] Example thumb: https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Livy_%281876%29.djvu/page9-296px-Livy_%281876%29.djvu.jpg
[11:10:09] I would not necessarily have noticed this problem for a couple of days so it could conceivably be related to last week's train.
[11:12:13] Request source would geolocate to Europe to the degree that's deterministic; but the IP should not be shared so it's unlikely I'm sharing with a bot or something.
[11:29:02] (03PS15) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:30:46] I'm now getting multiple reports on-wiki from people with similar problems.
[11:31:27] One reports increasingly slow image loads over the last few days, culminating in what appears like completely broken now.
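
A minimal sketch of the kind of spot check being described above, assuming Python with the requests library: fetch the example thumbnail from 11:08 a few times and tally the HTTP status codes. The User-Agent string and request count are illustrative assumptions, not taken from the log.

    # Sketch: probe the 11:08 example thumbnail URL and tally HTTP status codes.
    # Assumes the "requests" library; User-Agent and request count are illustrative.
    import collections
    import time
    import requests

    URL = ("https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/"
           "Livy_%281876%29.djvu/page9-296px-Livy_%281876%29.djvu.jpg")

    counts = collections.Counter()
    for _ in range(10):
        r = requests.get(URL, headers={"User-Agent": "thumb-429-check (example)"}, timeout=30)
        counts[r.status_code] += 1
        time.sleep(1)  # be gentle; this is a diagnostic, not a load test
    print(dict(counts))  # e.g. {200: 7, 429: 3} would match the symptoms reported above
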
[11:31:51] All so far are most likely coming from European IPs.
[11:31:52] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:33:03] Anybody able to do some basic sanity checks on logs / server status? Batphone time?
[11:34:28] (03PS16) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:36:37] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:40:39] (03PS17) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:42:49] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:44:48] Hmm. Problem /may/ be limited to thumbs extracted from PDF/DjVu files; which means this could be related to shellbox servers or job queue rather than Thumbor/Swift-type components.
[11:45:03] (03PS18) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:47:11] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:47:41] <_joe_> xover: thumbnailing of those files is still handled by thumbor, and I see an increase in the 429s on may 24th
[11:48:22] <_joe_> I don't think it's worth paging people over, though, because it's not a steep increase or making thumbnailing not work
[11:48:29] Indeed, scrolling through the first few hundred entries on Special:NewFiles on Commons shows all PDF/DjVu thumbs broken, but only those thumbs.
[11:48:57] Well, it leaves the core workflow on Wikisource completely broken.
[11:49:10] <_joe_> so I suspect it's a problem with thumbor
[11:49:51] (core work is transcribing book scans; without seeing page scans that's dead in the water)
[11:50:10] (03PS19) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:50:50] <_joe_> xover: do you think you can pinpoint exactly when that happened
[11:51:09] <_joe_> xover: yeah I get it :/ I'm trying to see if I can find out more
[11:51:21] Note that extracting individual pages from PDF/DjVu shells out to ghostscript/DjVuLibre, so I still think shellboxes is more likely than thumbor as such.
[11:51:45] Best timing info I have is: "One reports increasingly slow image loads over the last few days, culminating in what appears like completely broken now."
[11:52:18] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:55:39] <_joe_> xover: yeah I am saying that thumbnailing is 100% handled by thumbor
[11:55:47] <_joe_> including djvu and tiff images
[11:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:00:34] <_joe_> xover: can you give me an example of a broken thumbnail?
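
An aside on the page-extraction step mentioned at 11:51 and 11:55 above: rendering a single page of a DjVu file means shelling out to DjVuLibre before the raster is scaled. A rough sketch of that step follows, assuming DjVuLibre's ddjvu tool is installed; the file names and page number are placeholders, and this is not the exact command Thumbor runs.

    # Rough illustration of the DjVu page-extraction step (not Thumbor's exact invocation).
    # Assumes DjVuLibre's ddjvu binary is on PATH; paths and page number are placeholders.
    import subprocess

    subprocess.run(
        ["ddjvu", "-format=pnm", "-page=9", "Livy_(1876).djvu", "page9.pnm"],
        check=True,
        timeout=60,  # a slow or stuck extraction here surfaces upstream as broken thumbnails
    )
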
[12:00:48] Example thumb: https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Livy_%281876%29.djvu/page9-296px-Livy_%281876%29.djvu.jpg
[12:01:56] <_joe_> nevermind, found one
[12:02:11] I discovered it on https://commons.wikimedia.org/wiki/File:Livy_(1876).djvu and have been trouble loading most thumbs from that file, but some of them do load some of the time.
[12:02:31] *have been having trouble loading ...
[12:04:43] <_joe_> can you pinpoint when this happened?
[12:05:00] <_joe_> or point me to the on-wiki discussion
[12:05:09] <_joe_> I am not sure I'm able to do something about this
[12:08:55] <_joe_> yeah, I'm seeing thumbor returning consistently 429s on djvu images
[12:11:55] _joe_: https://en.wikisource.org/wiki/Wikisource:Scriptorium#ocrtoy-no-text
[12:12:12] _joe_: https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Changes_in_the_past_week,_affect_the_Page_namespace.
[12:14:05] <_joe_> xover: ok, I think I know where the problem is; not sure how to solve it now, though. I will open a task
[12:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:23:20] (03PS6) 10Zabe: Change project logo for Wikimania to Wikimania 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (https://phabricator.wikimedia.org/T337044) (owner: 10Robertsky)
[12:27:47] _joe_: It seems improbable to me that client request rate-limiting is what's actually happening. To me it seems much more likely that it's a timeout (extracting page images takes too long), or a CPU quota being exceeded, or an exception causing backend jobs to respawn too frequently, or... i.e. That Thumbor's 429 statuses are a follow-on symptom rather than the direct place the problem is happening.
[12:29:18] <_joe_> xover: https://phabricator.wikimedia.org/T337649 I'm putting there what I find out
[12:29:36] Thanks. And, yes, I'm watching that.
[12:29:52] <_joe_> at this point I think this is traffic induced
[12:47:41] <_joe_> we had two peaks on the 25th and today, I think mostly traffic induced. I'm not inclined to change anything right now, and I don't see a short-term solution. I'll get the people working on thumbor to look at this tomorrow as well
[13:16:45] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[13:16:49] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[13:17:50] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[13:18:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes1013.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1010.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:19:00] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[13:20:19] <_joe_> ok it was slightly terrifying for a sec but I suspect I might just solved the issue
[13:20:33] <_joe_> xover: can you check if you still see the issue?
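
For context on the 13:16 to 13:19 SAL entries above: those are helmfile syncs of the thumbor service from the deployment host. A rough sketch of what such a sync amounts to, assuming the standard helmfile CLI; the environment name and the helmfile.d path come from the log, everything else (including invoking it directly rather than through Wikimedia's deployment wrapper) is illustrative.

    # Illustrative equivalent of the 13:17 SAL entry; Wikimedia wraps this in its own tooling.
    # Assumes the standard helmfile CLI; "sync" applies the declared chart state for the environment.
    import subprocess

    subprocess.run(
        ["helmfile", "-e", "eqiad", "-f", "helmfile.d/services/thumbor", "sync"],
        check=True,
    )
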
[13:22:00] _joe_: Still somewhat slow, but I haven't seen any 429 in the first 5 thumbs I just checked. I'll keep checking.
[13:27:57] _joe_: Yeah, more testing supports 429s are gone, and performance is at least within a comparable range to normal (i.e. way too slow, but still the baseline). I've asked the other user reporting this to retest too.
[13:31:19] <_joe_> xover: i hope i've solved at least the immediate issue
[13:35:16] Thank you!
[14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:44:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver POST/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee
[15:49:16] (MediaWikiHighErrorRate) resolved: (3) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver POST/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc
[15:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:29:09] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:03:42] 10SRE, 10Wikimedia-Mailing-lists: Make wikimedia-dcw mailing list private! - https://phabricator.wikimedia.org/T337644 (10Ladsgroup) Any admin can do it themselves.
[18:25:12] 10SRE, 10Wikimedia-Mailing-lists: Make wikimedia-dcw mailing list private! - https://phabricator.wikimedia.org/T337644 (10Aklapper) 05Open→03Invalid @TheAafi: If this is about non-public archives, please proceed at https://lists.wikimedia.org/postorius/lists/wikimedia-dcw.lists.wikimedia.org/settings/archi...
[19:29:33] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /
[19:29:33] edia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:29:35] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:09] PROBLEM - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[19:30:25] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:29] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:31] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:33] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for Januar
[19:30:33] 6 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:30:37] PROBLEM - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[19:31:01] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:31:03] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:31] PROBLEM - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[19:31:53] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:57] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:33] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:35:55] PROBLEM - Restbase root url on restbase1027 is CRITICAL: connect to address 10.64.48.183 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[19:46:25] PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:57:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:57:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
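
On the 19:30 to 19:31 cassandra CQL alerts for restbase1027 above: the check is essentially asking whether a CQL session can be opened on port 9042 within a timeout. A minimal sketch of that kind of probe, assuming the DataStax cassandra-driver package; the address is the one from the 19:30:09 alert and the timeout mirrors the 10-second figure in the alert text, while the query itself is illustrative.

    # Sketch of a CQL reachability probe like the "cassandra-a CQL 10.64.48.184:9042" check above.
    # Assumes the cassandra-driver package; address and timeout are taken from the alert text.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.64.48.184"], port=9042, connect_timeout=10)
    try:
        session = cluster.connect()  # a socket timeout here is what the alert reports
        session.execute("SELECT release_version FROM system.local")
        print("OK: CQL session established")
    except Exception as exc:
        print(f"CRITICAL: {exc}")
    finally:
        cluster.shutdown()
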