[00:39:19] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/923581
[00:39:25] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/923581 (owner: 10TrainBranchBot)
[00:55:40] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/923581 (owner: 10TrainBranchBot)
[02:06:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:26:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:35:51] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 163 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:41:05] (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:41:23] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 8 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:11:05] (SwiftTooManyMediaUploads) resolved: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:55:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[03:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[04:00:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:05:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:08:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 211 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:09:05] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 155 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:10:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[04:14:09] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:14:35] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 6 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[04:34:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:45:47] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:50:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - ldap-ro_389: Servers ldap-replica1003.wikimedia.org are marked down but pooled: ldap-ro-ssl_636: Servers ldap-replica1003.wikimedia.org are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[06:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[06:12:07] !log Change innodb_fast_shutdown to 0 on db1154 before downgrading T337446
[06:12:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:12] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[06:20:50] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz) 05Resolved→03Open https://upload.wikimedia.beta.wmfla...
[06:22:50] 10SRE, 10Beta-Cluster-Infrastructure, 10Quality-and-Test-Engineering-Team (QTE), 10Traffic, and 3 others: [epic] The SSL certificate for Beta cluster domains fails to properly renew & deploy - https://phabricator.wikimedia.org/T293585 (10AlexisJazz)
[06:29:01] PROBLEM - MariaDB Replica Lag: s3 on clouddb1013 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 943.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:29:51] PROBLEM - MariaDB Replica Lag: s3 on clouddb1017 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 993.96 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[06:35:04] ^ known
[06:41:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230528T0700)
[07:00:05] Deploy window No deploys all day (Per Deployments/Yearly_calendar)! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230528T0700)
[07:10:37] jouncebot: refresh
[07:10:38] I refreshed my knowledge about deployments.
[07:10:43] jouncebot: now
[07:10:43] For the next 23 hour(s) and 49 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230528T0700)
[07:10:48] jouncebot: next
[07:10:48] In 23 hour(s) and 49 minute(s): No deploys all day (Per Deployments/Yearly_calendar)! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230529T0700)
[07:10:56] Yes that's better
[07:47:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 99 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:48:15] PROBLEM - IPv4 ping to codfw on ripe-atlas-codfw is CRITICAL: CRITICAL - failed 49 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:50:29] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:50:45] PROBLEM - OSPF status on cr3-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:54:35] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS1299/IPv4: Connect - Telia https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:55:05] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:55:23] RECOVERY - OSPF status on cr3-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[07:56:05] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 109, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[07:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:58:57] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 43 probes of 710 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[07:59:19] RECOVERY - IPv4 ping to codfw on ripe-atlas-codfw is OK: OK - failed 4 probes of 795 (alerts on 35) - https://atlas.ripe.net/measurements/32390538/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[11:07:21] I'm seeing a lot of 429 (too many requests) responses on Commons thumbnails today, with no obvious reason I should be being ratelimited.
[11:08:07] Example thumb: https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Livy_%281876%29.djvu/page9-296px-Livy_%281876%29.djvu.jpg
[11:10:09] I would not necessarily have noticed this problem for a couple of days so it could conceivably be related to last week's train.
[11:12:13] Request source would geolocate to Europe to the degree that's deterministic; but the IP should not be shared so it's unlikely I'm sharing with a bot or something.
[11:29:02] (03PS15) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:30:46] I'm now getting multiple reports on-wiki from people with similar problems.
[11:31:27] One reports increasingly slow image loads over the last few days, culminating in what appears like completely broken now.
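
A minimal sketch of the kind of spot check being described above, assuming Python with the requests library: fetch the example thumbnail from 11:08 a few times and tally the HTTP status codes. The User-Agent string and request count are illustrative assumptions, not taken from the log.

    # Sketch: probe the 11:08 example thumbnail URL and tally HTTP status codes.
    # Assumes the "requests" library; User-Agent and request count are illustrative.
    import collections
    import time
    import requests

    URL = ("https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/"
           "Livy_%281876%29.djvu/page9-296px-Livy_%281876%29.djvu.jpg")

    counts = collections.Counter()
    for _ in range(10):
        r = requests.get(URL, headers={"User-Agent": "thumb-429-check (example)"}, timeout=30)
        counts[r.status_code] += 1
        time.sleep(1)  # be gentle; this is a diagnostic, not a load test
    print(dict(counts))  # e.g. {200: 7, 429: 3} would match the symptoms reported above
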
[11:31:51] All so far are most likely coming from European IPs.
[11:31:52] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:33:03] Anybody able to do some basic sanity checks on logs / server status? Batphone time?
[11:34:28] (03PS16) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:36:37] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:40:39] (03PS17) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:42:49] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:44:48] Hmm. Problem /may/ be limited to thumbs extracted from PDF/DjVu files; which means this could be related to shellbox servers or job queue rather than Thumbor/Swift-type components.
[11:45:03] (03PS18) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:47:11] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:47:41] <_joe_> xover: thumbnailing of those files is still handled by thumbor, and I see an increase in the 429s on may 24th
[11:48:22] <_joe_> I don't think it's worth paging people over, though, because it's not a steep increase or making thumbnailing not work
[11:48:29] Indeed, scrolling through the first few hundred entries on Special:NewFiles on Commons shows all PDF/DjVu thumbs broken, but only those thumbs.
[11:48:57] Well, it leaves the core workflow on Wikisource completely broken.
[11:49:10] <_joe_> so I suspect it's a problem with thumbor
[11:49:51] (core work is transcribing book scans; without seeing page scans that's dead in the water)
[11:50:10] (03PS19) 10Slyngshede: WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506
[11:50:50] <_joe_> xover: do you think you can pinpoint exactly when that happened
[11:51:09] <_joe_> xover: yeah I get it :/ I'm trying to see if I can find out more
[11:51:21] Note that extracting individual pages from PDF/DjVu shells out to ghostscript/DjVuLibre, so I still think shellboxes is more likely than thumbor as such.
[11:51:45] Best timing info I have is: "One reports increasingly slow image loads over the last few days, culminating in what appears like completely broken now."
[11:52:18] (03CR) 10CI reject: [V: 04-1] WIP P:netbox reconfigure to used OIDC [puppet] - 10https://gerrit.wikimedia.org/r/922506 (owner: 10Slyngshede)
[11:55:39] <_joe_> xover: yeah I am saying that thumbnailing is 100% handled by thumbor
[11:55:47] <_joe_> including djvu and tiff images
[11:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[12:00:34] <_joe_> xover: can you give me an example of a broken thumbnail?
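
An aside on the page-extraction step mentioned at 11:51 and 11:55 above: rendering a single page of a DjVu file means shelling out to DjVuLibre before the raster is scaled. A rough sketch of that step follows, assuming DjVuLibre's ddjvu tool is installed; the file names and page number are placeholders, and this is not the exact command Thumbor runs.

    # Rough illustration of the DjVu page-extraction step (not Thumbor's exact invocation).
    # Assumes DjVuLibre's ddjvu binary is on PATH; paths and page number are placeholders.
    import subprocess

    subprocess.run(
        ["ddjvu", "-format=pnm", "-page=9", "Livy_(1876).djvu", "page9.pnm"],
        check=True,
        timeout=60,  # a slow or stuck extraction here surfaces upstream as broken thumbnails
    )
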
[12:00:48] Example thumb: https://upload.wikimedia.org/wikipedia/commons/thumb/6/62/Livy_%281876%29.djvu/page9-296px-Livy_%281876%29.djvu.jpg
[12:01:56] <_joe_> nevermind, found one
[12:02:11] I discovered it on https://commons.wikimedia.org/wiki/File:Livy_(1876).djvu and have been trouble loading most thumbs from that file, but some of them do load some of the time.
[12:02:31] *have been having trouble loading ...
[12:04:43] <_joe_> can you pinpoint when this happened?
[12:05:00] <_joe_> or point me to the on-wiki discussion
[12:05:09] <_joe_> I am not sure I'm able to do something about this
[12:08:55] <_joe_> yeah, I'm seeing thumbor returning consistently 429s on djvu images
[12:11:55] _joe_: https://en.wikisource.org/wiki/Wikisource:Scriptorium#ocrtoy-no-text
[12:12:12] _joe_: https://en.wikisource.org/wiki/Wikisource:Scriptorium/Help#Changes_in_the_past_week,_affect_the_Page_namespace.
[12:14:05] <_joe_> xover: ok, I think I know where the problem is; not sure how to solve it now, though. I will open a task
[12:14:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:19:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:23:20] (03PS6) 10Zabe: Change project logo for Wikimania to Wikimania 2023 version [mediawiki-config] - 10https://gerrit.wikimedia.org/r/921610 (https://phabricator.wikimedia.org/T337044) (owner: 10Robertsky)
[12:27:47] _joe_: It seems improbable to me that client request rate-limiting is what's actually happening. To me it seems much more likely that it's a timeout (extracting page images takes too long), or a CPU quota being exceeded, or an exception causing backend jobs to respawn too frequently, or... i.e. That Thumbor's 429 statuses are a follow-on symptom rather than the direct place the problem is happening.
[12:29:18] <_joe_> xover: https://phabricator.wikimedia.org/T337649 I'm putting there what I find out
[12:29:36] Thanks. And, yes, I'm watching that.
[12:29:52] <_joe_> at this point I think this is traffic induced
[12:47:41] <_joe_> we had two peaks on the 25th and today, I think mostly traffic induced. I'm not inclined to change anything right now, and I don't see a short-term solution. I'll get the people working on thumbor to look at this tomorrow as well
[13:16:45] !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: sync
[13:16:49] !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: sync
[13:17:50] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: sync
[13:18:21] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes1013.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1011.eqiad.wmnet, kubernetes1010.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[13:19:00] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: sync
[13:20:19] <_joe_> ok it was slightly terrifying for a sec but I suspect I might just solved the issue
[13:20:33] <_joe_> xover: can you check if you still see the issue?
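
For context on the 13:16 to 13:19 SAL entries above: those are helmfile syncs of the thumbor service from the deployment host. A rough sketch of what such a sync amounts to, assuming the standard helmfile CLI; the environment name and the helmfile.d path come from the log, everything else (including invoking it directly rather than through Wikimedia's deployment wrapper) is illustrative.

    # Illustrative equivalent of the 13:17 SAL entry; Wikimedia wraps this in its own tooling.
    # Assumes the standard helmfile CLI; "sync" applies the declared chart state for the environment.
    import subprocess

    subprocess.run(
        ["helmfile", "-e", "eqiad", "-f", "helmfile.d/services/thumbor", "sync"],
        check=True,
    )
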
[13:22:00] _joe_: Still somewhat slow, but I haven't seen any 429 in the first 5 thumbs I just checked. I'll keep checking.
[13:27:57] _joe_: Yeah, more testing supports 429s are gone, and performance is at least within a comparable range to normal (i.e. way too slow, but still the baseline). I've asked the other user reporting this to retest too.
[13:31:19] <_joe_> xover: i hope i've solved at least the immediate issue
[13:35:16] Thank you!
[14:06:32] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:11:32] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:16:32] (JobUnavailable) resolved: (2) Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:33:23] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:38:13] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:44:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:47:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad appserver POST/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExcee
[15:49:16] (MediaWikiHighErrorRate) resolved: (3) Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:52:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad appserver POST/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver&var-method=POST - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExc
[15:57:00] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[16:29:09] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:30:53] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[18:03:42] 10SRE, 10Wikimedia-Mailing-lists: Make wikimedia-dcw mailing list private! - https://phabricator.wikimedia.org/T337644 (10Ladsgroup) Any admin can do it themselves.
[18:25:12] 10SRE, 10Wikimedia-Mailing-lists: Make wikimedia-dcw mailing list private! - https://phabricator.wikimedia.org/T337644 (10Aklapper) 05Open→03Invalid @TheAafi: If this is about non-public archives, please proceed at https://lists.wikimedia.org/postorius/lists/wikimedia-dcw.lists.wikimedia.org/settings/archi...
[19:29:33] PROBLEM - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /
[19:29:33] edia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase
[19:29:35] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:09] PROBLEM - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[19:30:25] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:29] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:31] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:30:33] PROBLEM - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for Januar
[19:30:33] 6 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:30:37] PROBLEM - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[19:31:01] RECOVERY - Restbase LVS eqiad on restbase.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase
[19:31:03] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:31] PROBLEM - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
[19:31:53] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:31:57] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:33:33] RECOVERY - wikifeeds eqiad on wikifeeds.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds
[19:35:55] PROBLEM - Restbase root url on restbase1027 is CRITICAL: connect to address 10.64.48.183 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[19:46:25] PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
[19:57:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[23:57:01] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
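
On the 19:30 to 19:31 cassandra CQL alerts for restbase1027 above: the check is essentially asking whether a CQL session can be opened on port 9042 within a timeout. A minimal sketch of that kind of probe, assuming the DataStax cassandra-driver package; the address is the one from the 19:30:09 alert and the timeout mirrors the 10-second figure in the alert text, while the query itself is illustrative.

    # Sketch of a CQL reachability probe like the "cassandra-a CQL 10.64.48.184:9042" check above.
    # Assumes the cassandra-driver package; address and timeout are taken from the alert text.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.64.48.184"], port=9042, connect_timeout=10)
    try:
        session = cluster.connect()  # a socket timeout here is what the alert reports
        session.execute("SELECT release_version FROM system.local")
        print("OK: CQL session established")
    except Exception as exc:
        print(f"CRITICAL: {exc}")
    finally:
        cluster.shutdown()
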