[00:04:19] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:20:31] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:39] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:14] (PS1) Stang: logos: Error handling if pngquant's output is larger than original [mediawiki-config] - https://gerrit.wikimedia.org/r/894129 (https://phabricator.wikimedia.org/T331177) [01:10:19] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 85781 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [01:12:09] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4044 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 388070 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:28:49] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 84671 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [01:30:39] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4044 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 386960 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [01:41:39] (PS3) Andrew Bogott: OpenStack: collapse 'user' OpenStack role into 'reader' role [puppet] - 
https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759) [01:43:09] (CR) Andrew Bogott: OpenStack: collapse 'user' OpenStack role into 'reader' role (2 comments) [puppet] - https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [01:43:13] (CR) Andrew Bogott: [C: +2] OpenStack: collapse 'user' OpenStack role into 'reader' role [puppet] - https://gerrit.wikimedia.org/r/893545 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [01:51:46] (CR) Andrew Bogott: [C: +2] OpenStack: rename 'projectadmin' role to 'member' role [puppet] - https://gerrit.wikimedia.org/r/893759 (https://phabricator.wikimedia.org/T330759) (owner: Andrew Bogott) [01:51:54] (PS2) Andrew Bogott: OpenStack: rename 'projectadmin' role to 'member' role [puppet] - https://gerrit.wikimedia.org/r/893759 (https://phabricator.wikimedia.org/T330759) [02:09:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:15:46] (CR) Krinkle: [C: +1] doc: Relax CSP rules for taint-check-demo (1 comment) [puppet] - https://gerrit.wikimedia.org/r/680337 (https://phabricator.wikimedia.org/T257301) (owner: Daimona Eaytoy) [02:20:42] SRE, Data-Persistence, serviceops, Datacenter-Switchover, and 2 others: March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T327920 (Tgr) >>! In T327920#8652181, @akosiaris wrote: > And if the are being used as is from that page, that is. > > * e.g. `GET https://wikimedia.org/ap... 
[02:24:45] (JobUnavailable) firing: (2) Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:33:09] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 80810 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [02:36:51] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4044 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 382989 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [02:46:01] PROBLEM - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [02:47:45] RECOVERY - Cxserver LVS eqiad on cxserver.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [03:01:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:01:30] (Primary outbound port utilisation over 80% #page) firing: Alert for device cr2-codfw.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:06:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:06:30] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-codfw.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [03:09:25] PROBLEM - Check systemd state on mwlog2002 is CRITICAL: CRITICAL - degraded: The following units failed: mw-log-cleanup.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:59:27] SRE-swift-storage, MediaWiki-File-management: `Filebackend::Multiwrite`, multi-dc and thumbnail handling - https://phabricator.wikimedia.org/T331138 (Peachey88) [05:49:59] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 118 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:07:19] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4040 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 67960 seconds left 
https://wikitech.wikimedia.org/wiki/HTTPS [06:09:11] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4040 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 370248 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:09:39] !log started rsync of xmldatadumps/public from dumpsdata1001 in screen session as ariel on that host, to dumpsdata1006, no bandwidth cap [06:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:33] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 67226 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [06:21:25] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 369514 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [06:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:16:59] PROBLEM - WDQS SPARQL on wdqs1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:17:07] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 63772 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [07:18:57] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 366062 seconds left:Certificate 
wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:07:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (phaultfinder) [08:16:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:20:03] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 59996 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [08:21:55] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 362284 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [08:21:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:47:45] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4049 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [08:49:35] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4049 is OK: SSL OK - OCSP staple validity for wikipedia.org has 259824 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 80 
days) https://wikitech.wikimedia.org/wiki/HTTPS [08:52:29] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 119 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [09:23:39] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4043 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 56181 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [09:25:29] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4043 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 358470 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/HTTPS [09:25:39] RECOVERY - Check systemd state on dumpsdata1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:09] PROBLEM - Check systemd state on dumpsdata1006 is CRITICAL: CRITICAL - degraded: The following units failed: cleanup_xmldumps.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:00:03] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:05:37] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:09:07] PROBLEM - SSH on bast6002 is CRITICAL: Server answer: Exceeded MaxStartups https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:10:59] RECOVERY - SSH on bast6002 is OK: SSH OK - 
OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:30:33] SRE, Maps, Product-Infrastructure-Team-Backlog-Deprecated, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (C933103) Is it an intended effect of this ticket or is it a bug that now when I access https://maps.wikimedia.org/ from browser direct... [13:05:40] SRE, Maps, Product-Infrastructure-Team-Backlog-Deprecated, Traffic: Limit maps serving to Wikimedia hosted sites only - https://phabricator.wikimedia.org/T261424 (Aklapper) @C933103: This ticket has been closed for 29 months. Please file separate tickets for separate topics. Thanks. (Plus I canno... [13:39:19] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [13:41:11] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 238729 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [14:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:31:44] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: (no justification provided) [14:32:30] (Traffic bill over quota) firing: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:32:30] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: (no justification provided) (duration: 00m 
46s) [14:35:53] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: Updating member dashboard to reflect new role names -- T330759 [14:35:59] T330759: Modernize openstack rbac - https://phabricator.wikimedia.org/T330759 [14:44:50] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: Updating member dashboard to reflect new role names -- T330759 (duration: 08m 56s) [14:44:56] T330759: Modernize openstack rbac - https://phabricator.wikimedia.org/T330759 [14:52:30] (Traffic bill over quota) resolved: Alert for device cr1-eqiad.wikimedia.org - Traffic bill over quota got better - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:53:56] !log andrew@deploy1002 Started deploy [horizon/deploy@9d02cd6]: Updating member dashboard to reflect new role names -- T330759 [14:54:02] T330759: Modernize openstack rbac - https://phabricator.wikimedia.org/T330759 [14:56:10] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:56:13] !log andrew@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: Updating member dashboard to reflect new role names -- T330759 (duration: 02m 17s) [14:56:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:57:50] With phone [14:59:04] andrewbogott: I think you the deploy paged around 40 or so ppl [14:59:20] The deploy you did* [14:59:26] I was about to say [14:59:30] Dammit [14:59:37] it succeeded, why did it page? 
[14:59:48] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:54] And is there anything I need to do to ack? [15:00:02] <_joe_> !incidents [15:00:03] 3445 (UNACKED) ProbeDown (10.2.2.40 ip4 labweb-ssl:7443 probes/service http_labweb-ssl_ip4 eqiad) [15:00:03] 3444 (RESOLVED) Primary outbound port utilisation over 80% (paged) global (cr2-codfw.wikimedia.org) [15:00:07] <_joe_> !ack 3445 [15:00:07] 3445 (ACKED) ProbeDown (10.2.2.40 ip4 labweb-ssl:7443 probes/service http_labweb-ssl_ip4 eqiad) [15:00:14] <_joe_> andrewbogott: done :) [15:00:20] thank you, and very sorry for the noise [15:01:18] (ProbeDown) resolved: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:01:49] So it wasn't the deploy that paged, it was some intermittent step where lvm saw a down host [15:01:51] ? [15:02:13] I guess we can want until after the weekend to post-mortem :/ [15:02:17] *wait [15:02:59] Sure, but maybe avoid actions that can lead to paging 40 ppl during weekends? [15:03:19] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [15:04:11] akosiaris: that's fair, although the fact that that service is capable of paging people at all surprises me. I will try to figure out how to limit who hears about that service. [15:04:27] We good then ? 
[15:04:47] yeah, there is definitely no actual problem [15:05:03] k, back to beatmaking [15:05:09] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 237290 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:08:03] andrewbogott: it would be nice if you could do that, my thanks. [15:08:45] noteably it didn't page /me/ so something is definitely askew :/ [15:09:55] Wow, that's wrong definitely, we need to fix that [15:11:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:15:03] T331197 [15:15:04] T331197: Horizon/lvm alerts the wrong people (and also is generally too sensitive) - https://phabricator.wikimedia.org/T331197 [15:15:19] (PS1) Zabe: Add logo for azwikimedia and vewikimedia [mediawiki-config] - https://gerrit.wikimedia.org/r/894143 (https://phabricator.wikimedia.org/T331177) [15:15:51] hm, except lvs and not lvm [15:16:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:23:30] (CR) Zabe: [C: +2] "Thanks!" 
[mediawiki-config] - https://gerrit.wikimedia.org/r/894129 (https://phabricator.wikimedia.org/T331177) (owner: Stang) [15:24:12] (Merged) jenkins-bot: logos: Error handling if pngquant's output is larger than original [mediawiki-config] - https://gerrit.wikimedia.org/r/894129 (https://phabricator.wikimedia.org/T331177) (owner: Stang) [17:27:39] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [17:29:31] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4044 is OK: SSL OK - OCSP staple validity for wikipedia.org has 228629 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [18:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:58:37] PROBLEM - HAProxy HTTPS wikipedia.org RSA on cp4037 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [19:02:21] RECOVERY - HAProxy HTTPS wikipedia.org RSA on cp4037 is OK: SSL OK - OCSP staple validity for wikipedia.org has 219458 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2023-05-24 07:09:36 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:01:53] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4039 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 57487 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [20:05:35] RECOVERY - HAProxy HTTPS 
wikiworkshop.org RSA on cp4039 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 489263 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 25 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:41:23] PROBLEM - HAProxy HTTPS wikiworkshop.org ECDSA on cp4041 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 15517 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [20:43:09] RECOVERY - HAProxy HTTPS wikiworkshop.org ECDSA on cp4041 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 317811 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (ECDSA) valid until 2023-03-30 14:08:29 +0000 (expires in 25 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:44:19] PROBLEM - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:did not receive the required stapled OCSP response https://wikitech.wikimedia.org/wiki/HTTPS [20:46:05] RECOVERY - HAProxy HTTPS wikipedia.org ECDSA on cp4039 is OK: SSL OK - OCSP staple validity for wikipedia.org has 216834 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2023-05-24 08:07:08 +0000 (expires in 80 days) https://wikitech.wikimedia.org/wiki/HTTPS [20:47:37] PROBLEM - Query Service HTTP Port on wdqs1006 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 298 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:02:21] PROBLEM - HAProxy HTTPS wikiworkshop.org RSA on cp4040 is CRITICAL: SSL CRITICAL - OCSP staple validity for wikiworkshop.org has 50259 seconds left https://wikitech.wikimedia.org/wiki/HTTPS [22:02:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T330317 (phaultfinder) [22:04:13] RECOVERY - HAProxy HTTPS wikiworkshop.org RSA 
on cp4040 is OK: SSL OK - OCSP staple validity for wikiworkshop.org has 482146 seconds left:Certificate wikiworkshop.org contains all required SANs:Certificate wikiworkshop.org (RSA) valid until 2023-03-30 14:08:36 +0000 (expires in 25 days) https://wikitech.wikimedia.org/wiki/HTTPS [22:24:45] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable