[00:03:56] PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: discard_held_messages.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:12] (PS1) Jbond: django-sso: improve debug page [puppet] - https://gerrit.wikimedia.org/r/869857
[00:07:11] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[00:08:51] (PS5) Jbond: Add a Puppetfile to track vendored modules [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[00:09:08] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[00:10:44] (PS3) Jbond: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[00:26:47] (CR) Jbond: Add vendored module bodgit/puppet-postfix (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[00:46:28] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:14] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:23] SRE, Traffic, Performance-Team (Radar): Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434 (Krinkle)
[01:40:45] (JobUnavailable) firing: (9) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:45] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:02] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:20] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:48] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:18] RECOVERY - Check systemd state on thanos-fe1001 is OK:
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:06] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:18] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:25] (PS1) PipelineBot: mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/869260
[05:21:54] (CR) CI reject: [V: -1] mathoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/869260 (owner: PipelineBot)
[05:23:32] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:02] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:20] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:50] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:42] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_tegola:prod.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in
analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:29:02] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:40] PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:20] (PS1) Marostegui: analytics_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/869867 (https://phabricator.wikimedia.org/T325154)
[06:34:59] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:38:39] (CR) Marostegui: [C: +2] analytics_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - https://gerrit.wikimedia.org/r/869867 (https://phabricator.wikimedia.org/T325154) (owner: Marostegui)
[06:39:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:43:05] (PS1) KartikMistry: WIP: Enable Content Translation/Section Translation on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/870080 (https://phabricator.wikimedia.org/T325714)
[07:57:48] PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:33] (CR) Slyngshede: [C: +2] C:ldap::management use bitu-ldap from add-ldap-group [puppet] - https://gerrit.wikimedia.org/r/869824 (owner: Slyngshede)
[08:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221221T0800)
[08:06:54] PROBLEM - SSH on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:50] PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:34] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/869776 (owner: Slyngshede)
[08:23:30] PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:26:50] PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:28:42] PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[08:28:46] PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:19] !log Downtiming wdqs 20[09-12] until 2023-01-02 (these are new hosts not yet properly brought into service)
[08:32:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries.
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:47:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:51:04] RECOVERY - Check systemd state on wdqs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:12] (CR) Elukey: sre.discovery.service-route: refactor to base/runner classes (3 comments) [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[08:51:36] RECOVERY - SSH on wdqs2009 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:53] (PS2) Elukey: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677)
[08:51:55] (PS5) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[08:51:57] (PS4) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[08:53:17] (CR) Elukey: "Thanks for the review!"
[cookbooks] - https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[08:53:44] RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:53:50] (PS3) Elukey: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677)
[08:53:52] (PS6) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[08:53:54] (PS5) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[08:55:23] (CR) Elukey: [C: +2] sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[08:57:08] RECOVERY - Check systemd state on wdqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:17] (Merged) jenkins-bot: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[09:00:18] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:14] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:02:03] Emperor: I think we're going to get paged soon
[09:02:16] (PS7) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[09:02:18] (PS6) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[09:02:24] (CR) Elukey: sre.discovery.service-route: refactor to base/runner classes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[09:03:18] (ProbeDown) firing: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:03:52] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 5810 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:03:57] there we go
[09:04:01] jayme: indeed so :-/
[09:04:12] acked
[09:05:04] slightly odd set of things to be alerting
[09:05:17] checking shellbox
[09:05:50] big request spike in eqiad
[09:06:00] +70req/s
[09:06:22] RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:07:36] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6935 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:08:12] !log increasing replicas of
shellbox-syntaxhighlight from 12 to 50
[09:08:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:18] (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:38] jayme: https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox&var-release=main&from=now-3h&to=now isn't showing me a request spike, where should I be looking?
[09:09:02] Emperor: select shellbox-syntaxhighlight at the top
[09:09:08] there are a bunch of shellboxes
[09:09:12] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:09:17] !log correction: increasing replicas of shellbox-syntaxhighlight from 12 to 40
[09:09:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:16] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:10:18] (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:10:19] but looks like the request volume came back down on its own again before I scaled up
[09:11:01] (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily
backlogged with 209k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:12:23] jayme: reqs still on the high side per grafana; worth trying to find out why, or see if it subsides given we're now managing to service those requests?
[09:13:15] I think it's worth it finding out what's happening
[09:16:01] (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 209k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:16:22] the CirrusSearchJobQueueBacklogTooBig alert is related to mw jobrunners and job processing times are higher (mean from 300ms to almost 1s) so not specific to api_server perhaps?
[09:16:38] job times are decreasing now
[09:16:51] Hm, went looking in logstash for kubernetes.namespace_name:"shellbox-syntaxhighlight" but that's not actually any use because it filters out the 200s
[09:16:52] dcausse: thanks
[09:17:53] Emperor: I had assumed an edit spike
[09:19:41] (PS7) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[09:20:00] (CR) Elukey: "Thanks a lot for the review :)" [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[09:21:36] (CR) CI reject: [V: -1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[09:22:09] jayme: would sound plausible, but our edits/s graph looks unremarkable
[09:22:46] indeed
[09:22:49] Shellbox has had an issue before where preview has been reparsing too often
[09:23:20] I believe it's long fixed but not only edits will trigger it
[09:24:48] (PS8) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[09:24:58] RhinosF1: right
[09:25:31] Emperor: the big spike seems to have originated from the jobrunners https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=shellbox-syntaxhighlight
[09:26:03] so probably related to the cirrussearch backlog after all
[09:27:01] this cirrus backlog correlates with page re-renders (edits+template change propagation)
[09:28:11] makes sense
[09:29:02] possibly a popular template with some syntaxhighlighting tag got edited?
[09:29:23] api servers also issued way more shellbox requests during that period...that seems kind of unexpected as jobrunners should be handling those requests themselves, no?
[09:29:39] dcausse: maybe. no idea how to figure that out tbh
[09:34:56] Emperor: we still have quite elevated latency in eqiad according to https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-3h&to=now
[09:35:55] might as well be europe getting up though
[09:37:17] jayme: yeah, if you expand to last 12 or 24 h, it's not outside our normal range
[09:37:24] ack
[09:38:34] So beyond the slightly vexing question of what caused the spike (d.causse's theory seems sound but I've no idea where we'd find it), I think we're good again
[09:41:28] well, syntaxhighlight requests are still above normal rate. maybe that's still jobrunner backlog getting processed...
[09:41:42] RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:46:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:49:12] backlog (at least the cirrus one is absorbed now) https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=eqiad.mediawiki.job.cirrusSearchLinksUpdate&var-consumer_group=All but p99 of this job are great tho (flat around 40sec perhaps hitting a timeout?)
[09:49:23] s/great/not great/
[09:51:07] (CR) Jaime Nuche: [C: +1] admin: create new group deployment-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[09:51:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:53:56] hmm
[09:54:46] I'm going to scale shellbox back down to 12 for now (as that side effect seems fine)
[09:55:17] !log scaling shellbox-syntaxhighlight back to 12 replicas
[09:55:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:22] seems sensible
[10:01:18] dcausse: I'm not super familiar but from the envoy metrics of jobrunners it seems that wdqs is kinda slow
[10:01:28] and responding with more errors than usual
[10:01:41] https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=wdqs-internal&from=now-3h&to=now
[10:07:45] jayme: indeed... looking
[10:08:03] <3
[10:10:04] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:58] jayme: expanding the time range it appears to be bit more usual https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=wdqs-internal&from=now-2d&to=now
[10:13:24] reason might be related to search usage by expert users (e.g.
searching for deepcat:A_Category will call wdqs-internal and possibly traverse a huge category graph that might timeout)
[10:14:54] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:17] hm scratch this idea it's from jobrunners so it's related to wikidata constraint checks
[10:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:30] it seems the p99 job runner duration has declined as well. It's also a 1h max, so it probably stays at its max for 60min even if there was a decline (at least that's what I understand)
[10:23:17] oh right, makes sense
[10:24:13] Okay. Let's call it closed then. We should still create an incident report as something like that is bound to happen again
[10:24:16] Emperor: I have to run a quick errand, no longer than 15min
[10:26:03] Lucas_WMDE: do you know if we collect some metrics regarding wikidata constraint checks (esp. the job constraintRunCheck which I believe talks to wdqs-internal)?
[10:26:45] j.ayme: ack
[10:27:10] dcausse: let me see
[10:27:50] https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&refresh=30s might have some useful metrics
[10:27:57] especially the SPARQL section, I guess
[10:28:01] nothing urgent but just wondering if we should worry about the weird patterns we see here: https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=wdqs-internal&from=now-2d&to=now
[10:28:03] Lucas_WMDE: yes
[10:28:33] looks like there were enough queries to get WBQC throttled
[10:29:11] which doesn’t usually seem to happen https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&refresh=30s&viewPanel=26&from=now-90d&to=now
[10:30:23] I think I might file a task something seems to degrade
[10:30:29] nothing stands out in the wbcheckconstraints API requests though https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&from=now-2d&to=now&var-metric=p95&var-module=wbcheckconstraints
[10:31:10] nor in the constraintsRunCheck jobs afaict https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=constraintsRunCheck&var-dc=eqiad%20prometheus%2Fk8s
[10:31:37] ah, no, the job backlog time there got a bit backlogged, with spikes that look like they might be related
[10:31:40] (that row is collapsed by default)
[10:31:52] (PS8) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[10:31:54] (PS9) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[10:32:20] RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:26] (CR) Elukey:
sre.discovery.service-route: refactor to base/runner classes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:33:33] (CR) CI reject: [V: -1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:33:42] (CR) CI reject: [V: -1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:34:01] seems like "type fallback" gets called more frequently? could that cause more sparql queries to be sent?
[10:34:17] it would, yeah
[10:34:34] we try to answer “is X an instance of (subclass of) Y” by loading the entities in PHP first, and then fall back to using SPARQL instead
[10:35:45] might be data related then, wondering if we should relax the rate limiter on wdqs-internal
[10:36:05] yeah, I also wonder if it’s related to a change to P31/P279 statements on some very common item
[10:37:23] ok I'll start a task (we might just decline it if we're OK with the current behavior)
[10:37:30] ok, thanks
[10:40:28] SPARQL timeouts don’t seem to be common at all compared to the huge number of requests https://graphite.wikimedia.org/render?from=-2d&height=308&target=alias(movingAverage(consolidateBy(MediaWiki.wikibase.quality.constraints.sparql.error.timeout.count,%20%27sum%27),%205),%20%27timeout%27)&to=now&width=586
[10:40:48] (PS9) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[10:40:50] (PS10) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[10:41:11] (we set the timeout to 5 seconds)
[10:42:28] (CR)
CI reject: [V: -1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:42:30] (CR) CI reject: [V: -1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[10:47:14] some timeouts are being processed by blazegraph and it might return http-500 on these
[10:47:16] (PS1) Jelto: gitlab_runner: remove protected tag from Trusted Runners [puppet] - https://gerrit.wikimedia.org/r/870521 (https://phabricator.wikimedia.org/T325069)
[10:48:15] checking the logs it's mostly the "SELECT DISTINCT ?otherEntity WHERE ..." one
[10:48:17] (CR) Jelto: [C: -1] "https://gitlab.wikimedia.org/repos/abstract-wiki/ci-images/-/merge_requests/1 needs to be merged first" [puppet] - https://gerrit.wikimedia.org/r/870521 (https://phabricator.wikimedia.org/T325069) (owner: Jelto)
[10:54:34] Emperor: back, going to write a short incident status doc now
[10:55:44] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:46] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:59:44] (PS10) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[10:59:46] (PS11) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[11:07:17] (CR) Jbond: sre.discovery.service-route: refactor to base/runner classes (2 comments) [cookbooks] -
https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:10:14] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:40] !log installing php7.3 security updates on buster
[11:11:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:49] Emperor: first draft https://wikitech.wikimedia.org/wiki/Incidents/2022-12-21_shellbox-syntaxhighlight - feel free to amend
[11:17:43] (PS11) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[11:17:45] (PS12) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[11:17:59] (CR) Elukey: sre.discovery.service-route: refactor to base/runner classes (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:18:12] thanks for the review jbond :)
[11:19:25] (CR) CI reject: [V: -1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:19:27] (CR) CI reject: [V: -1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[11:23:15] SRE, Traffic, Patch-For-Review: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (Vgutierrez)
[11:25:33] j.ayme: thanks, have tweaked a bit, but looks good
[11:27:47] (CR) Muehlenhoff: [C: +1] "Looks good" [debs/trafficserver] - https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563)
(owner: 10Ssingh) [11:36:18] (03PS9) 10Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729 [11:38:55] (03PS12) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [11:38:57] (03PS13) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [11:39:00] /7 [11:39:04] err sorry :) [11:40:51] !log installing joblib security updates [11:40:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:40] (03PS1) 10Jcrespo: dbbackups: Set minimum database backup size to 10 000 bytes [puppet] - 10https://gerrit.wikimedia.org/r/870522 [11:44:02] 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff) [11:44:22] (03PS2) 10Jcrespo: dbbackups: Set minimum database backup size to 10 000 bytes [puppet] - 10https://gerrit.wikimedia.org/r/870522 [11:44:24] (03PS4) 10Jbond: kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739 [11:45:14] !log installing libbluray bugfix update for buster [11:45:15] (03CR) 10Jbond: kafka_config: set a real string for default api_version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [11:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:45:31] (03CR) 10Slyngshede: [C: 03+2] C:ldap::client::utils absent ldapsupportlib [puppet] - 10https://gerrit.wikimedia.org/r/869776 (owner: 10Slyngshede) [11:45:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38912/console" [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond) [11:47:57] (03PS3) 10Jbond: 
monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) [11:50:01] !log installing libde265 security updates [11:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:14] (03PS1) 10Muehlenhoff: Add library hint for libde265 [puppet] - 10https://gerrit.wikimedia.org/r/870523 [11:51:22] (03PS1) 10Slyngshede: C:ldap::client::utils remove ldapsupportlib [puppet] - 10https://gerrit.wikimedia.org/r/870524 [11:51:24] (03CR) 10Jcrespo: [C: 03+2] dbbackups: Set minimum database backup size to 10 000 bytes [puppet] - 10https://gerrit.wikimedia.org/r/870522 (owner: 10Jcrespo) [11:51:27] (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/870522/38913/backupmon1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/870522 (owner: 10Jcrespo) [12:00:12] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libde265 [puppet] - 10https://gerrit.wikimedia.org/r/870523 (owner: 10Muehlenhoff) [12:01:19] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10Wangombe) Done. I've updated my email address to my foundation email. 
[12:01:21] (03CR) 10Jbond: sre.discovery.service-route: refactor to base/runner classes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [12:01:47] (03PS1) 10Jcrespo: dbbackups: Start backin up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) [12:02:10] (03PS2) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) [12:02:35] !log cgoubert@cumin1001 conftool action : set/pooled=yes:weight=1; selector: dc=eqiad,cluster=parsoid,name=parse1003.eqiad.wmnet,service=canary [12:05:16] (03PS2) 10Jbond: wmflib: add new function to get first usable ip from network [puppet] - 10https://gerrit.wikimedia.org/r/869785 [12:07:24] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:11:10] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [12:12:11] 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10Clement_Goubert) [12:12:41] (03CR) 10Jbond: [C: 03+2] wmflib: add new function to get first usable ip from network [puppet] - 10https://gerrit.wikimedia.org/r/869785 (owner: 10Jbond) [12:12:45] 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10Clement_Goubert) [12:13:58] (03PS3) 10Jbond: 
sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) [12:14:09] (03PS3) 10Jbond: sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 [12:16:04] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) (owner: 10Jbond) [12:16:08] (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 (owner: 10Jbond) [12:17:18] (03CR) 10Clément Goubert: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [12:17:38] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) (owner: 10Jbond) [12:17:56] (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 (owner: 10Jbond) [12:17:58] (03CR) 10CI reject: [V: 04-1] sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [12:18:54] (03PS2) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) [12:19:50] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:19:51] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - 
https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [12:19:53] (03PS3) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) [12:21:23] (03CR) 10CI reject: [V: 04-1] sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert) [12:24:55] (03PS4) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) [12:30:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:40:33] (03CR) 10Muehlenhoff: "I'll deploy this when we're back in January." [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff) [12:40:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [12:42:19] (03PS3) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) [12:45:09] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez) [12:45:44] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:47:12] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:48:57] (03PS1) 10Jbond: Revert "rake - spdx: also check hiera files" [puppet] - 10https://gerrit.wikimedia.org/r/869801 [12:50:44] (03CR) 10Jbond: "@riccardo, going to revert this as adding headers to SPDX is a bit overkill and dosen't really add anything. 
however I can't remember what" [puppet] - 10https://gerrit.wikimedia.org/r/869801 (owner: 10Jbond) [12:52:27] (03CR) 10Clément Goubert: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche) [12:54:18] (03PS2) 10Jcrespo: Improvements on css [software/pampinus] - 10https://gerrit.wikimedia.org/r/829858 (owner: 10Ladsgroup) [12:54:20] (03PS1) 10Jcrespo: Add missing analytics backups monitoring [software/pampinus] - 10https://gerrit.wikimedia.org/r/870549 [12:54:22] (03PS1) 10Jcrespo: pampinus: Fix bugs with codfw-only sections & very small backups [software/pampinus] - 10https://gerrit.wikimedia.org/r/870550 (https://phabricator.wikimedia.org/T313582) [12:54:38] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, but let's wait for Riccardo" [puppet] - 10https://gerrit.wikimedia.org/r/869801 (owner: 10Jbond) [12:56:17] (03PS4) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) [12:58:08] (03CR) 10CI reject: [V: 04-1] dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [13:04:33] dcausse: I think from the constraints side we can live with the errors in T325730 for now [13:04:34] T325730: Wikidata constraint check is getting throttled from wdqs-internal more than usual - https://phabricator.wikimedia.org/T325730 [13:05:04] the relevant product people on our side are already on holiday, and I think this isn’t urgent enough to call them back, so I’d expect it to be prioritized next year [13:05:28] (if it’s a serious issue from the WDQS side, we can still try to do something about it… I’m still around until the end of this week ^^) [13:08:05] have there been any recent commits to Profile::Wmcs::Cloudlb::Haproxy? 
I got a CI error about those [13:08:54] (03CR) 10Jcrespo: "rebuild" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [13:09:18] Lucas_WMDE: the load on the internal wdqs cluster seems to stay reasonably constant, so no emergency. Looks like the throttling is working as expected and protecting the service. [13:09:32] phew :) [13:09:58] good that it’s working, I think I remember Stas being quite insistent that we needed to implement it :D [13:10:03] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [13:10:19] (and it’s also fortunate that we’re no longer (ab)using WDQS for regex checking) [13:11:12] using SPARQL for regex checking is the stuff that could give me nightmares! [13:13:13] Lucas_WMDE: sure! thanks for checking, no urgency on my side either [13:13:20] gehel: we needed something that had a timeout, unlike preg_match 😔 [13:13:57] "if the only tool you have is a hammer, everything looks like a thumb" [13:14:06] yup [13:17:06] (03CR) 10Jcrespo: "Is it possible that b27f6a080aeb078c1d9c03 may have broken Puppet's CI? https://integration.wikimedia.org/ci/job/operations-puppet-tests-b" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [13:21:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:26:06] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:27:42] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:31:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [13:32:06] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [13:32:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:32:41] !log installing node-minimatch security updates [13:32:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:39:06] !log installing nano bugfix updates from Bullseye point release [13:39:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:46] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [14:15:39] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3): Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata) p:05Triage→03High [14:19:18] (03CR) 10Alexandros Kosiaris: [C: 03+2] rsync: Fix a typo [puppet] - 10https://gerrit.wikimedia.org/r/869781 (owner: 10Alexandros Kosiaris) [14:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:29:46] 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3): Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata) [14:33:40] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:35:24] (03PS1) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [14:35:45] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [14:36:02] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 2552 MB (3% inode=58%): /tmp 2552 MB 
(3% inode=58%): /var/tmp 2552 MB (3% inode=58%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [14:39:57] (03PS2) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [14:40:18] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [14:44:10] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:44:32] 10SRE, 10API Platform, 10Commons, 10MediaWiki-File-management, and 7 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10VirginiaPoundstone) [14:48:04] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:51:56] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:53:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:56:26] (03CR) 10Mvolz: [C: 03+1] Specify Citoid RESTBase URL separately (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński) [14:56:38] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops 
[14:56:53] (03PS3) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [14:58:27] (03PS1) 10Muehlenhoff: os-reports: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870558 [14:58:36] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:58:43] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:00:17] (03CR) 10CI reject: [V: 04-1] os-reports: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870558 (owner: 10Muehlenhoff) [15:00:24] (03PS2) 10MVernon: swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) [15:00:45] (03PS4) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [15:02:19] (03CR) 10CI reject: [V: 04-1] swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:02:45] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:02:54] (03PS1) 10Muehlenhoff: Remove access for jeh [puppet] - 10https://gerrit.wikimedia.org/r/870559 [15:05:17] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for jeh [puppet] - 10https://gerrit.wikimedia.org/r/870559 (owner: 10Muehlenhoff) [15:14:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:18:37] (03PS5) 10Jbond: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:18:39] (03PS1) 10Jbond: wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562 [15:19:06] (03CR) 10Jbond: [C: 03+2] wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562 (owner: 10Jbond) [15:19:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [15:20:48] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:20:54] (03CR) 10CI reject: [V: 04-1] wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562 (owner: 10Jbond) [15:21:32] (03PS2) 10Jbond: wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562 [15:22:09] (03CR) 10JHathaway: [C: 03+2] Add a Puppetfile to track vendored modules [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [15:22:44] (03CR) 10MVernon: "[setting self to CC so I know when I can rebase my CRs]" [puppet] - 10https://gerrit.wikimedia.org/r/870562 (owner: 10Jbond) [15:24:15] (03PS6) 10Jbond: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:26:02] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 
10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:26:04] (03PS4) 10JHathaway: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) [15:27:16] (03PS3) 10MVernon: swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) [15:27:38] (03PS7) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [15:28:27] (03CR) 10JHathaway: [C: 03+2] Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [15:30:04] (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [15:31:17] (03PS3) 10JHathaway: Add vendored module bodgit/puppet-postfix [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) [15:32:23] (03CR) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:33:05] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:35:09] (03PS8) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) [15:36:48] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 145 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:36:54] (03CR) 
10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo) [15:38:24] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:38:43] (03CR) 10Jbond: [C: 03+1] Add vendored module bodgit/puppet-postfix [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway) [15:42:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] Add vendored module bodgit/puppet-postfix [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway) [15:43:51] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [15:50:53] 10SRE, 10Traffic-Icebox, 10Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (10BCornwall) [15:51:07] 10SRE, 10DC-Ops, 10Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (10BCornwall) 05Open→03Resolved Thanks for handling that, @ayounsi! 
[15:51:17] (03CR) 10JHathaway: [C: 03+2] Add vendored module bodgit/puppet-postfix [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway) [15:53:12] (03CR) 10Elukey: sre.discovery.service-route: refactor to base/runner classes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [15:53:19] (03PS13) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [15:53:21] (03PS14) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [15:54:56] (03CR) 10CI reject: [V: 04-1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [15:55:06] (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [15:55:39] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10BCornwall) a:03BCornwall [15:56:11] (03PS14) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [15:56:13] (03PS15) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [15:57:47] (03CR) 10CI reject: [V: 04-1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) 
[15:57:54] (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey) [15:59:08] reallyyyyyy [16:02:24] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (10fnegri) [16:02:49] (03PS16) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [16:08:43] (03PS15) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) [16:08:45] (03PS17) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [16:18:14] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) [16:20:11] 10SRE-tools, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Spicerack: Add CI step to test with wmcs cookbooks - https://phabricator.wikimedia.org/T325758 (10fnegri) [16:20:40] 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (10wiki_willy) a:03Cmjohnson [16:21:25] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [16:25:54] (03PS1) 10JHathaway: concat: make compatible *again* with stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) [16:27:22] (03CR) 10JHathaway: [V: 03+1] "PCC SUCCESS (NOOP 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38918/console" [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [16:27:59] (03CR) 10JHathaway: [V: 03+1] "review kindly!" [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [16:37:19] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff) [16:38:46] (03CR) 10Andrew Bogott: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [16:39:05] (03CR) 10JHathaway: [V: 03+1 C: 03+2] concat: make compatible *again* with stretch hosts [puppet] - 10https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway) [16:50:57] (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [16:54:15] (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38920/console" [puppet] - 10https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [16:54:54] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (10fnegri) [17:07:58] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:08:16] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[17:10:22] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:19:26] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:32:36] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:33:08] (03CR) 10Dzahn: "reading the latest ticket comments it sounds like this is not desired after all. should I just abandon?" [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [17:33:32] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:34:11] (03CR) 10Dzahn: "thanks! assuming no deployments are happening anyways during the holidays, I will merge this after code freeze then?" 
[puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [17:34:22] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [17:36:36] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:37:16] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:37:27] (03CR) 10Clément Goubert: mediawiki: download geoip databases on deployment servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [17:41:33] 10SRE, 10Gerrit, 10serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (10Dzahn) @ayounsi It would mean a considerable effort to recreate an entire LVS service, which we just recently shut down for Phabricator in a lenghty decom pr... [17:50:35] (03CR) 10Clément Goubert: "This change is ready for review." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (owner: 10Clément Goubert) [17:59:21] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: reformat file [puppet] - 10https://gerrit.wikimedia.org/r/870665 [17:59:23] (03PS1) 10Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes [puppet] - 10https://gerrit.wikimedia.org/r/870686 (https://phabricator.wikimedia.org/T325755) [18:03:37] (03CR) 10Dzahn: "no problem, sometimes things just change" [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [18:03:52] (03Abandoned) 10Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - 10https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: 10Dzahn) [18:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:22:09] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10BCornwall) [18:22:32] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10BCornwall) @XenoRyet can you grant approval for this access? Thanks! [18:22:51] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Soda) >>! In T325607#8483180, @Dzahn wrote: > If you are asking for access to the Search Console, please clarify who needs access to what and add t... [18:34:31] (03CR) 10Majavah: [C: 04-1] "I don't think the non-privileged PSP should have access to data on the nodes directly." 
[puppet] - 10https://gerrit.wikimedia.org/r/870686 (https://phabricator.wikimedia.org/T325755) (owner: 10Arturo Borrero Gonzalez) [18:40:55] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10BCornwall) p:05Triage→03Medium [18:43:51] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10BCornwall) @Ottomata can you confirm if this also needs analytics-privatedata-users group membership without ssh and kerberos? [18:52:39] (03PS4) 10Jcrespo: miniloader: Draft small utilitiy to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) [19:04:29] (03PS5) 10Jcrespo: miniloader: Draft small utility to load a mydumper dump in an emergency [software/wmfbackups] - 10https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383) [19:32:33] (03PS5) 10Vlad.shapik: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T325770) [19:34:01] 10SRE, 10SRE-Access-Requests, 10Security-Team, 10SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (10BCornwall) 05Stalled→03In progress a:03BCornwall [19:48:15] (03PS8) 10Vlad.shapik: Add the ability to specify the default DPI value for PDF files [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771) [19:58:16] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Dzahn) >>! In T325607#8485221, @Soda wrote: > Any idea who might be the best person to contact regarding this ? I didn't have an individual name.... 
[20:00:19] 10SRE, 10Wikimedia-Mailing-lists: Transfer ownership of Art+Feminism Wikimedians Mailing List to new moderators - https://phabricator.wikimedia.org/T325467 (10Dzahn) Is there actually a request for SRE in this? I think not, it was just auto-tagged SRE by maintenance bot. Let us know if that's not correct and... [20:01:15] (03PS1) 10BCornwall: admin: Add kelhurd to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/870708 (https://phabricator.wikimedia.org/T323943) [20:04:08] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:05:44] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:13:02] (03PS3) 10Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 [20:16:32] PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 1514 MB (2% inode=57%): /tmp 1514 MB (2% inode=57%): /var/tmp 1514 MB (2% inode=57%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [20:20:25] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) [20:23:25] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) a:03SCherukuwada [20:23:43] 10ops-eqiad: Port with no description on access 
switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [20:25:46] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10SCherukuwada) I need to do a couple of things to first make sure I have access to Wikisource in search console. As soon as that happens I'll dig ri... [20:26:32] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:27:06] 10SRE, 10ops-eqsin, 10Traffic, 10decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (10RobH) 05Open→03Resolved [20:27:52] 10ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (10RobH) 05Open→03Resolved a:03RobH invalid due to https://netbox.wikimedia.org/dcim/devices/2188/ [20:28:08] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:33:37] !log robh@cumin2002 START - Cookbook sre.dns.netbox [20:34:22] (03CR) 10SBassett: [C: 03+1] admin: Add kelhurd to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/870708 (https://phabricator.wikimedia.org/T323943) (owner: 10BCornwall) [20:34:46] !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:35:01] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (10RobH) [20:35:04] 10SRE, 10ops-eqsin, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (10RobH) 05Open→03Resolved a:03RobH confirmed all servers on this task are indeed decommissioned in netbox, removed the cable assignm... [20:36:28] 10SRE, 10SRE-OnFire, 10Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (10RobH) [20:36:34] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10netops, 10Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (10RobH) 05Open→03Stalled I'm setting this to stalled as the upgrade parent task should resolve this issue. 
[20:37:10] RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops [20:45:41] 10SRE, 10All-and-every-Wikisource, 10Product-Analytics, 10SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (10Darwinius) A single URL of a work I recently added to ws.pt, and on which I've been working in, appears to have been noticed by Google: https://pt.... [20:49:22] 10SRE-tools, 10Cloud-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773 (10Andrew) [20:49:33] (03PS4) 10Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) [20:51:30] (03CR) 10Andrew Bogott: Openstack backend: make use of all_tenants nova api flag (031 comment) [software/cumin] - 10https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: 10Andrew Bogott) [20:53:44] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (10phaultfinder) [20:59:38] (03PS1) 10Bking: query_service: Allow query hosts to rsync data from clouddumps [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) [21:02:51] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: 10Bking) [21:03:21] (03PS1) 10JHathaway: g10k cleanup [puppet] - 10https://gerrit.wikimedia.org/r/870717 [21:04:02] (03PS2) 10JHathaway: g10k cleanup [puppet] - 10https://gerrit.wikimedia.org/r/870717 [21:10:09] (03CR) 10JHathaway: [C: 03+2] g10k cleanup [puppet] - 10https://gerrit.wikimedia.org/r/870717 (owner: 
10JHathaway) [21:22:06] (03CR) 10Ahmon Dancy: "Other than the readOnly issue, this works fine in train-dev where the pods end up with empty /usr/share/GeoIP* directories because the min" [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (owner: 10Clément Goubert) [21:42:48] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/870535 [21:45:28] 10SRE, 10Wikimedia-Mailing-lists: Transfer ownership of Art+Feminism Wikimedians Mailing List to new moderators - https://phabricator.wikimedia.org/T325467 (10Masssly) No SRE request is needed. Thanks. [22:08:33] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10BCornwall) a:05Wangombe→03BCornwall [22:20:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:50:26] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [22:51:58] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:10:24] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [23:11:58] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [23:41:32] PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:43:00] RECOVERY - Ensure traffic_manager 
binds on 3128 and responds to HTTP requests on cp5030 is OK: HTTP OK: HTTP/1.1 200 Ok - 48238 bytes in 0.923 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:45:38] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:50:28] PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status