[00:03:56] <icinga-wm>	 PROBLEM - Check systemd state on lists1001 is CRITICAL: CRITICAL - degraded: The following units failed: discard_held_messages.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:12] <wikibugs>	 (03PS1) 10Jbond: django-sso: improve debug page [puppet] - 10https://gerrit.wikimedia.org/r/869857
[00:07:11] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway)
[00:08:51] <wikibugs>	 (03PS5) 10Jbond: Add a Puppetfile to track vendored modules [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway)
[00:09:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway)
[00:10:44] <wikibugs>	 (03PS3) 10Jbond: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway)
[00:26:47] <wikibugs>	 (03CR) 10Jbond: Add vendored module bodgit/puppet-postfix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: 10JHathaway)
[00:46:28] <icinga-wm>	 PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:51:14] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:00:23] <wikibugs>	 10SRE, 10Traffic, 10Performance-Team (Radar): Adapt all the things to localized Special: namespaces - https://phabricator.wikimedia.org/T105434 (10Krinkle)
[01:40:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:05:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:10:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:45] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:16:02] <icinga-wm>	 RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:20] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:48] <icinga-wm>	 PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:18] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:37:06] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:48:18] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:25] <wikibugs>	 (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/869260
[05:21:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/869260 (owner: 10PipelineBot)
[05:23:32] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:33:02] <icinga-wm>	 RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:20] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:37:50] <icinga-wm>	 PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:14:42] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account-stats_tegola:prod.service,swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:29:02] <icinga-wm>	 RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:33:40] <icinga-wm>	 PROBLEM - Check systemd state on es1024 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:34:20] <wikibugs>	 (03PS1) 10Marostegui: analytics_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869867 (https://phabricator.wikimedia.org/T325154)
[06:34:59] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[06:38:39] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] analytics_multiinstance.my.cnf.erb: Remove unix_socket mention [puppet] - 10https://gerrit.wikimedia.org/r/869867 (https://phabricator.wikimedia.org/T325154) (owner: 10Marostegui)
[06:39:59] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:43:05] <wikibugs>	 (03PS1) 10KartikMistry: WIP: Enable Content Translation/Section Translation on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870080 (https://phabricator.wikimedia.org/T325714)
[07:57:48] <icinga-wm>	 PROBLEM - SSH on wdqs2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:58:33] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:ldap::management use bitu-ldap from add-ldap-group [puppet] - 10https://gerrit.wikimedia.org/r/869824 (owner: 10Slyngshede)
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221221T0800)
[08:06:54] <icinga-wm>	 PROBLEM - SSH on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:50] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2010 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:17:34] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/869776 (owner: 10Slyngshede)
[08:23:30] <icinga-wm>	 PROBLEM - SSH on wdqs2011 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:26:50] <icinga-wm>	 PROBLEM - SSH on wdqs2012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:28:42] <icinga-wm>	 PROBLEM - Query Service HTTP Port on wdqs2009 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 649 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[08:28:46] <icinga-wm>	 PROBLEM - Check systemd state on wdqs2009 is CRITICAL: CRITICAL - Failed to connect to bus: Resource temporarily unavailable: unexpected https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:19] <ryankemper>	 !log Downtiming wdqs 20[09-12] until 2023-01-02 (these are new hosts not yet properly brought into service)
[08:32:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:47:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[08:51:04] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:51:12] <wikibugs>	 (03CR) 10Elukey: sre.discovery.service-route: refactor to base/runner classes (033 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[08:51:36] <icinga-wm>	 RECOVERY - SSH on wdqs2009 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:51:53] <wikibugs>	 (03PS2) 10Elukey: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677)
[08:51:55] <wikibugs>	 (03PS5) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[08:51:57] <wikibugs>	 (03PS4) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[08:53:17] <wikibugs>	 (03CR) 10Elukey: "Thanks for the review!" [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[08:53:44] <icinga-wm>	 RECOVERY - SSH on wdqs2010 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:53:50] <wikibugs>	 (03PS3) 10Elukey: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677)
[08:53:52] <wikibugs>	 (03PS6) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[08:53:54] <wikibugs>	 (03PS5) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[08:55:23] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[08:57:08] <icinga-wm>	 RECOVERY - Check systemd state on wdqs2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:17] <wikibugs>	 (03Merged) 10jenkins-bot: sre.k8s.pool-depool-cluster: update SAL/log description and add comments [cookbooks] - 10https://gerrit.wikimedia.org/r/869236 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[09:00:18] <jinxer-wm>	 (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:01:14] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4516 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:02:03] <jayme>	 Emperor: I think we're going to get paged soon
[09:02:16] <wikibugs>	 (03PS7) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[09:02:18] <wikibugs>	 (03PS6) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[09:02:24] <wikibugs>	 (03CR) 10Elukey: sre.discovery.service-route: refactor to base/runner classes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[09:03:18] <jinxer-wm>	 (ProbeDown) firing: Service shellbox-syntaxhighlight:4014 has failed probes (http_shellbox-syntaxhighlight_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#shellbox-syntaxhighlight:4014 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:03:52] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 5810 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:03:57] <jayme>	 there we go
[09:04:01] <Emperor>	 jayme: indeed so :-/
[09:04:12] <jayme>	 acked
[09:05:04] <Emperor>	 slightly odd set of things to be alerting
[09:05:17] <jayme>	 checking shellbox
[09:05:50] <jayme>	 big request spike in eqiad
[09:06:00] <jayme>	 +70req/s
[09:06:22] <icinga-wm>	 RECOVERY - SSH on wdqs2011 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:07:36] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6935 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:08:12] <jayme>	 !log increasing replicas of shellbox-syntaxhighlight from 12 to 50
[09:08:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:08:38] <Emperor>	 jayme: https://grafana.wikimedia.org/d/RKogW1m7z/shellbox?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=shellbox&var-namespace=shellbox&var-release=main&from=now-3h&to=now isn't showing me a request spike, where should I be looking?
[09:09:02] <jayme>	 Emperor: select shellbox-syntaxhighlight at the top
[09:09:08] <jayme>	 there are a bunch of shellboxes
[09:09:12] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[09:09:17] <jayme>	 !log correction: increasing replicas of shellbox-syntaxhighlight from 12 to 40
[09:09:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:10:16] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 3 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:10:18] <jinxer-wm>	 (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:10:19] <jayme>	 but looks like the request volume came back down on its own again before I scaled up
[09:11:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 209k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:12:23] <Emperor>	 jayme: reqs still on the high side per grafana; worth trying to find out why, or see if it subsides given we're now managing to service those requests?
[09:13:15] <jayme>	 I think it's worth finding out what's happening
[09:16:01] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic eqiad.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 209k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=eqiad%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[09:16:22] <dcausse>	 the CirrusSearchJobQueueBacklogTooBig alert is related to mw jobrunners and job processing times are higher (mean from 300ms to almost 1s) so not specific to api_server perhaps?
[09:16:38] <dcausse>	 job times are decreasing now
[09:16:51] <Emperor>	 Hm, went looking in logstash for kubernetes.namespace_name:"shellbox-syntaxhighlight" but that's not actually any use because it filters out the 200s
[09:16:52] <jayme>	 dcausse: thanks
[09:17:53] <jayme>	 Emperor: I had assumed an edit spike
[09:19:41] <wikibugs>	 (03PS7) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[09:20:00] <wikibugs>	 (03CR) 10Elukey: "Thanks a lot for the review :)" [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[09:21:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[09:22:09] <Emperor>	 jayme: would sound plausible, but our edits/s graph looks unremarkable
[09:22:46] <jayme>	 indeed
[09:22:49] <RhinosF1>	 Shellbox has had an issue before where preview has been reparsing too often
[09:23:20] <RhinosF1>	 I believe it's long fixed but not only edits will trigger it
[09:24:48] <wikibugs>	 (03PS8) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[09:24:58] <jayme>	 RhinosF1: right
[09:25:31] <jayme>	 Emperor: the big spike seems to have originated from the jobrunners https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=shellbox-syntaxhighlight
[09:26:03] <jayme>	 so probably related to the cirrussearch backlog after all
[09:27:01] <dcausse>	 this cirrus backlog correlates with page re-renders (edits+template change propagation)
[09:28:11] <jayme>	 makes sense
[09:29:02] <dcausse>	 possibly a popular template with some syntaxhighlighting tag got edited? 
[09:29:23] <jayme>	 api servers also issued way more shellbox requests during that period...that seems kind of unexpected as jobrunners should be handling those requests themselves, no?
[09:29:39] <jayme>	 dcausse: maybe. no idea how to figure that out tbh
[09:34:56] <jayme>	 Emperor: we still have quite elevated latency in eqiad according to https://grafana-rw.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-3h&to=now
[09:35:55] <jayme>	 might as well be europe getting up though
[09:37:17] <Emperor>	 jayme: yeah, if you expand to last 12 or 24 h, it's not outside our normal range
[09:37:24] <jayme>	 ack
[09:38:34] <Emperor>	 So beyond the slightly vexing question of what caused the spike (d.causse's theory seems sound but I've no idea where we'd find it), I think we're good again
[09:41:28] <jayme>	 well, syntaxhighlight requests are still above normal rate. maybe that's still jobrunner backlog getting processed...
[09:41:42] <icinga-wm>	 RECOVERY - SSH on wdqs2012 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:46:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:49:12] <dcausse>	 backlog (at least the cirrus one) is absorbed now https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=eqiad.mediawiki.job.cirrusSearchLinksUpdate&var-consumer_group=All but p99 of this job are great tho (flat around 40sec perhaps hitting a timeout?)
[09:49:23] <dcausse>	 s/great/not great/
[09:51:07] <wikibugs>	 (03CR) 10Jaime Nuche: [C: 03+1] admin: create new group deployment-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn)
[09:51:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[09:53:56] <jayme>	 hmm
[09:54:46] <jayme>	 I'm going to scale shellbox back down to 12 for now (as that side effect seems fine)
[09:55:17] <jayme>	 !log scaling shellbox-syntaxhighlight back to 12 replicas
[09:55:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:55:22] <Emperor>	 seems sensible
[10:01:18] <jayme>	 dcausse: I'm not super familiar but from the envoy metrics of jobrunners it seems that wdqs is kinda slow
[10:01:28] <jayme>	 and responding with more errors than usual
[10:01:41] <jayme>	 https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=wdqs-internal&from=now-3h&to=now
[10:07:45] <dcausse>	 jayme: indeed... looking
[10:08:03] <jayme>	 <3
[10:10:04] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:10:58] <dcausse>	 jayme: expanding the time range it appears to be a bit more usual https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=wdqs-internal&from=now-2d&to=now
[10:13:24] <dcausse>	 reason might be related to search usage by expert users (e.g. searching for deepcat:A_Category will call wdqs-internal and possibly traverse a huge category graph that might timeout)
[10:14:54] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:16:17] <dcausse>	 hm scratch this idea it's from jobrunners so it's related to wikidata constraint checks
[10:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:30] <jayme>	 it seems the p99 job runner duration has declined as well. It's also a 1h max, so it probably stays at its max for 60min even if there was a decline (at least that's what I understand)
[10:23:17] <dcausse>	 oh right, makes sense
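Editor's note: the "1h max" effect described just above can be illustrated with a small sketch. It assumes the dashboard panel shows a rolling maximum over the last 60 one-minute samples; the sample values and the `rolling_max` helper are hypothetical and are not taken from the actual Grafana panel or its query.

```python
from collections import deque


def rolling_max(samples, window):
    """Return, for each position, the max of the last `window` samples."""
    out = []
    candidates = deque()  # (index, value) pairs, values strictly decreasing
    for i, value in enumerate(samples):
        # Drop older candidates that can never be the max again.
        while candidates and candidates[-1][1] <= value:
            candidates.pop()
        candidates.append((i, value))
        # Drop the front candidate once it falls out of the window.
        if candidates[0][0] <= i - window:
            candidates.popleft()
        out.append(candidates[0][1])
    return out


# Hypothetical per-minute p99 durations (seconds): a 5-minute spike to 40s,
# then an immediate return to 1s.
p99 = [1.0] * 10 + [40.0] * 5 + [1.0] * 90
smoothed = rolling_max(p99, window=60)

# Find the last minute at which the panel would still show the spike.
last_high = max(i for i, v in enumerate(smoothed) if v == 40.0)
print("raw spike ends at minute 14, panel stays high until minute", last_high)
```

Under these assumptions the panel keeps showing the spike value for a full window after the underlying metric has already recovered, which matches the reading given in the channel.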
[10:24:13] <jayme>	 Okay. Let's call it closed then. We should still create an incident report as something like that is bound to happen again 
[10:24:16] <jayme>	 Emperor: I have to run a quick errand, no longer than 15min
[10:26:03] <dcausse>	 Lucas_WMDE: do you know if we collect some metrics regarding wikidata constraint checks (esp. the job constraintRunCheck which I believe talks to wdqs-internal)?
[10:26:45] <Emperor>	 j.ayme: ack
[10:27:10] <Lucas_WMDE>	 dcausse: let me see
[10:27:50] <Lucas_WMDE>	 https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&refresh=30s might have some useful metrics
[10:27:57] <Lucas_WMDE>	 especially the SPARQL section, I guess
[10:28:01] <dcausse>	 nothing urgent but just wondering if we should worry about the weird patterns we see here: https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-origin=jobrunner&var-origin_instance=All&var-destination=wdqs-internal&from=now-2d&to=now
[10:28:03] <dcausse>	 Lucas_WMDE: yes
[10:28:33] <Lucas_WMDE>	 looks like there were enough queries to get WBQC throttled
[10:29:11] <Lucas_WMDE>	 which doesn’t usually seem to happen https://grafana.wikimedia.org/d/000000344/wikidata-quality?orgId=1&refresh=30s&viewPanel=26&from=now-90d&to=now
[10:30:23] <dcausse>	 I think I might file a task; something seems to degrade
[10:30:29] <Lucas_WMDE>	 nothing stands out in the wbcheckconstraints API requests though https://grafana.wikimedia.org/d/000000559/api-requests-breakdown?refresh=5m&orgId=1&from=now-2d&to=now&var-metric=p95&var-module=wbcheckconstraints
[10:31:10] <Lucas_WMDE>	 nor in the constraintsRunCheck jobs afaict https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-job=constraintsRunCheck&var-dc=eqiad%20prometheus%2Fk8s
[10:31:37] <Lucas_WMDE>	 ah, no, the job backlog time there got a bit backlogged, with spikes that look like they might be related
[10:31:40] <Lucas_WMDE>	 (that row is collapsed by default)
[10:31:52] <wikibugs>	 (03PS8) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[10:31:54] <wikibugs>	 (03PS9) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[10:32:20] <icinga-wm>	 RECOVERY - Check systemd state on es1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:32:26] <wikibugs>	 (03CR) 10Elukey: sre.discovery.service-route: refactor to base/runner classes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[10:33:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[10:33:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[10:34:01] <dcausse>	 seems like "type fallback" gets called more frequently? could that cause more sparql queries to be sent?
[10:34:17] <Lucas_WMDE>	 it would, yeah
[10:34:34] <Lucas_WMDE>	 we try to answer “is X an instance of (subclass of) Y” by loading the entities in PHP first, and then fall back to using SPARQL instead
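Editor's note: a rough sketch of the "load entities first, then fall back to SPARQL" type check described above. This is not the WikibaseQualityConstraints code (which is PHP). The `load_entity` and `sparql_ask` callables, the `max_entities` budget, and the rule for when the fallback triggers are all assumptions made for illustration. P31 (instance of) and P279 (subclass of) are the Wikidata properties involved.

```python
from collections import deque


def is_instance_of(item, target, load_entity, sparql_ask, max_entities=10):
    """Check whether `item` is an instance of (a subclass of) `target`.

    load_entity(entity_id) -> dict with optional "P31" / "P279" lists (hypothetical).
    sparql_ask(item, target) -> bool, the expensive fallback (hypothetical).
    """
    loaded = 1
    queue = deque(load_entity(item).get("P31", []))
    seen = set()
    while queue:
        cls = queue.popleft()
        if cls == target:
            return True
        if cls in seen:
            continue
        seen.add(cls)
        if loaded >= max_entities:
            # Budget exhausted: hand the question to the query service instead.
            return sparql_ask(item, target)
        loaded += 1
        queue.extend(load_entity(cls).get("P279", []))
    return False


# Tiny hypothetical graph: Q1 is an instance of Q2; Q2 -> Q3 -> Q4 via subclass-of.
graph = {
    "Q1": {"P31": ["Q2"]},
    "Q2": {"P279": ["Q3"]},
    "Q3": {"P279": ["Q4"]},
    "Q4": {},
}
fallback_calls = []


def fake_sparql(item, target):
    fallback_calls.append((item, target))
    return True


in_memory = is_instance_of("Q1", "Q4", graph.__getitem__, fake_sparql)
with_fallback = is_instance_of("Q1", "Q4", graph.__getitem__, fake_sparql, max_entities=2)
print(in_memory, with_fallback, fallback_calls)
```

In this sketch, a deeper or wider class hierarchy exhausts the in-memory budget more often, so more checks end up as SPARQL queries. That is one way a data change on a common item could raise query volume, as speculated in the channel.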
[10:35:45] <dcausse>	 might be data related then, wondering if we should relax the rate limiter on wdqs-internal
[10:36:05] <Lucas_WMDE>	 yeah, I also wonder if it’s related to a change to P31/P279 statements on some very common item
[10:37:23] <dcausse>	 ok I'll start a task (we might just decline it if we're OK with the current behavior)
[10:37:30] <Lucas_WMDE>	 ok, thanks
[10:40:28] <Lucas_WMDE>	 SPARQL timeouts don’t seem to be common at all compared to the huge number of requests https://graphite.wikimedia.org/render?from=-2d&height=308&target=alias(movingAverage(consolidateBy(MediaWiki.wikibase.quality.constraints.sparql.error.timeout.count,%20%27sum%27),%205),%20%27timeout%27)&to=now&width=586
[10:40:48] <wikibugs>	 (03PS9) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[10:40:50] <wikibugs>	 (03PS10) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[10:41:11] <Lucas_WMDE>	 (we set the timeout to 5 seconds)
[10:42:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[10:42:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[10:47:14] <dcausse>	 some timeouts are being processed by blazegraph and it might return http-500 on these
[10:47:16] <wikibugs>	 (03PS1) 10Jelto: gitlab_runner: remove protected tag from Trusted Runners [puppet] - 10https://gerrit.wikimedia.org/r/870521 (https://phabricator.wikimedia.org/T325069)
[10:48:15] <dcausse>	 checking the logs it's mostly the "SELECT DISTINCT ?otherEntity WHERE ..." one
[10:48:17] <wikibugs>	 (03CR) 10Jelto: [C: 04-1] "https://gitlab.wikimedia.org/repos/abstract-wiki/ci-images/-/merge_requests/1 needs to be merged first" [puppet] - 10https://gerrit.wikimedia.org/r/870521 (https://phabricator.wikimedia.org/T325069) (owner: 10Jelto)
[10:54:34] <jayme>	 Emperor: back, going to write a short incident status doc now
[10:55:44] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:57:46] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:59:44] <wikibugs>	 (03PS10) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[10:59:46] <wikibugs>	 (03PS11) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[11:07:17] <wikibugs>	 (03CR) 10Jbond: sre.discovery.service-route: refactor to base/runner classes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[11:10:14] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:11:40] <moritzm>	 !log installing php7.3 security updates on buster
[11:11:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:13:49] <jayme>	 Emperor: first draft https://wikitech.wikimedia.org/wiki/Incidents/2022-12-21_shellbox-syntaxhighlight - feel free to amend
[11:17:43] <wikibugs>	 (03PS11) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[11:17:45] <wikibugs>	 (03PS12) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[11:17:59] <wikibugs>	 (03CR) 10Elukey: sre.discovery.service-route: refactor to base/runner classes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[11:18:12] <elukey>	 thanks for the review jbond :)
[11:19:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[11:19:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[11:23:15] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Package and deploy ATS 9.1.4 - https://phabricator.wikimedia.org/T325563 (10Vgutierrez)
[11:25:33] <Emperor>	 j.ayme: thanks, have tweaked a bit, but looks good
[11:27:47] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/869282 (https://phabricator.wikimedia.org/T325563) (owner: 10Ssingh)
[11:36:18] <wikibugs>	 (03PS9) 10Giuseppe Lavagetto: sre.switchdc.mediawiki: adapt to a/a mediawiki [cookbooks] - 10https://gerrit.wikimedia.org/r/836729
[11:38:55] <wikibugs>	 (03PS12) 10Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[11:38:57] <wikibugs>	 (03PS13) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[11:39:00] <elukey>	 /7
[11:39:04] <elukey>	 err sorry :)
[11:40:51] <moritzm>	 !log installing joblib security updates
[11:40:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:40] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Set minimum database backup size to 10 000 bytes [puppet] - 10https://gerrit.wikimedia.org/r/870522
[11:44:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (10MoritzMuehlenhoff)
[11:44:22] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Set minimum database backup size to 10 000 bytes [puppet] - 10https://gerrit.wikimedia.org/r/870522
[11:44:24] <wikibugs>	 (03PS4) 10Jbond: kafka_config: set a real string for default api_version [puppet] - 10https://gerrit.wikimedia.org/r/868739
[11:45:14] <moritzm>	 !log installing libbluray bugfix update for buster
[11:45:15] <wikibugs>	 (03CR) 10Jbond: kafka_config: set a real string for default api_version (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond)
[11:45:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:31] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] C:ldap::client::utils absent ldapsupportlib [puppet] - 10https://gerrit.wikimedia.org/r/869776 (owner: 10Slyngshede)
[11:45:44] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38912/console" [puppet] - 10https://gerrit.wikimedia.org/r/868739 (owner: 10Jbond)
[11:47:57] <wikibugs>	 (03PS3) 10Jbond: monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783)
[11:50:01] <moritzm>	 !log installing libde265 security updates
[11:50:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:14] <wikibugs>	 (03PS1) 10Muehlenhoff: Add library hint for libde265 [puppet] - 10https://gerrit.wikimedia.org/r/870523
[11:51:22] <wikibugs>	 (03PS1) 10Slyngshede: C:ldap::client::utils remove ldapsupportlib [puppet] - 10https://gerrit.wikimedia.org/r/870524
[11:51:24] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] dbbackups: Set minimum database backup size to 10 000 bytes [puppet] - 10https://gerrit.wikimedia.org/r/870522 (owner: 10Jcrespo)
[11:51:27] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+2] "https://puppet-compiler.wmflabs.org/output/870522/38913/backupmon1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/870522 (owner: 10Jcrespo)
[12:00:12] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for libde265 [puppet] - 10https://gerrit.wikimedia.org/r/870523 (owner: 10Muehlenhoff)
[12:01:19] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10Wangombe) Done. I've updated my email address to my foundation email.
[12:01:21] <wikibugs>	 (03CR) 10Jbond: sre.discovery.service-route: refactor to base/runner classes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: 10Elukey)
[12:01:47] <wikibugs>	 (03PS1) 10Jcrespo: dbbackups: Start backin up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[12:02:10] <wikibugs>	 (03PS2) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[12:02:35] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=yes:weight=1; selector: dc=eqiad,cluster=parsoid,name=parse1003.eqiad.wmnet,service=canary
[12:05:16] <wikibugs>	 (03PS2) 10Jbond: wmflib: add new function to get first usable ip from network [puppet] - 10https://gerrit.wikimedia.org/r/869785
[12:07:24] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:11:10] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond)
[12:12:11] <wikibugs>	 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10Clement_Goubert)
[12:12:41] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmflib: add new function to get first usable ip from network [puppet] - 10https://gerrit.wikimedia.org/r/869785 (owner: 10Jbond)
[12:12:45] <wikibugs>	 10SRE, 10DiscussionTools, 10MW-1.40-notes (1.40.0-wmf.17; 2023-01-02), 10Patch-For-Review, 10Wikimedia-Incident: API appserver CPU exhaustion probably due to DiscussionTools - https://phabricator.wikimedia.org/T325477 (10Clement_Goubert)
[12:13:58] <wikibugs>	 (03PS3) 10Jbond: sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606)
[12:14:09] <wikibugs>	 (03PS3) 10Jbond: sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550
[12:16:04] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) (owner: 10Jbond)
[12:16:08] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 (owner: 10Jbond)
[12:17:18] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert)
[12:17:38] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: return status of the cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/867544 (https://phabricator.wikimedia.org/T324606) (owner: 10Jbond)
[12:17:56] <wikibugs>	 (03Merged) 10jenkins-bot: sre.hardware.upgrade-firmware: prevent upgrading drivers if idrac to low [cookbooks] - 10https://gerrit.wikimedia.org/r/867550 (owner: 10Jbond)
[12:17:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert)
[12:18:54] <wikibugs>	 (03PS2) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739)
[12:19:50] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:19:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff)
[12:19:53] <wikibugs>	 (03PS3) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739)
[12:21:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739) (owner: 10Clément Goubert)
[12:24:55] <wikibugs>	 (03PS4) 10Clément Goubert: sre.mediawiki.restart-appservers: Fix clusters [cookbooks] - 10https://gerrit.wikimedia.org/r/870548 (https://phabricator.wikimedia.org/T325739)
[12:30:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:40:33] <wikibugs>	 (03CR) 10Muehlenhoff: "I'll deploy this when we're back in January." [puppet] - 10https://gerrit.wikimedia.org/r/860551 (https://phabricator.wikimedia.org/T311235) (owner: 10Muehlenhoff)
[12:40:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[12:42:19] <wikibugs>	 (03PS3) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[12:45:09] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: haproxy: support 3 firewalling options [puppet] - 10https://gerrit.wikimedia.org/r/868726 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[12:45:44] <icinga-wm>	 PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[12:47:12] <icinga-wm>	 RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[12:48:57] <wikibugs>	 (03PS1) 10Jbond: Revert "rake - spdx: also check hiera files" [puppet] - 10https://gerrit.wikimedia.org/r/869801
[12:50:44] <wikibugs>	 (03CR) 10Jbond: "@riccardo, going to revert this as adding headers to SPDX is a bit overkill and doesn't really add anything.  however i can't remember what" [puppet] - 10https://gerrit.wikimedia.org/r/869801 (owner: 10Jbond)
[12:52:27] <wikibugs>	 (03CR) 10Clément Goubert: mwdebug_deploy: remove configuration (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/867221 (owner: 10Jaime Nuche)
[12:54:18] <wikibugs>	 (03PS2) 10Jcrespo: Improvements on css [software/pampinus] - 10https://gerrit.wikimedia.org/r/829858 (owner: 10Ladsgroup)
[12:54:20] <wikibugs>	 (03PS1) 10Jcrespo: Add missing analytics backups monitoring [software/pampinus] - 10https://gerrit.wikimedia.org/r/870549
[12:54:22] <wikibugs>	 (03PS1) 10Jcrespo: pampinus: Fix bugs with codfw-only sections & very small backups [software/pampinus] - 10https://gerrit.wikimedia.org/r/870550 (https://phabricator.wikimedia.org/T313582)
[12:54:38] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, but let's wait for Riccardo" [puppet] - 10https://gerrit.wikimedia.org/r/869801 (owner: 10Jbond)
[12:56:17] <wikibugs>	 (03PS4) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582)
[12:58:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] dbbackups: Start backing up backup1-eqiad and backup1-codfw sections [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[13:04:33] <Lucas_WMDE>	 dcausse: I think from the constraints side we can live with the errors in T325730 for now
[13:04:34] <stashbot>	 T325730: Wikidata constraint check is getting throttled from wdqs-internal more than usual - https://phabricator.wikimedia.org/T325730
[13:05:04] <Lucas_WMDE>	 the relevant product people on our side are already on holiday, and I think this isn’t urgent enough to call them back, so I’d expect it to be prioritized next year
[13:05:28] <Lucas_WMDE>	 (if it’s a serious issue from the WDQS side, we can still try to do something about it… I’m still around until the end of this week ^^)
[13:08:05] <jynus>	 have there been any recent commits to Profile::Wmcs::Cloudlb::Haproxy ? I got a CI error about those
[13:08:54] <wikibugs>	 (03CR) 10Jcrespo: "rebuild" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[13:09:18] <gehel>	 Lucas_WMDE: the load on the internal wdqs cluster seems to stay reasonably constant, so no emergency. Looks like the throttling is working as expected and protecting the service.
[13:09:32] <Lucas_WMDE>	 phew :)
[13:09:58] <Lucas_WMDE>	 good that it’s working, I think I remember Stas being quite insistent that we needed to implement it :D
[13:10:03] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[13:10:19] <Lucas_WMDE>	 (and it’s also fortunate that we’re no longer (ab)using WDQS for regex checking)
[13:11:12] <gehel>	 using SPARQL for regex checking is the stuff that could give me nightmares!
[13:13:13] <dcausse>	 Lucas_WMDE: sure! thanks for checking, no urgency on my side either
[13:13:20] <Lucas_WMDE>	 gehel: we needed something that had a timeout, unlike preg_match 😔
[13:13:57] <gehel>	 "if the only tool you have is a hammer, everything looks like a thumb"
[13:14:06] <Lucas_WMDE>	 yup
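Editor's note: the underlying need mentioned above, a regex match that can be abandoned after a time limit, can be sketched in Python by running the match in a child process and killing it on timeout. This is a minimal illustration of the idea only, not how the constraint checks are implemented today; the function name `match_with_timeout`, the exit-code convention, and the default limit are all assumptions made for the example.

```python
import subprocess
import sys
from typing import Optional

# Child program: exit 0 on a full match, 3 on no match.
# Any other exit code (for example an invalid pattern) counts as an error.
_CHILD = (
    "import re, sys\n"
    "sys.exit(0 if re.fullmatch(sys.argv[1], sys.argv[2]) else 3)\n"
)


def match_with_timeout(
    pattern: str, text: str, timeout_seconds: float = 5.0
) -> Optional[bool]:
    """Return True/False for match/no match, or None if the time limit hit."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD, pattern, text],
            timeout=timeout_seconds,  # the child is killed when this expires
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode == 0:
        return True
    if result.returncode == 3:
        return False
    raise ValueError(f"regex check failed with exit code {result.returncode}")
```

A separate process costs more per check than an in-process call, but it gives a hard upper bound on runtime even for patterns with catastrophic backtracking such as `(a+)+$`.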
[13:17:06] <wikibugs>	 (03CR) 10Jcrespo: "Is it possible that b27f6a080aeb078c1d9c03 may have broken Puppet's CI? https://integration.wikimedia.org/ci/job/operations-puppet-tests-b" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[13:21:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:26:06] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 119 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:27:42] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:31:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[13:32:06] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff)
[13:32:32] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 180 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:32:41] <moritzm>	 !log installing node-minimatch security updates
[13:32:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:34:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[13:39:06] <moritzm>	 !log installing nano bugfix updates from Bullseye point release
[13:39:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (10MoritzMuehlenhoff)
[14:15:39] <wikibugs>	 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3): Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata) p:05Triage→03High
[14:19:18] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] rsync: Fix a typo [puppet] - 10https://gerrit.wikimedia.org/r/869781 (owner: 10Alexandros Kosiaris)
[14:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:29:46] <wikibugs>	 10SRE-OnFire, 10Observability-Alerting, 10SRE Observability (FY2022/2023-Q3): Improve AlertManager alert titles as sent to VictorOps - https://phabricator.wikimedia.org/T317240 (10lmata)
[14:33:40] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:35:24] <wikibugs>	 (03PS1) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[14:35:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[14:36:02] <icinga-wm>	 PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 2552 MB (3% inode=58%): /tmp 2552 MB (3% inode=58%): /var/tmp 2552 MB (3% inode=58%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[14:39:57] <wikibugs>	 (03PS2) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[14:40:18] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[14:44:10] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:44:32] <wikibugs>	 10SRE, 10API Platform, 10Commons, 10MediaWiki-File-management, and 7 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10VirginiaPoundstone)
[14:48:04] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:51:56] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:53:22] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:56:26] <wikibugs>	 (03CR) 10Mvolz: [C: 03+1] Specify Citoid RESTBase URL separately (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/869226 (https://phabricator.wikimedia.org/T325425) (owner: 10Bartosz Dziewoński)
[14:56:38] <icinga-wm>	 RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[14:56:53] <wikibugs>	 (03PS3) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[14:58:27] <wikibugs>	 (03PS1) 10Muehlenhoff: os-reports: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870558
[14:58:36] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:58:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:00:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] os-reports: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/870558 (owner: 10Muehlenhoff)
[15:00:24] <wikibugs>	 (03PS2) 10MVernon: swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123)
[15:00:45] <wikibugs>	 (03PS4) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[15:02:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:02:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:02:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove access for jeh [puppet] - 10https://gerrit.wikimedia.org/r/870559
[15:05:17] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove access for jeh [puppet] - 10https://gerrit.wikimedia.org/r/870559 (owner: 10Muehlenhoff)
[15:14:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:18:37] <wikibugs>	 (03PS5) 10Jbond: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:18:39] <wikibugs>	 (03PS1) 10Jbond: wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562
[15:19:06] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562 (owner: 10Jbond)
[15:19:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[15:20:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:20:54] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562 (owner: 10Jbond)
[15:21:32] <wikibugs>	 (03PS2) 10Jbond: wmcs::cloudb::haproxy: fix up spec tests [puppet] - 10https://gerrit.wikimedia.org/r/870562
[15:22:09] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Add a Puppetfile to track vendored modules [puppet] - 10https://gerrit.wikimedia.org/r/869316 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway)
[15:22:44] <wikibugs>	 (03CR) 10MVernon: "[setting self to CC so I know when I can rebase my CRs]" [puppet] - 10https://gerrit.wikimedia.org/r/870562 (owner: 10Jbond)
[15:24:15] <wikibugs>	 (03PS6) 10Jbond: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:26:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:26:04] <wikibugs>	 (03PS4) 10JHathaway: Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597)
[15:27:16] <wikibugs>	 (03PS3) 10MVernon: swift: move accounts_keys to common hiera [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123)
[15:27:38] <wikibugs>	 (03PS7) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[15:28:27] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Upgrade concat to v7.3.0 to support stdlib 8.X [puppet] - 10https://gerrit.wikimedia.org/r/869845 (https://phabricator.wikimedia.org/T325597) (owner: 10JHathaway)
[15:30:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon)
[15:31:17] <wikibugs>	 (03PS3) 10JHathaway: Add vendored module bodgit/puppet-postfix [puppet] - 10https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396)
[15:32:23] <wikibugs>	 (03CR) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:33:05] <wikibugs>	 (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:35:09] <wikibugs>	 (03PS8) 10MVernon: swift: add swift::rclone [puppet] - 10https://gerrit.wikimedia.org/r/870555 (https://phabricator.wikimedia.org/T162123)
[15:36:48] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 145 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:36:54] <wikibugs>	 (03CR) 10Jcrespo: dbbackups: Start backing up backup1-eqiad and backup1-codfw sections (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/870546 (https://phabricator.wikimedia.org/T313582) (owner: 10Jcrespo)
[15:38:24] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[15:38:43] <wikibugs>	 (CR) Jbond: [C: +1] Add vendored module bodgit/puppet-postfix [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[15:42:56] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] Add vendored module bodgit/puppet-postfix [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[15:43:51] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[15:50:53] <wikibugs>	 SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (BCornwall)
[15:51:07] <wikibugs>	 SRE, DC-Ops, Traffic-Icebox: Fix recdns config on various hardware devices - https://phabricator.wikimedia.org/T254178 (BCornwall) Open→Resolved Thanks for handling that, @ayounsi!
[15:51:17] <wikibugs>	 (CR) JHathaway: [C: +2] Add vendored module bodgit/puppet-postfix [puppet] - https://gerrit.wikimedia.org/r/868748 (https://phabricator.wikimedia.org/T325396) (owner: JHathaway)
[15:53:12] <wikibugs>	 (CR) Elukey: sre.discovery.service-route: refactor to base/runner classes (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[15:53:19] <wikibugs>	 (PS13) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[15:53:21] <wikibugs>	 (PS14) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[15:54:56] <wikibugs>	 (CR) CI reject: [V: -1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[15:55:06] <wikibugs>	 (CR) CI reject: [V: -1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[15:55:39] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (BCornwall) a:BCornwall
[15:56:11] <wikibugs>	 (PS14) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[15:56:13] <wikibugs>	 (PS15) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[15:57:47] <wikibugs>	 (CR) CI reject: [V: -1] sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[15:57:54] <wikibugs>	 (CR) CI reject: [V: -1] sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) (owner: Elukey)
[15:59:08] <elukey>	 reallyyyyyy
[16:02:24] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Update Spicerack documentation - https://phabricator.wikimedia.org/T325754 (fnegri)
[16:02:49] <wikibugs>	 (PS16) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[16:08:43] <wikibugs>	 (PS15) Elukey: sre.discovery.service-route: refactor to base/runner classes [cookbooks] - https://gerrit.wikimedia.org/r/869269 (https://phabricator.wikimedia.org/T277677)
[16:08:45] <wikibugs>	 (PS17) Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677)
[16:18:14] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (fnegri)
[16:20:11] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): Spicerack: Add CI step to test with wmcs cookbooks - https://phabricator.wikimedia.org/T325758 (fnegri)
[16:20:40] <wikibugs>	 ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (wiki_willy) a:Cmjohnson
[16:21:25] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri)
[16:25:54] <wikibugs>	 (PS1) JHathaway: concat: make compatible *again* with stretch hosts [puppet] - https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597)
[16:27:22] <wikibugs>	 (CR) JHathaway: [V: +1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38918/console" [puppet] - https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[16:27:59] <wikibugs>	 (CR) JHathaway: [V: +1] "review kindly!" [puppet] - https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[16:37:19] <wikibugs>	 SRE, Infrastructure-Foundations: Integrate Bullseye 11.6 point update - https://phabricator.wikimedia.org/T325186 (MoritzMuehlenhoff)
[16:38:46] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[16:39:05] <wikibugs>	 (CR) JHathaway: [V: +1 C: +2] concat: make compatible *again* with stretch hosts [puppet] - https://gerrit.wikimedia.org/r/870640 (https://phabricator.wikimedia.org/T325597) (owner: JHathaway)
[16:50:57] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: Clément Goubert)
[16:54:15] <wikibugs>	 (CR) Clément Goubert: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38920/console" [puppet] - https://gerrit.wikimedia.org/r/870565 (https://phabricator.wikimedia.org/T288375) (owner: Clément Goubert)
[16:54:54] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Patch-For-Review, cloud-services-team (Kanban): WMCS Cookbook Automation Q2 tracking task - https://phabricator.wikimedia.org/T319401 (fnegri)
[17:07:58] <icinga-wm>	 PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:08:16] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:10:22] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:19:26] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:32:36] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:33:08] <wikibugs>	 (CR) Dzahn: "reading the latest ticket comments it sounds like this is not desired after all. should I just abandon?" [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: Dzahn)
[17:33:32] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:34:11] <wikibugs>	 (CR) Dzahn: "thanks! assuming no deployments are happening anyways during the holidays, I will merge this after code freeze then?" [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[17:34:22] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[17:36:36] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.058 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:37:16] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:37:27] <wikibugs>	 (CR) Clément Goubert: mediawiki: download geoip databases on deployment servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: Dzahn)
[17:41:33] <wikibugs>	 SRE, Gerrit, serviceops-collab: move gerrit.wm.org SSH service to private/behind LVS like phab-vcs - https://phabricator.wikimedia.org/T165631 (Dzahn) @ayounsi It would mean a considerable effort to recreate an entire LVS service, which we just recently shut down for Phabricator in a lengthy decom pr...
[17:50:35] <wikibugs>	 (CR) Clément Goubert: "This change is ready for review." [deployment-charts] - https://gerrit.wikimedia.org/r/870660 (owner: Clément Goubert)
[17:59:21] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: reformat file [puppet] - https://gerrit.wikimedia.org/r/870665
[17:59:23] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: kubeadm: psp: base-pod-security-policies.yaml: allow hostPath volumes [puppet] - https://gerrit.wikimedia.org/r/870686 (https://phabricator.wikimedia.org/T325755)
[18:03:37] <wikibugs>	 (CR) Dzahn: "no problem, sometimes things just change" [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: Dzahn)
[18:03:52] <wikibugs>	 (Abandoned) Dzahn: mediawiki: download geoip databases on deployment servers [puppet] - https://gerrit.wikimedia.org/r/868199 (https://phabricator.wikimedia.org/T288375) (owner: Dzahn)
[18:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:22:09] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (BCornwall)
[18:22:32] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (BCornwall) @XenoRyet can you grant approval for this access? Thanks!
[18:22:51] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Soda) >>! In T325607#8483180, @Dzahn wrote: > If you are asking for access to the Search Console, please clarify who needs access to what and add t...
[18:34:31] <wikibugs>	 (CR) Majavah: [C: -1] "I don't think the non-privileged PSP should have access to data on the nodes directly." [puppet] - https://gerrit.wikimedia.org/r/870686 (https://phabricator.wikimedia.org/T325755) (owner: Arturo Borrero Gonzalez)
[18:40:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (BCornwall) p:Triage→Medium
[18:43:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (BCornwall) @Ottomata can you confirm if this also needs analytics-privatedata-users group membership without ssh and kerberos?
[18:52:39] <wikibugs>	 (PS4) Jcrespo: miniloader: Draft small utilitiy to load a mydumper dump in an emergency [software/wmfbackups] - https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383)
[19:04:29] <wikibugs>	 (PS5) Jcrespo: miniloader: Draft small utility to load a mydumper dump in an emergency [software/wmfbackups] - https://gerrit.wikimedia.org/r/863264 (https://phabricator.wikimedia.org/T319383)
[19:32:33] <wikibugs>	 (PS5) Vlad.shapik: Add ability to specify filters such as sharpening and etc. for TIFF format [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/863399 (https://phabricator.wikimedia.org/T325770)
[19:34:01] <wikibugs>	 SRE, SRE-Access-Requests, Security-Team, SecTeam-Processed: Add Kelton Hurd to deployment and analytics-privatedata-users groups - https://phabricator.wikimedia.org/T323943 (BCornwall) Stalled→In progress a:BCornwall
[19:48:15] <wikibugs>	 (PS8) Vlad.shapik: Add the ability to specify the default DPI value for PDF files [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/853402 (https://phabricator.wikimedia.org/T325771)
[19:58:16] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Dzahn) >>! In T325607#8485221, @Soda wrote: > Any idea who might be the best person to contact regarding this ?   I didn't have an individual name....
[20:00:19] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Transfer ownership of Art+Feminism Wikimedians Mailing List to new moderators - https://phabricator.wikimedia.org/T325467 (Dzahn) Is there actually a request for SRE in this?  I think not, it was just auto-tagged SRE by maintenance bot. Let us know if that's not correct and...
[20:01:15] <wikibugs>	 (PS1) BCornwall: admin: Add kelhurd to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/870708 (https://phabricator.wikimedia.org/T323943)
[20:04:08] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 134 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:05:44] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 2 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:13:02] <wikibugs>	 (PS3) Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - https://gerrit.wikimedia.org/r/870660
[20:16:32] <icinga-wm>	 PROBLEM - Disk space on an-launcher1002 is CRITICAL: DISK CRITICAL - free space: / 1514 MB (2% inode=57%): /tmp 1514 MB (2% inode=57%): /var/tmp 1514 MB (2% inode=57%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[20:20:25] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall)
[20:23:25] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) a:SCherukuwada
[20:23:43] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (phaultfinder)
[20:25:46] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (SCherukuwada) I need to do a couple of things to first make sure I have access to Wikisource in search console. As soon as that happens I'll dig ri...
[20:26:32] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:27:06] <wikibugs>	 SRE, ops-eqsin, Traffic, decommission-hardware: decommission cp5001.eqsin.wmnet - https://phabricator.wikimedia.org/T319166 (RobH) Open→Resolved
[20:27:52] <wikibugs>	 ops-eqsin: ManagementSSHDown - https://phabricator.wikimedia.org/T323970 (RobH) Open→Resolved a:RobH invalid due to https://netbox.wikimedia.org/dcim/devices/2188/
[20:28:08] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[20:33:37] <logmsgbot>	 !log robh@cumin2002 START - Cookbook sre.dns.netbox
[20:34:22] <wikibugs>	 (CR) SBassett: [C: +1] admin: Add kelhurd to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/870708 (https://phabricator.wikimedia.org/T323943) (owner: BCornwall)
[20:34:46] <logmsgbot>	 !log robh@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:35:01] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install eqsin refresh - https://phabricator.wikimedia.org/T322048 (RobH)
[20:35:04] <wikibugs>	 SRE, ops-eqsin, DC-Ops, Traffic, Patch-For-Review: Q2:rack/setup/install/decom eqsin: unified decommission task - https://phabricator.wikimedia.org/T323830 (RobH) Open→Resolved a:RobH confirmed all servers on this task are indeed decommissioned in netbox, removed the cable assignm...
[20:36:28] <wikibugs>	 SRE, SRE-OnFire, Infrastructure-Foundations, netops, Sustainability (Incident Followup): Upgrade POPs asw to Junos 21 - https://phabricator.wikimedia.org/T316532 (RobH)
[20:36:34] <wikibugs>	 SRE, ops-eqsin, Infrastructure-Foundations, netops, Wikimedia-Incident: asw1-eqsin: VC mastership change - https://phabricator.wikimedia.org/T323094 (RobH) Open→Stalled I'm setting this to stalled as the upgrade parent task should resolve this issue.
[20:37:10] <icinga-wm>	 RECOVERY - Disk space on an-launcher1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=an-launcher1002&var-datasource=eqiad+prometheus/ops
[20:45:41] <wikibugs>	 SRE, All-and-every-Wikisource, Product-Analytics, SEO: Google not indexing Wikisource properly for years - https://phabricator.wikimedia.org/T325607 (Darwinius) A single URL of a work I recently added to ws.pt, and on which I've been working in, appears to have been noticed by Google: https://pt....
[20:49:22] <wikibugs>	 SRE-tools, Cloud-Services, Infrastructure-Foundations, cloud-services-team (Kanban): Cumin/Openstack: multi-project commands are extremely slow - https://phabricator.wikimedia.org/T325773 (Andrew)
[20:49:33] <wikibugs>	 (PS4) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag [software/cumin] - https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773)
[20:51:30] <wikibugs>	 (CR) Andrew Bogott: Openstack backend: make use of all_tenants nova api flag (1 comment) [software/cumin] - https://gerrit.wikimedia.org/r/869332 (https://phabricator.wikimedia.org/T325773) (owner: Andrew Bogott)
[20:53:44] <wikibugs>	 ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T321719 (phaultfinder)
[20:59:38] <wikibugs>	 (PS1) Bking: query_service: Allow query hosts to rsync data from clouddumps [puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349)
[21:02:51] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/870714 (https://phabricator.wikimedia.org/T222349) (owner: Bking)
[21:03:21] <wikibugs>	 (PS1) JHathaway: g10k cleanup [puppet] - https://gerrit.wikimedia.org/r/870717
[21:04:02] <wikibugs>	 (PS2) JHathaway: g10k cleanup [puppet] - https://gerrit.wikimedia.org/r/870717
[21:10:09] <wikibugs>	 (CR) JHathaway: [C: +2] g10k cleanup [puppet] - https://gerrit.wikimedia.org/r/870717 (owner: JHathaway)
[21:22:06] <wikibugs>	 (CR) Ahmon Dancy: "Other than the readOnly issue, this works fine in train-dev where the pods end up with empty /usr/share/GeoIP* directories because the min" [deployment-charts] - https://gerrit.wikimedia.org/r/870660 (owner: Clément Goubert)
[21:42:48] <wikibugs>	 (PS1) PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/870535
[21:45:28] <wikibugs>	 SRE, Wikimedia-Mailing-lists: Transfer ownership of Art+Feminism Wikimedians Mailing List to new moderators - https://phabricator.wikimedia.org/T325467 (Masssly) No SRE request is needed. Thanks.
[22:08:33] <wikibugs>	 SRE, LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (BCornwall) a:Wangombe→BCornwall
[22:20:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:50:26] <icinga-wm>	 PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[22:51:58] <icinga-wm>	 RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:10:24] <icinga-wm>	 PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid
[23:11:58] <icinga-wm>	 RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid
[23:41:32] <icinga-wm>	 PROBLEM - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5030 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:43:00] <icinga-wm>	 RECOVERY - Ensure traffic_manager binds on 3128 and responds to HTTP requests on cp5030 is OK: HTTP OK: HTTP/1.1 200 Ok - 48238 bytes in 0.923 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[23:45:38] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:50:28] <icinga-wm>	 PROBLEM - BGP status on cr2-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.130 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status