[00:00:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:33] <jinxer-wm>	 (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:01:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:48] <jinxer-wm>	 (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:10:48] <jinxer-wm>	 (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError   - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:16:08] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:16:12] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:38:24] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957813
[00:38:30] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957813 (owner: TrainBranchBot)
[00:42:18] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:50] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:27] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957813 (owner: TrainBranchBot)
[01:15:02] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:42] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:54:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:55:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:58:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:01:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:02:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:06:31] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:37] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:30:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:34:37] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:38:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:57:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:58:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:58:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:03:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:07:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:08:38] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:25:23] <wikibugs>	 SRE, Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (Andrew) Background:  Each host has three dns servers, mdns (which is managed by openstack designate) pdns (auth for the outside world) and pdns-recursor (for VM reques...
[03:54:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:56:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:46:34] <icinga-wm>	 PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:49:24] <icinga-wm>	 RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:56:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:04:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:06:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:15:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:17:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:26:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:48:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:50:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:54:33] <wikibugs>	 (PS1) Ilias Sarantopoulos: alertmanager: create ml team alerts [puppet] - https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151)
[05:59:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:00:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:15:01] <wikibugs>	 (PS3) Ilias Sarantopoulos: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445)
[06:24:19] <wikibugs>	 (PS4) Elukey: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[06:27:07] <wikibugs>	 (PS5) Elukey: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[06:36:31] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:43:01] <wikibugs>	 SRE, Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (SLyngshede-WMF) Open→Invalid Project is in development, see: https://phabricator.wikimedia.org/T189531
[06:53:41] <wikibugs>	 (PS1) Muehlenhoff: Switch sretest1002 to nftables [puppet] - https://gerrit.wikimedia.org/r/958389
[06:57:47] <wikibugs>	 (PS1) Ayounsi: dns: remove mentions of knams [dns] - https://gerrit.wikimedia.org/r/958390 (https://phabricator.wikimedia.org/T344579)
[06:59:20] <wikibugs>	 (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T0700).
[07:00:05] <jouncebot>	 Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] * Aca waves
[07:00:51] <taavi>	 good morning
[07:01:37] <Aca>	 Good morning, taavi! Are you deploying today?
[07:01:40] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/957873 (https://phabricator.wikimedia.org/T346589) (owner: Acamicamacaraca)
[07:01:43] <taavi>	 yes
[07:02:02] <Aca>	 okie, nice, I'm around
[07:02:30] <wikibugs>	 (Merged) jenkins-bot: robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/957873 (https://phabricator.wikimedia.org/T346589) (owner: Acamicamacaraca)
[07:03:56] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:957873|robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki (T346589)]]
[07:04:03] <stashbot>	 T346589: Disable indexing user (talk) pages and draft (talk) pages on shwiki - https://phabricator.wikimedia.org/T346589
[07:04:54] <wikibugs>	 (CR) Majavah: "Please remember that all patches merged to this repository must be pulled down to deploy1002 or the next deployer will get confused due to" [mediawiki-config] - https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[07:09:39] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/957940 (owner: Slyngshede)
[07:13:20] <logmsgbot>	 !log taavi@deploy1002 aleksandar and taavi: Backport for [[gerrit:957873|robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki (T346589)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:13:23] <stashbot>	 T346589: Disable indexing user (talk) pages and draft (talk) pages on shwiki - https://phabricator.wikimedia.org/T346589
[07:13:29] <Aca>	 checking it now via Debug
[07:15:56] <moritzm>	 !log installing clamav security updates
[07:15:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:26] <Aca>	 Seems fine. According to page data, drafts and userpages now have indexing disallowed, which is expected.
[07:17:55] <taavi>	 thanks. syncing
[07:17:57] <logmsgbot>	 !log taavi@deploy1002 aleksandar and taavi: Continuing with sync
[07:18:21] <wikibugs>	 (PS1) Majavah: typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392
[07:21:42] <wikibugs>	 (CR) Ayounsi: [C: +1] typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392 (owner: Majavah)
[07:26:21] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:957873|robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki (T346589)]] (duration: 22m 24s)
[07:26:24] <stashbot>	 T346589: Disable indexing user (talk) pages and draft (talk) pages on shwiki - https://phabricator.wikimedia.org/T346589
[07:26:27] <taavi>	 Aca: your patch is live. it might take a while for search engines to notice and remove any current pages from their indexes, but there's unfortunately not much we can do about that
[07:26:47] <Aca>	 Understandable! Thank you!
[07:26:49] <wikibugs>	 (CR) Majavah: [C: +2] typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392 (owner: Majavah)
[07:27:29] <wikibugs>	 (Merged) jenkins-bot: typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392 (owner: Majavah)
[07:30:30] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:30:58] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:32:26] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:42] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:30] <wikibugs>	 (CR) Jelto: [C: +2] miscweb: add static-codereview to wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[07:35:31] <wikibugs>	 (Merged) jenkins-bot: miscweb: add static-codereview to wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[07:37:59] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (SLyngshede-WMF) Rereading the answer for Juniper:  > For OIDC we’ll need your IDToken which would look like below or the IDP Issuer URL (This URL must be publicly accessible). >   S...
[07:38:06] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:03] <Amir1>	 taavi: I want to rebase a change on deploy1002. Is that fine?
[07:44:21] <taavi>	 Amir1: yes, I'm done deploying
[07:44:37] <wikibugs>	 (CR) Ladsgroup: [C: +2] Enable native MathML on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/958054 (https://phabricator.wikimedia.org/T346584) (owner: Physikerwelt)
[07:44:41] <Amir1>	 awesome
[07:46:39] <wikibugs>	 (Merged) jenkins-bot: Enable native MathML on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/958054 (https://phabricator.wikimedia.org/T346584) (owner: Physikerwelt)
[07:51:26] <wikibugs>	 (CR) Klausman: alertmanager: create ml team alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[07:51:53] <wikibugs>	 (CR) Klausman: [C: +1] profile::thanos: add increase-based rec rules for Istio [puppet] - https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[07:52:24] <wikibugs>	 (CR) Klausman: [C: +1] services: disable Changeprop's ORES Cache stream [deployment-charts] - https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: Elukey)
[07:57:46] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:59] <wikibugs>	 SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (Volans) Yes and no. The wmflib code could be improved to distinguish between a permission error and any other error and raise two differen...
[07:58:06] <Amir1>	 !log running db checksum run in s3 eqiad replicas (T207253)
[07:58:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:12] <stashbot>	 T207253: Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253
[08:01:28] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:02:00] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:14:05] <wikibugs>	 (CR) Abijeet Patro: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[08:14:44] <wikibugs>	 (CR) Kosta Harlan: [C: +1] clienthints: Enable purging of data on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: Dreamy Jazz)
[08:15:09] <wikibugs>	 (CR) Kosta Harlan: [C: +1] clienthints: Pin wgCheckUserDisplayClientHints to false [mediawiki-config] - https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: Dreamy Jazz)
[08:18:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:19:03] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] librenms: refactor ensure_packages [puppet] - https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi)
[08:19:11] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] librenms: use timer name in journal [puppet] - https://gerrit.wikimedia.org/r/957847 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi)
[08:19:14] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] librenms: fix permissions on logs and 'lnms' [puppet] - https://gerrit.wikimedia.org/r/957846 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi)
[08:20:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:24:20] <icinga-wm>	 PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1679.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:04] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[08:25:17] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch sretest1002 to nftables [puppet] - https://gerrit.wikimedia.org/r/958389 (owner: Muehlenhoff)
[08:26:27] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: update revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/958398 (https://phabricator.wikimedia.org/T346445)
[08:28:24] <wikibugs>	 (CR) Vgutierrez: varnishkafka: logrotate should use systemctl to reload rsyslog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/957995 (owner: Fabfur)
[08:28:45] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "Fair enough re: manual removal, LGTM" [puppet] - https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[08:29:57] <wikibugs>	 (PS1) Ilias Sarantopoulos: ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399
[08:31:14] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: support setting owner/group in assemble-config [puppet] - https://gerrit.wikimedia.org/r/957850 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:31:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: snmp-exporter support for assemble-config [puppet] - https://gerrit.wikimedia.org/r/957851 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:31:22] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: use assemble-config for snmp-exporter [puppet] - https://gerrit.wikimedia.org/r/957852 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:32:46] <wikibugs>	 (PS1) Muehlenhoff: Switch ganeti-test to nftables [puppet] - https://gerrit.wikimedia.org/r/958400
[08:33:52] <wikibugs>	 (CR) Fabfur: [V: +1 C: -1] varnishkafka: logrotate should use systemctl to reload rsyslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957995 (owner: Fabfur)
[08:35:02] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: fix old reference to prometheus-snmp-exporter-config [puppet] - https://gerrit.wikimedia.org/r/958401 (https://phabricator.wikimedia.org/T346335)
[08:35:57] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: fix old reference to prometheus-snmp-exporter-config [puppet] - https://gerrit.wikimedia.org/r/958401 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:36:47] <wikibugs>	 (CR) Sergio Gimeno: [C: +1] "lgtm. Curiosity, the access token for beta is set in private/PrivateSettings.php, how is that file handled? Are there any docs around?" [mediawiki-config] - https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: Urbanecm)
[08:40:39] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: update revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/958398 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[08:41:17] <wikibugs>	 (PS2) Slyngshede: gerrit: Link account creation to IDM. [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226)
[08:41:25] <wikibugs>	 (CR) Slyngshede: gerrit: Link account creation to IDM. (3 comments) [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:41:58] <wikibugs>	 (Merged) jenkins-bot: ml-services: update revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/958398 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[08:42:16] <wikibugs>	 (CR) Hashar: [C: +1] "Excellent :)" [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:43:13] <wikibugs>	 (CR) Slyngshede: [C: +2] gerrit: Link account creation to IDM. [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:43:32] <wikibugs>	 (PS2) Fabfur: varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602)
[08:43:42] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/958400 (owner: Muehlenhoff)
[08:43:58] <wikibugs>	 (CR) CI reject: [V: -1] varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[08:44:01] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: let assemble-config write snmp.yml [puppet] - https://gerrit.wikimedia.org/r/958402 (https://phabricator.wikimedia.org/T346335)
[08:45:09] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: let assemble-config write snmp.yml [puppet] - https://gerrit.wikimedia.org/r/958402 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:45:51] <wikibugs>	 (CR) Fabfur: [C: -1] varnishkafka: logrotate should use systemctl to reload rsyslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[08:46:35] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[08:46:56] <wikibugs>	 (PS3) Fabfur: varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602)
[08:47:50] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[08:48:00] <wikibugs>	 SRE, Traffic, Patch-For-Review: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 (Vgutierrez)
[08:48:06] <wikibugs>	 (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[08:48:08] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (Vgutierrez)
[08:50:27] <wikibugs>	 (PS1) Filippo Giunchedi: benthos: more informative processor labels for webrequest [puppet] - https://gerrit.wikimedia.org/r/958403 (https://phabricator.wikimedia.org/T346140)
[08:50:29] <wikibugs>	 (PS1) JMeybohm: kubernetes::master: Switch to PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826)
[08:51:07] <wikibugs>	 SRE, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), Performance Issue: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) Tagging #sre for assistance with this issue, as it is definitely...
[08:53:08] <wikibugs>	 (CR) Vgutierrez: [C: -1] aptrepo: Add Bookworm HAProxy third party repos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[08:55:41] <wikibugs>	 (PS2) Ilias Sarantopoulos: ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399
[08:56:29] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:56:54] <wikibugs>	 (CR) Ayounsi: [C: +2] dns: remove mentions of knams [dns] - https://gerrit.wikimedia.org/r/958390 (https://phabricator.wikimedia.org/T344579) (owner: Ayounsi)
[08:57:51] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:58:59] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399 (owner: Ilias Sarantopoulos)
[08:59:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:02:32] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[09:02:54] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:03:05] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:03:15] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:03:22] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:03:31] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:03:40] <hashar>	 jouncebot: now
[09:03:40] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 56 minute(s)
[09:03:42] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:03:47] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[09:03:51] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' .
[09:03:56] <hashar>	 I am going to merge a change for Flow which only affects tests ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/957872 )
[09:04:44] <wikibugs>	 (PS1) JMeybohm: kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826)
[09:04:46] <wikibugs>	 (PS1) JMeybohm: kubernetes::master: Cleanup absent cergen resource [puppet] - https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826)
[09:05:02] <wikibugs>	 (CR) Hashar: [C: +2] tests: Do not assume UTSysop exists [extensions/Flow] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957872 (https://phabricator.wikimedia.org/T346253) (owner: Urbanecm)
[09:05:18] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[09:06:09] <wikibugs>	 (CR) Elukey: [C: +2] profile::thanos: add increase-based rec rules for Istio [puppet] - https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[09:06:59] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[09:07:10] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43330/console" [puppet] - https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:07:29] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43331/console" [puppet] - https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:07:49] <wikibugs>	 (CR) JMeybohm: [V: +1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43332/console" [puppet] - https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:09:52] <wikibugs>	 (Merged) jenkins-bot: tests: Do not assume UTSysop exists [extensions/Flow] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957872 (https://phabricator.wikimedia.org/T346253) (owner: Urbanecm)
[09:12:39] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399 (owner: Ilias Sarantopoulos)
[09:13:26] <wikibugs>	 (Merged) jenkins-bot: ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399 (owner: Ilias Sarantopoulos)
[09:14:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:15:05] <wikibugs>	 (PS1) Ilias Sarantopoulos: httpbb: fix ml-staging eswikiquote [puppet] - https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445)
[09:15:41] <wikibugs>	 (CR) Stevemunene: [C: +2] datahub: add oidc production settings [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[09:16:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:16:28] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +1] kubernetes::master: Cleanup absent cergen resource [puppet] - https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:16:42] <wikibugs>	 (Merged) jenkins-bot: datahub: add oidc production settings [deployment-charts] - https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[09:18:32] <wikibugs>	 (CR) KartikMistry: [C: +1] Enable MinT translation service on Meta-Wiki - rollout #5 [mediawiki-config] - https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[09:21:18] <wikibugs>	 (CR) Elukey: [C: +1] kubernetes::master: Switch to PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:23:57] <wikibugs>	 (CR) Elukey: [C: +1] kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:24:29] <wikibugs>	 (CR) Elukey: [C: +1] kubernetes::master: Cleanup absent cergen resource [puppet] - https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: JMeybohm)
[09:24:57] <wikibugs>	 (CR) Ayounsi: [C: +1] Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) (owner: Cathal Mooney)
[09:25:09] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
[09:25:10] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:957872|tests: Do not assume UTSysop exists (T346253)]]
[09:25:14] <stashbot>	 T346253: CannotCreateActorException: Cannot create an actor for a usable name that is not an existing user: user_name="U" - https://phabricator.wikimedia.org/T346253
[09:25:37] <wikibugs>	 (CR) Elukey: httpbb: fix ml-staging eswikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[09:25:40] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:26:36] <wikibugs>	 (PS2) Ilias Sarantopoulos: httpbb: fix ml-staging eswikiquote [puppet] - https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445)
[09:26:43] <logmsgbot>	 !log hashar@deploy1002 hashar and urbanecm: Backport for [[gerrit:957872|tests: Do not assume UTSysop exists (T346253)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[09:27:20] <wikibugs>	 (CR) Ilias Sarantopoulos: httpbb: fix ml-staging eswikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[09:27:22] <logmsgbot>	 !log hashar@deploy1002 hashar and urbanecm: Continuing with sync
[09:28:36] <godog>	 !log set max-repeaters for cr3-eqsin in librenms - T346606
[09:28:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:28:40] <stashbot>	 T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606
[09:28:43] <godog>	 !log set max-repeaters to 20 for cr3-eqsin in librenms - T346606
[09:28:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:29:42] <wikibugs>	 (CR) Cathal Mooney: [C: +2] Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) (owner: Cathal Mooney)
[09:30:17] <wikibugs>	 (Merged) jenkins-bot: Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) (owner: Cathal Mooney)
[09:30:17] <icinga-wm>	 PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:30:52] <wikibugs>	 (CR) Alexandros Kosiaris: [C: +1] "\ο/" [deployment-charts] - https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: Elukey)
[09:31:59] <fabfur>	 !log disabled puppet on cp4052 for T346602
[09:32:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:32:12] <stashbot>	 T346602: VarnishKafka logrotate fails on bookworm  - https://phabricator.wikimedia.org/T346602
[09:34:17] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:957872|tests: Do not assume UTSysop exists (T346253)]] (duration: 09m 06s)
[09:34:17] <icinga-wm>	 RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27458 bytes in 0.235 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:34:21] <stashbot>	 T346253: CannotCreateActorException: Cannot create an actor for a usable name that is not an existing user: user_name="U" - https://phabricator.wikimedia.org/T346253
[09:38:53] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
[09:39:56] <fabfur>	 !log enabled puppet on cp4052 for T346602
[09:39:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:59] <stashbot>	 T346602: VarnishKafka logrotate fails on bookworm  - https://phabricator.wikimedia.org/T346602
[09:40:03] <fabfur>	 !log disabled puppet on cp4050 for T346602
[09:40:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:17] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:42:17] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:42:54] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] ":D" [deployment-charts] - https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: Elukey)
[09:43:10] <wikibugs>	 (CR) Elukey: [C: +2] services: disable Changeprop's ORES Cache stream [deployment-charts] - https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: Elukey)
[09:43:26] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[09:43:27] <elukey>	 jouncebot: next
[09:43:27] <jouncebot>	 In 0 hour(s) and 16 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1000)
[09:43:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:43:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:44:15] <fabfur>	 !log enabled puppet on cp4050 for T346602
[09:44:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:09] <wikibugs>	 (PS1) Arnaudb: icinga: add my arnaudb [puppet] - https://gerrit.wikimedia.org/r/957815
[09:45:59] <wikibugs>	 SRE, Traffic, Patch-For-Review: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 (Fabfur) Cannot test the actual change with PCC but tested on two different hosts (cp4050 && cp4052) to check behavior.  The new logrotate configuration actually seems to rotate correc...
[09:46:05] <wikibugs>	 (PS2) Arnaudb: icinga: add arnaudb to userlist [puppet] - https://gerrit.wikimedia.org/r/957815
[09:46:37] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:46:44] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[09:46:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[09:47:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:48:32] <wikibugs>	 SRE, Data-Persistence, observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (jcrespo)
[09:48:42] <wikibugs>	 (CR) Fabfur: "This has been tested on 2 different hosts in production (cp4050 and cp4052) and the behavior is the expected one." [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[09:48:57] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:49:01] <wikibugs>	 (PS3) Arnaudb: icinga: add arnaudb to userlist [puppet] - https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610)
[09:49:33] <wikibugs>	 (CR) Fabfur: [C: +2] varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[09:49:37] <wikibugs>	 SRE, Data-Persistence, observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (jcrespo) a:ABran-WMF
[09:49:42] <wikibugs>	 (CR) CI reject: [V: -1] icinga: add arnaudb to userlist [puppet] - https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: Arnaudb)
[09:49:42] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[09:49:46] <wikibugs>	 SRE, Data-Persistence, observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (jcrespo) p:Triage→High
[09:49:48] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me." [puppet] - https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[09:49:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[09:49:57] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly
[09:50:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[09:50:27] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0)
[09:50:48] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[09:50:48] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite
[09:50:50] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0)
[09:51:48] <elukey>	 kamila_: o/ are you testing? Wondering if it is ok for me to continue deploying changeprop or not
[09:52:15] <kamila_>	 elukey: ah, sorry, I'm done now
[09:52:47] <kamila_>	 (did you also want to make use of the space between deployment windows? :D) 
[09:53:13] <elukey>	 :D
[09:53:17] <elukey>	 okok proceeding
[09:53:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:53:59] <icinga-wm>	 PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (207534s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static
[09:54:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[09:54:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[09:55:01] <wikibugs>	 (CR) Brouberol: [C: +2] Configure kafka-jumbo1010.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[09:55:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:56:02] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[09:56:58] <wikibugs>	 (CR) Vgutierrez: [C: -1] "ticket-test.wm.o isn't a valid SAN on the backend TLS certificate" [puppet] - https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: AOkoth)
[09:58:31] <wikibugs>	 (PS1) Arturo Borrero Gonzalez: dbutils: introduce statement define [puppet] - https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603)
[09:59:04] <wikibugs>	 (PS2) Arturo Borrero Gonzalez: dbutils: introduce statement define [puppet] - https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603)
[09:59:31] <wikibugs>	 (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603) (owner: Arturo Borrero Gonzalez)
[09:59:33] <wikibugs>	 (CR) Btullis: [C: +1] "Adding jbond and Muehlenhoff for visibility." [puppet] - https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[09:59:53] <elukey>	 !log remove ores-cache stream from changeprop (side effects - higher ORES client latencies, no mediawiki.revision-score event stream published) - https://phabricator.wikimedia.org/T342116
[09:59:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1000)
[10:03:06] <wikibugs>	 (CR) Elukey: [C: +2] httpbb: fix ml-staging eswikiquote (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[10:03:28] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [C: +2] dbutils: introduce statement define [puppet] - https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603) (owner: Arturo Borrero Gonzalez)
[10:05:58] <wikibugs>	 (PS1) Jelto: trafficserver: switch static-codereview.wikimedia.org to wikikube [puppet] - https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309)
[10:06:38] <wikibugs>	 (CR) Elukey: [C: +1] benthos: more informative processor labels for webrequest [puppet] - https://gerrit.wikimedia.org/r/958403 (https://phabricator.wikimedia.org/T346140) (owner: Filippo Giunchedi)
[10:07:24] <wikibugs>	 (PS3) Kamila Součková: wmnet: switch deployment CNAMEs to codfw [dns] - https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330)
[10:07:43] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] benthos: more informative processor labels for webrequest [puppet] - https://gerrit.wikimedia.org/r/958403 (https://phabricator.wikimedia.org/T346140) (owner: Filippo Giunchedi)
[10:10:34] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[10:15:01] <wikibugs>	 (CR) Vgutierrez: [C: -1] "please check the TLS material used in the backend side:" [puppet] - https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[10:15:17] <wikibugs>	 (PS2) Ilias Sarantopoulos: Lower ores.wikimedia.org's TTL to 5M [dns] - https://gerrit.wikimedia.org/r/957689 (owner: Elukey)
[10:16:26] <wikibugs>	 (PS6) Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041)
[10:16:28] <wikibugs>	 (PS6) Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041)
[10:16:30] <wikibugs>	 (PS6) Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041)
[10:16:32] <wikibugs>	 (PS6) Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041)
[10:16:34] <wikibugs>	 (PS6) Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[10:16:36] <wikibugs>	 (PS1) Brouberol: Fix kafka-jumbo node regular expression [puppet] - https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041)
[10:17:52] <wikibugs>	 (CR) Vgutierrez: [C: +1] Lower ores.wikimedia.org's TTL to 5M (1 comment) [dns] - https://gerrit.wikimedia.org/r/957689 (owner: Elukey)
[10:19:52] <wikibugs>	 (CR) Brouberol: "I attempted to fix the node regular expression by getting rid of the `[01-10]` range that didn't seem to work, and rebased all other chang" [puppet] - https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[10:19:58] <wikibugs>	 (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[10:20:11] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:20:45] <wikibugs>	 (CR) Stevemunene: [C: +2] idp: add datahub as oidc service [puppet] - https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874) (owner: Stevemunene)
[10:28:02] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on P{maps200[5,6].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad)
[10:28:45] <wikibugs>	 (PS2) Brouberol: Fix kafka-jumbo node regular expression [puppet] - https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041)
[10:28:47] <wikibugs>	 (PS7) Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041)
[10:28:49] <wikibugs>	 (PS7) Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041)
[10:28:51] <wikibugs>	 (PS7) Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041)
[10:28:53] <wikibugs>	 (PS7) Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041)
[10:28:55] <wikibugs>	 (PS7) Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[10:29:07] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:29:21] <wikibugs>	 (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[10:29:56] <wikibugs>	 (CR) CI reject: [V: -1] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[10:33:02] <wikibugs>	 (CR) Kamila Součková: [C: -2] "adding -2 for now to avoid accidental merge" [dns] - https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: Kamila Součková)
[10:33:17] <wikibugs>	 (CR) Kamila Součková: "adding -2 for now to avoid accidental merge" [puppet] - https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: Kamila Součková)
[10:33:53] <wikibugs>	 (CR) Kamila Součková: [C: -2] "adding -2 for now to avoid accidental merge (for real this time :D)" [puppet] - https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: Kamila Součková)
[10:33:58] <godog>	 !log set max-repeaters to 20 for cr3-eqsin using "force save" - T346606
[10:34:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:02] <stashbot>	 T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606
[10:34:04] <wikibugs>	 (CR) Btullis: [C: +1] Fix kafka-jumbo node regular expression [puppet] - https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[10:36:10] <wikibugs>	 (CR) Jelto: "looping in jayme for some more ingress certificate insights." [puppet] - https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[10:36:31] <jinxer-wm>	 (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:36:38] <wikibugs>	 (CR) Filippo Giunchedi: "CI isn't happy about the commit message, config change LGTM though" [puppet] - https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: Arnaudb)
[10:40:29] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main
[10:41:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:41:30] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:42:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:42:25] <wikibugs>	 SRE, Cloud-VPS, Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (fnegri) @M2k_dewiki the Kubernetes pod was stuck, I restarted it manually with `webservice stop` followed by `webservice start`, and https://templatetransclusionc...
[10:42:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on P{maps200[5,6].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad)
[10:42:28] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:42:55] <wikibugs>	 (CR) Vgutierrez: [C: +1] trafficserver: switch static-codereview.wikimedia.org to wikikube (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[10:44:33] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main
[10:44:54] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:44:54] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main
[10:46:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on P{maps200[7,8].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad)
[10:46:48] <wikibugs>	 (03PS3) 10Brouberol: Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041)
[10:46:50] <wikibugs>	 (03PS8) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041)
[10:46:52] <wikibugs>	 (03PS8) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041)
[10:46:54] <wikibugs>	 (03PS8) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041)
[10:46:56] <wikibugs>	 (03PS8) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041)
[10:46:58] <wikibugs>	 (03PS8) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[10:47:06] <wikibugs>	 10SRE, 10Traffic: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 (10Vgutierrez) 05Open→03Resolved a:03Fabfur
[10:47:14] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Vgutierrez)
[10:47:48] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Aklapper) >>! In T335879#9173531, @Volans wrote: > The wmflib code could be improved to distinguish between a permission error and any oth...
[10:48:23] <logmsgbot>	 !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main
[10:49:18] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[10:59:12] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Volans) @Aklapper What I meant is that there is no way to distinguish between the "no access" error and any other error that could be a mi...
[11:01:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on P{maps200[7,8].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad)
[11:05:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on P{maps201[0].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad)
[11:13:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on P{maps201[0].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad)
[11:14:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on A:maps-replica-eqiad
[11:15:46] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Aklapper) >>! In T335879#9174123, @Volans wrote: > It's just the message that differ, that is something wmflib should not rely on because...
[11:16:49] <wikibugs>	 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[11:23:21] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Move Ganeti to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/957865 (owner: 10Muehlenhoff)
[11:30:50] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch ganeti-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/958400
[11:32:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:34:13] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958400 (owner: 10Muehlenhoff)
[11:34:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:37:08] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1005.wikimedia.org
[11:42:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:43:00] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Switch to PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[11:43:08] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:44:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:44:41] <jayme>	 !log removed cergen certs from the list of trusted service account token signers on all kubernetes clusters - T329826
[11:44:42] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.dns.netbox
[11:44:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:44] <stashbot>	 T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826
[11:45:09] <logmsgbot>	 !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1005 - aborrero@cumin1001"
[11:45:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:46:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:46:02] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1005 - aborrero@cumin1001"
[11:46:02] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:46:11] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:46:12] <logmsgbot>	 !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudservices1005.wikimedia.org
[11:47:05] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: `cloudservices1005.wikimedia.org` - cloudservices10...
[11:50:35] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero)
[11:53:30] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1140.eqiad.wmnet with OS bullseye
[11:54:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on A:maps-replica-eqiad
[11:54:38] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST revisions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:56:22] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43333/console" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[11:56:33] <wikibugs>	 (03PS7) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648)
[11:58:03] <wikibugs>	 (03PS1) 10Kamila Součková: db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474)
[11:58:19] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] trafficserver: switch static-codereview.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto)
[11:58:31] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[11:59:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (LIST clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[12:00:08] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) Thanks!, still blocked on @thcipriani for deployment group membership
[12:00:58] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43334/console" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[12:02:08] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Ensure that logs are created with correct permissions. [puppet] - 10https://gerrit.wikimedia.org/r/957940 (owner: 10Slyngshede)
[12:02:38] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463
[12:02:40] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:03:12] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] "adding -2 for now to avoid accidental merge" [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:03:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye
[12:03:36] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474)
[12:04:02] <wikibugs>	 (03CR) 10Marostegui: [C: 04-2] "To be pushed after the switch" [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui)
[12:05:53] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Kamila the reason why es1, es2 and es3 aren't in Orchestrator is because they are standalone hosts and orchestrator doesn't support that (" [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:06:38] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: drop references to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042)
[12:06:41] <wikibugs>	 (03PS4) 10Arnaudb: icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610)
[12:07:09] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1140.eqiad.wmnet with reason: host reimage
[12:07:38] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[12:07:43] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: drop references to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:08:03] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1141.eqiad.wmnet with OS bullseye
[12:08:15] <wikibugs>	 (03CR) 10Majavah: openstack: drop references to cloudcontrol1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:08:19] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] db: Switch DNS master alias to codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:10:11] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1140.eqiad.wmnet with reason: host reimage
[12:13:05] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: remove overrides for designate_hosts [puppet] - 10https://gerrit.wikimedia.org/r/958467 (https://phabricator.wikimedia.org/T346042)
[12:13:32] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] openstack: remove overrides for designate_hosts [puppet] - 10https://gerrit.wikimedia.org/r/958467 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:13:34] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: drop references to cloudcontrol1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:13:45] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: remove overrides for designate_hosts [puppet] - 10https://gerrit.wikimedia.org/r/958467 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[12:18:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[12:19:02] <wikibugs>	 (03PS2) 10Clément Goubert: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[12:19:11] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[12:20:47] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "Those es1-es3 aren't correct, but we don't really use them that much as "master" as they are all masters really." [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:21:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage
[12:21:30] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (10fnegri) 05Open→03Resolved a:03fnegri This was fixed by @taavi in https://gerrit.wikimedia.org/r/c...
[12:21:36] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1141.eqiad.wmnet with reason: host reimage
[12:23:42] <moritzm>	 !log installing libwebp security updates on bullseye
[12:23:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:23:54] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey)
[12:24:00] <logmsgbot>	 !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1140.eqiad.wmnet with OS bullseye
[12:24:34] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1141.eqiad.wmnet with reason: host reimage
[12:26:41] <wikibugs>	 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from CloudVPS instances - https://phabricator.wikimedia.org/T343335 (10fnegri) 05Open→03Resolved a:03fnegri Similarly to T343336, this was also fixed by @taav...
[12:27:15] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: furud.codfw.wmnet
[12:27:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: furud.codfw.wmnet
[12:28:40] <wikibugs>	 (03CR) 10Jcrespo: "Let me know what do you think for an amend 😊" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[12:29:31] <wikibugs>	 (03PS2) 10Kamila Součková: db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474)
[12:29:50] <wikibugs>	 (03CR) 10Kamila Součková: db: Switch DNS master alias to codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:30:39] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[12:32:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all
[12:33:16] <wikibugs>	 (03CR) 10Peter Fischer: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[12:34:12] <wikibugs>	 (03CR) 10Peter Fischer: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[12:36:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:36:38] <wikibugs>	 (03CR) 10Fabfur: [V: 03+2 C: 03+2] add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur)
[12:36:50] <wikibugs>	 (03CR) 10Brouberol: "Seems like pcc can't run on the kafka-jumbo hosts given that the previous change request broke the node -> role assignment." [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[12:37:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:37:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye
[12:37:58] <wikibugs>	 (03PS1) 10Kamila Součková: wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474)
[12:38:39] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] "adding -2 for now to avoid accidental merge" [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:40:46] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:41:03] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "\o/" [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková)
[12:41:22] <wikibugs>	 (03PS1) 10JMeybohm: chromium-render: Update to use certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033)
[12:42:10] <wikibugs>	 (03PS2) 10JMeybohm: Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033)
[12:43:07] <wikibugs>	 (03PS1) 10Jelto: miscweb/microsites: move monitoring of static-codereview to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/958474 (https://phabricator.wikimedia.org/T346309)
[12:43:09] <wikibugs>	 (03PS1) 10Jelto: miscweb/microsites: remove static-codereview resources [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309)
[12:44:48] <wikibugs>	 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) Thanks for the context @Andrew, I was thinking it was something like that thanks for filling in the gaps.  I guess the big question I have is there any way to...
[12:46:42] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) hey @Jclark-ctr (or @VRiley-WMF) this server should be ready to be re-racked into rack `D5`.
[12:47:51] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1141.eqiad.wmnet with OS bullseye
[12:48:08] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all
[12:48:17] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila)
[12:48:33] <wikibugs>	 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) >>! In T346385#9173361, @Andrew wrote: > That 10. address is in the current pool config. It's probably wrong, but also everything is changing constantly so I'...
[12:52:42] <wikibugs>	 (03CR) 10Gmodena: cirrus: add the mediawiki.cirrussearch.page_rerender stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[12:54:07] <wikibugs>	 (03PS4) 10Anzx: add extranamespacenames for kannada-kn language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958050 (https://phabricator.wikimedia.org/T346583)
[12:54:09] <wikibugs>	 (03PS1) 10JMeybohm: Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033)
[12:54:31] <wikibugs>	 (03PS2) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582)
[12:54:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[12:55:56] <wikibugs>	 (03PS2) 10JMeybohm: Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1300).
[13:00:05] <jouncebot>	 cormacparle and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:20] * cormacparle waves
[13:01:51] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:01:55] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:02:17] <godog>	 !log set max-repeaters to 30 for cr3-eqsin in librenms - T346606
[13:02:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:02:29] <stashbot>	 T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606
[13:02:50] <taavi>	 hey. I can deploy
[13:02:55] <wikibugs>	 (03PS5) 10Majavah: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle)
[13:02:55] <aanzx>	 o/
[13:03:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:03:11] <wikibugs>	 (03PS1) 10Fabfur: add simple Makefile [software/purged] - 10https://gerrit.wikimedia.org/r/958477
[13:03:15] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (10kamila)
[13:03:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle)
[13:04:06] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:04:14] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:04:25] <wikibugs>	 (03Merged) 10jenkins-bot: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle)
[13:04:27] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:04:42] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:955967|Disable UploadWizard CTA for MachineVision (T345187)]]
[13:04:53] <stashbot>	 T345187: [Spike] Figure out what's involved in turning MachineVision off - https://phabricator.wikimedia.org/T345187
[13:06:07] <logmsgbot>	 !log taavi@deploy1002 taavi and cparle: Backport for [[gerrit:955967|Disable UploadWizard CTA for MachineVision (T345187)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:06:16] <taavi>	 cormacparle: please test your patch
[13:06:23] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957996 (https://phabricator.wikimedia.org/T346472) (owner: 10Kamila Součková)
[13:06:24] <cormacparle>	 👍
[13:07:38] <wikibugs>	 (03PS1) 10Muehlenhoff: sre.maps.reboot: Retire legacy cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855)
[13:08:37] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:09:06] <cormacparle>	 taavi: seems fine, thank you
[13:09:11] <taavi>	 thanks, syncing
[13:09:13] <logmsgbot>	 !log taavi@deploy1002 taavi and cparle: Continuing with sync
[13:09:17] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM, kind reminder to follow after merging:" [cookbooks] - 10https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff)
[13:10:06] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:10:43] <taavi>	 aanzx: hey. backports (changes to a mediawiki/* repository) generally should have a +2 before being scheduled for deployment (and preferably go out via the train, unless they're particularly urgent). I'm not familiar with the languages system, but I added some people as reviewers
[13:11:18] <vgutierrez>	 !log depool cp4052 for bookworm testing - T342154
[13:11:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:11:30] <stashbot>	 T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154
[13:12:28] <taavi>	 aanzx: for the last patch, why is wmgUseWikidataPageBanner being changed?
[13:12:35] <aanzx>	 taavi: ok, then the https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/958050 patch can also be scheduled for deployment later
[13:15:59] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955967|Disable UploadWizard CTA for MachineVision (T345187)]] (duration: 11m 16s)
[13:16:02] <stashbot>	 T345187: [Spike] Figure out what's involved in turning MachineVision off - https://phabricator.wikimedia.org/T345187
[13:16:15] <cormacparle>	 thanks taavi !
[13:16:53] <aanzx>	 taavi: for enabling the Minerva site notice, wmgUseWikidataPageBanner is not required?
[13:18:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:19:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:22:36] <wikibugs>	 (03PS3) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582)
[13:22:42] <taavi>	 aanzx: I don't know. from the comment it seems like WikidataPageBanner requires $wgMinervaEnableSiteNotice to be true, but I don't know about the opposite. it'd be a good idea to ask someone who knows
[13:22:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx)
[13:24:17] <aanzx>	 taavi: ok i will schedule this patch also for later
[13:24:24] <taavi>	 thanks!
[13:24:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public
[13:25:28] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[13:25:52] <wikibugs>	 (03Abandoned) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx)
[13:26:30] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public
[13:29:43] <wikibugs>	 (03PS1) 10JMeybohm: Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033)
[13:29:45] <wikibugs>	 (03CR) 10Elukey: alertmanager: create ml team alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[13:29:59] <wikibugs>	 (03CR) 10Nikerabbit: [C: 03+1] Enable MinT translation service on Meta-Wiki - rollout #5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro)
[13:31:54] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "looks good. I'm also not sure about the additional SANs. I'd also think the additional SANs are terminated by the ingress. But I can test " [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[13:37:51] <Kizule>	 Hi, is the UTC afternoon backport window still in progress?
[13:38:53] <godog>	 !log force-set max-repeaters to 20 for cr2-eqsin and cr3-eqsin - T346606
[13:38:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:38:57] <wikibugs>	 (03CR) 10Herron: [V: 03+1] "are we still in a holding pattern on this one?" [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron)
[13:38:57] <stashbot>	 T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606
[13:39:06] <wikibugs>	 (03PS1) 10Muehlenhoff: Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497)
[13:39:30] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:42:50] <Kizule>	 Never mind, I'll add my patch to the next one.
[13:43:58] <wikibugs>	 (03PS2) 10Muehlenhoff: Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497)
[13:44:21] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "No we can go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron)
[13:45:54] <wikibugs>	 (03PS2) 10Zoranzoki21: Enable WikiLove on arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391)
[13:46:14] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm
[13:47:13] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet
[13:48:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[13:48:38] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[13:51:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet
[13:56:55] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage
[13:57:30] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr) @aborrero. I have moved server physically and in netbox.  i did not delete any interfaces out of netbox  new Cableid. 20220117  port 4...
[13:57:36] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet
[14:00:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10CDanis) 05Open→03Declined >>! In T252890#9165519, @ayounsi wrote: > @CDanis  Is that still needed now that we have NEL?  It would be interesting t...
[14:00:29] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] wikireplicas: restore pybal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/924508 (https://phabricator.wikimedia.org/T337446) (owner: 10BBlack)
[14:00:38] <wikibugs>	 (03PS1) 10JMeybohm: Copy mesh.certificate_1.0.0 to 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/958482 (https://phabricator.wikimedia.org/T300033)
[14:00:40] <wikibugs>	 (03PS1) 10JMeybohm: mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033)
[14:00:42] <wikibugs>	 (03PS1) 10JMeybohm: Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033)
[14:01:31] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet
[14:01:37] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage
[14:01:47] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet
[14:02:52] <wikibugs>	 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T346450 (10Jhancock.wm) a:03Jhancock.wm
[14:04:17] <bblack>	 !log lvs1020, lvs1018: restarting pybal to re-enable healthchecks for wikireplicas ( T337446 -> https://gerrit.wikimedia.org/r/924508 )
[14:04:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:04:28] <stashbot>	 T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446
[14:05:41] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet
[14:06:05] <wikibugs>	 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T346387 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated powersupply  cleared fault
[14:06:31] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:07:26] <wikibugs>	 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr)
[14:08:29] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:09:06] <wikibugs>	 (03PS1) 10Btullis: Add an nginx rule to block scripts from repositorygroup paths [puppet] - 10https://gerrit.wikimedia.org/r/958486 (https://phabricator.wikimedia.org/T318962)
[14:09:22] <wikibugs>	 (03Merged) 10jenkins-bot: Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:09:39] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:33] <wikibugs>	 (03PS3) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890)
[14:13:15] <wikibugs>	 (03PS4) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638)
[14:13:18] <wikibugs>	 (03CR) 10DCausse: rdf-streaming-updater: start adding per-env ZK path root (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking)
[14:13:20] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[14:13:34] <wikibugs>	 (03CR) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[14:15:02] <wikibugs>	 (03PS3) 10Elukey: modules: copy configuration 1.4.1 to 1.5.0 for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956440 (https://phabricator.wikimedia.org/T346638)
[14:15:13] <wikibugs>	 (03PS4) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638)
[14:15:19] <logmsgbot>	 !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[14:15:28] <wikibugs>	 (03PS5) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638)
[14:16:31] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:55] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709)
[14:17:57] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709)
[14:18:21] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709)
[14:18:34] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm
[14:18:45] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS bookworm
[14:19:37] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:20:20] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] "deploy in staging looks good, proceeding with codfw and eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[14:20:52] <wikibugs>	 (03PS2) 10Fabfur: add simple Makefile and README [software/purged] - 10https://gerrit.wikimedia.org/r/958477
[14:21:33] <wikibugs>	 (03PS3) 10Fabfur: add simple Makefile and README [software/purged] - 10https://gerrit.wikimedia.org/r/958477
[14:21:41] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[14:22:14] <icinga-wm>	 PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:23:20] <icinga-wm>	 PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[14:24:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[14:26:31] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:26:44] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[14:27:40] <wikibugs>	 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T346450 (10Jhancock.wm) 05Open→03Resolved
[14:27:59] <wikibugs>	 (03PS1) 10Jclark-ctr: add dbstore1008-1009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862)
[14:28:46] <wikibugs>	 (03CR) 10Jclark-ctr: [C: 03+2] add dbstore1008-1009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) (owner: 10Jclark-ctr)
[14:29:04] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS bullseye
[14:29:58] <logmsgbot>	 !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[14:32:17] <wikibugs>	 (03PS5) 10Arnaudb: icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610)
[14:32:51] <jelto>	 !log use certmanager instead of certgen in miscweb namespace - T300033
[14:32:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:54] <stashbot>	 T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033
[14:34:13] <wikibugs>	 (03CR) 10Jcrespo: icinga: add arnaudb to userlist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[14:38:30] <logmsgbot>	 !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage
[14:39:44] <wikibugs>	 (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[14:41:14] <wikibugs>	 10SRE, 10Phabricator, 10Security-Team, 10SecTeam-Processed, and 2 others: Require 2FA for members of acl*sre-team - https://phabricator.wikimedia.org/T328746 (10sbassett) 05In progress→03Resolved a:03Reedy >>! In T328746#9171419, @RLazarus wrote: > I don't have edit access to #acl_security.  Thanks f...
[14:41:36] <logmsgbot>	 !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage
[14:41:58] <wikibugs>	 10SRE, 10Phabricator, 10Security-Team, 10SecTeam-Processed, and 2 others: Require 2FA for members of acl*sre-team - https://phabricator.wikimedia.org/T328746 (10sbassett)
[14:42:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Milimetric) approved  (sorry this slipped through)
[14:42:44] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage
[14:44:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[14:44:52] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[14:45:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye
[14:45:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye
[14:45:42] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye
[14:45:46] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage
[14:45:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[14:45:54] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[14:47:56] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS bullseye
[14:52:39] <wikibugs>	 (03PS2) 10Ilias Sarantopoulos: alertmanager: create ml team alerts [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151)
[14:54:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye
[14:54:25] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye
[14:54:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with OS bullseye
[14:54:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10...
[14:55:02] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: alertmanager: create ml team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[14:57:02] <wikibugs>	 (03PS1) 10Brouberol: [eventgate-analytics-external] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041)
[14:57:04] <wikibugs>	 (03PS1) 10Brouberol: [eventgate-analytics] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958497 (https://phabricator.wikimedia.org/T33604)
[14:57:06] <wikibugs>	 (03PS1) 10Brouberol: [eventstream-internal] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958498 (https://phabricator.wikimedia.org/T336041)
[14:57:08] <wikibugs>	 (03PS1) 10Brouberol: [mw-page-content-change-enrich] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958499 (https://phabricator.wikimedia.org/T336041)
[14:58:18] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:58:58] <wikibugs>	 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API not returning latest page when queried title is a redirect - https://phabricator.wikimedia.org/T346579 (10akosiaris) I 'll admit I am a bit stumped here. This is clearly not the CDN's fault as RESTBase exhibits the same behavior while also violating wh...
[15:01:23] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage
[15:03:18] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:04:20] <wikibugs>	 (03PS3) 10AOkoth: vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027)
[15:04:27] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage
[15:04:38] <icinga-wm>	 RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:04:44] <icinga-wm>	 RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[15:11:21] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1142.eqiad.wmnet with OS bullseye
[15:13:56] <Emperor>	 !log upload swift_2.26.0-10+deb11u1+wmf1_amd64.changes to apt1001
[15:13:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:18:56] <Emperor>	 !log depool ms-fe2009 to install new swift packages
[15:18:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:21:56] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-stats_mw-media.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:23:19] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add an nginx rule to block scripts from repositorygroup paths [puppet] - 10https://gerrit.wikimedia.org/r/958486 (https://phabricator.wikimedia.org/T318962) (owner: 10Btullis)
[15:24:47] <wikibugs>	 (03PS9) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041)
[15:24:49] <wikibugs>	 (03PS9) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041)
[15:24:51] <wikibugs>	 (03PS9) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041)
[15:24:53] <wikibugs>	 (03PS9) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041)
[15:24:55] <wikibugs>	 (03PS9) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[15:25:21] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "@Filippo (godog): when testing the change, we can see the new user on the config file, but I think we were bitten (again) by https://phabr" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[15:25:48] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:25:55] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1143.eqiad.wmnet with OS bullseye
[15:26:41] <Emperor>	 !log repool ms-fe2009 with new swift packages
[15:26:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:27:14] <Emperor>	 !log install new swift packages on ms-be2044
[15:27:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:28:35] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] icinga: add arnaudb to userlist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[15:28:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1036
[15:29:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye
[15:29:39] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye
[15:29:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye
[15:29:46] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye
[15:29:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1036
[15:30:05] <jouncebot>	 jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1530).
[15:30:45] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage
[15:30:47] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage
[15:31:12] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958512 (https://phabricator.wikimedia.org/T128546)
[15:31:25] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958512 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:31:56] <wikibugs>	 (03CR) 10SBassett: Allow FundraiseUp scripts in Donatewiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg)
[15:32:15] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958512 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[15:32:52] <wikibugs>	 (03PS2) 10Brouberol: [eventgate-analytics] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958497 (https://phabricator.wikimedia.org/T336041)
[15:32:54] <wikibugs>	 (03PS2) 10Brouberol: [eventstream-internal] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958498 (https://phabricator.wikimedia.org/T336041)
[15:32:56] <wikibugs>	 (03PS2) 10Brouberol: [mw-page-content-change-enrich] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958499 (https://phabricator.wikimedia.org/T336041)
[15:34:51] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage
[15:36:52] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage
[15:40:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:43:34] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS bullseye
[15:44:29] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:958512| Bumping portals to master (T128546)]] (duration: 08m 45s)
[15:44:32] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:45:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:45:07] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10VRiley-WMF) pc1016 - C 6. U 31. port 30 CableID 3252
[15:46:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:49:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:50:03] <jinxer-wm>	 (KubernetesAPILatency) firing: (5) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:51:21] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:51:54] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye
[15:51:56] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron)
[15:52:02] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye
[15:53:00] <logmsgbot>	 !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:958512| Bumping portals to master (T128546)]] (duration: 08m 31s)
[15:53:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:53:01] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1038.eqiad.wmnet with OS bullseye
[15:53:04] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage
[15:53:04] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[15:53:11] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye completed: - kubernetes1038 (**PAS...
[15:53:29] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:55:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:55:27] <wikibugs>	 (CR) Btullis: [C: +1] [eventgate-analytics-external] Add kafka-jumbo1010 to config [deployment-charts] - https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[15:56:14] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage
[15:57:01] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage
[15:57:04] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[15:57:10] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1047.eqiad.wmnet with OS bullseye
[15:57:18] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye completed: - kubernetes1047 (**PAS...
[15:58:27] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:59:29] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:59:52] <wikibugs>	 ops-codfw, User-aborrero, cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (aborrero)
[15:59:59] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[16:00:31] <wikibugs>	 ops-codfw, User-aborrero, cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (aborrero)
[16:01:10] <wikibugs>	 ops-codfw, User-aborrero, cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (aborrero)
[16:01:20] <wikibugs>	 SRE, ops-codfw, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (aborrero)
[16:01:37] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage
[16:02:09] <wikibugs>	 ops-codfw, User-aborrero, cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (aborrero)
[16:02:35] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1145.eqiad.wmnet with OS bullseye
[16:03:16] <wikibugs>	 SRE, ops-codfw, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (aborrero)
[16:05:53] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[16:09:20] <wikibugs>	 SRE, ops-eqiad, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (aborrero)
[16:10:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (aborrero)
[16:11:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:11:51] <wikibugs>	 SRE, ops-eqiad, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (aborrero)
[16:12:13] <wikibugs>	 SRE, ops-eqiad, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (aborrero)
[16:12:59] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.61.1" for 601 hosts
[16:13:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[16:13:33] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1036.eqiad.wmnet with OS bullseye
[16:13:40] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye completed: - kubernetes1036 (**PAS...
[16:14:07] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.61.1" completed for 601 hosts
[16:14:46] <wikibugs>	 SRE, ops-eqiad, DC-Ops, User-aborrero, cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (aborrero) These servers are going to be part of the `eqiad2dev` deployment, and should get the `-dev`prefix on them,...
[16:15:32] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage
[16:16:01] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm)
[16:17:55] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage
[16:23:24] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbstore1008']
[16:23:28] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbstore1009']
[16:23:52] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbstore1009']
[16:24:19] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbstore1009']
[16:25:03] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1144.eqiad.wmnet with OS bullseye
[16:28:11] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:29:13] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbstore1008']
[16:30:12] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbstore1009']
[16:39:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:41:27] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1145.eqiad.wmnet with OS bullseye
[16:42:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:42:09] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[16:43:04] <wikibugs>	 (CR) RobH: [C: +1] "I'm not sure if that role is going to work as its not specifically defined in the manifests for insetup role but we can give it a shot and" [puppet] - https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) (owner: Jclark-ctr)
[16:46:11] <wikibugs>	 (CR) Volans: "post-merge -1, this role doesn't exists, the reimage will fail because puppet will fail" [puppet] - https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) (owner: Jclark-ctr)
[16:52:01] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1700)
[17:00:05] <jouncebot>	 ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1700). nyaa~
[17:18:25] <wikibugs>	 SRE, Privacy Engineering, Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409 (ssingh) a:ssingh→None
[17:18:36] <wikibugs>	 SRE, Privacy Engineering, Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409 (ssingh) a:ssingh
[17:21:06] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (kostajh) a:kostajh→None
[17:21:58] <wikibugs>	 SRE, Continuous-Integration-Infrastructure, Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (kostajh) https://wikitech.wikimedia.org/wiki/Tool:Fix_Suggester_Bot got some of the way there, but I do...
[17:23:17] <wikibugs>	 SRE, Data Engineering and Event Platform Team, Data-Engineering, Data-Platform-SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (akosiaris) a:akosiaris→None
[17:32:19] <wikibugs>	 (PS2) Brouberol: Add kafka-jumbo1010.eqiad.wmnet to apps config [deployment-charts] - https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041)
[17:33:16] <wikibugs>	 (PS1) RobH: dbstore insetup role adjustment [puppet] - https://gerrit.wikimedia.org/r/958531 (https://phabricator.wikimedia.org/T342862)
[17:33:37] <wikibugs>	 (CR) RobH: [C: +2] dbstore insetup role adjustment [puppet] - https://gerrit.wikimedia.org/r/958531 (https://phabricator.wikimedia.org/T342862) (owner: RobH)
[17:40:07] <wikibugs>	 SRE, MediaWiki-Documentation, serviceops-radar, Documentation, and 2 others: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (Dereckson) Who are the ones responsible for this review?
[17:46:25] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:46:55] <icinga-wm>	 PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[17:49:01] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:51:08] <wikibugs>	 (CR) Eevans: [C: +1] sre.maps.reboot: Retire legacy cookbook [cookbooks] - https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) (owner: Muehlenhoff)
[17:55:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:55:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:56:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:57:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:58:05] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:59:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:12:57] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Deployments, serviceops-radar, Release-Engineering-Team (Radar), and 2 others: Remove provisioning for 'mwscript', 'foreachwikiindblist' etc from deployment host - https://phabricator.wikimedia.org/T253822 (dancy) a:dancy→None
[18:29:37] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:30:58] <wikibugs>	 (CR) Urbanecm: beta: Do not reference image-suggestion-api.wmcloud.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: Urbanecm)
[18:33:56] <wikibugs>	 (CR) Eevans: [C: +2] Revert "install: Use from-scratch partman recipe for restbase1030" [puppet] - https://gerrit.wikimedia.org/r/956063 (owner: Eevans)
[18:40:01] <icinga-wm>	 RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes
[18:42:23] <wikibugs>	 (PS1) Milimetric: wikireplicas: add user_is_temp column to user view [puppet] - https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679)
[18:42:33] <wikibugs>	 (CR) Eevans: [C: +2] cassandra: remove cassandra/twcs deployment [puppet] - https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) (owner: Eevans)
[18:45:12] <wikibugs>	 SRE, serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) a:RLazarus→None
[18:45:18] <wikibugs>	 SRE, serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (RLazarus) a:RLazarus
[18:45:34] <wikibugs>	 SRE, Wikimedia-Apache-configuration, serviceops: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (RLazarus) a:RLazarus→None
[18:45:44] <wikibugs>	 SRE, Wikimedia-Apache-configuration, serviceops: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (RLazarus) a:RLazarus
[19:10:02] <wikibugs>	 (PS2) AOkoth: ats: add ticket-test [puppet] - https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027)
[19:10:04] <wikibugs>	 (PS1) AOkoth: vrts: vrts1002 change global_cert_name [puppet] - https://gerrit.wikimedia.org/r/958565 (https://phabricator.wikimedia.org/T340027)
[19:14:10] <wikibugs>	 (CR) Brouberol: [C: +2] Add kafka-jumbo1010.eqiad.wmnet to apps config [deployment-charts] - https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[19:14:15] <wikibugs>	 (PS2) AOkoth: vrts: vrts1002 change global_cert_name [puppet] - https://gerrit.wikimedia.org/r/958565 (https://phabricator.wikimedia.org/T340027)
[19:15:03] <wikibugs>	 (Merged) jenkins-bot: Add kafka-jumbo1010.eqiad.wmnet to apps config [deployment-charts] - https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[19:22:00] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dbstore1008.eqiad.wmnet with OS bullseye
[19:22:02] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dbstore1009.eqiad.wmnet with OS bullseye
[19:22:06] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye
[19:22:09] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye
[19:34:06] <wikibugs>	 (PS6) Ilias Sarantopoulos: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445)
[19:38:34] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[19:39:40] <wikibugs>	 (PS7) Ilias Sarantopoulos: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445)
[19:41:18] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +2] ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[19:42:21] <wikibugs>	 (Merged) jenkins-bot: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[19:43:49] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[19:45:27] <icinga-wm>	 RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[19:46:32] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[19:50:48] <wikibugs>	 SRE: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (dr0ptp4kt)
[19:51:38] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:56:16] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[19:56:38] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:57:51] <wikibugs>	 (CR) Sergio Gimeno: [C: +1] Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: Urbanecm)
[20:00:06] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T2000). Please do the needful.
[20:00:06] <jouncebot>	 Dreamy_Jazz, Kizule, and Sergi0: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:17] <wikibugs>	 (CR) Bking: "@btullis @ryankemper are y'all OK with this approach? I've used the data-engineering team as a contact until we figure out contacts in T34" [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[20:00:31] <sergi0>	 hi
[20:00:41] <Dreamy_Jazz>	 \o
[20:02:06] <cjming>	 hi i can deploy
[20:02:11] <Dreamy_Jazz>	 :D
[20:02:46] <wikibugs>	 (PS3) Clare Ming: clienthints: Pin wgCheckUserDisplayClientHints to false [mediawiki-config] - https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: Dreamy Jazz)
[20:03:27] <wikibugs>	 (CR) Clare Ming: [C: +2] Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: Urbanecm)
[20:03:51] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: Dreamy Jazz)
[20:04:50] <wikibugs>	 (Merged) jenkins-bot: clienthints: Pin wgCheckUserDisplayClientHints to false [mediawiki-config] - https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: Dreamy Jazz)
[20:05:05] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:958024|clienthints: Pin wgCheckUserDisplayClientHints to false (T337942)]]
[20:05:11] <stashbot>	 T337942: Display client hint data - https://phabricator.wikimedia.org/T337942
[20:05:18] <wikibugs>	 (CR) Btullis: [C: -1] "Sorry I haven't had much of a chance to address this with you yet, but I don't think it's ready yet." [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[20:05:47] <wikibugs>	 (PS2) Clare Ming: clienthints: Enable purging of data on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: Dreamy Jazz)
[20:06:07] <Dreamy_Jazz>	 Thanks. I won't be able to test this one as this config does not exist but will do once a patch that depends on this config change is merged.
[20:06:13] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage
[20:06:14] <Dreamy_Jazz>	 As such there isn't anything to test
[20:06:29] <cjming>	 Dreamy_Jazz: roger that - i'll go ahead and sync then
[20:06:32] <logmsgbot>	 !log cjming@deploy1002 cjming and dreamyjazz: Backport for [[gerrit:958024|clienthints: Pin wgCheckUserDisplayClientHints to false (T337942)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:06:38] <logmsgbot>	 !log cjming@deploy1002 cjming and dreamyjazz: Continuing with sync
[20:09:19] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage
[20:12:56] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage
[20:13:23] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:958024|clienthints: Pin wgCheckUserDisplayClientHints to false (T337942)]] (duration: 08m 18s)
[20:13:27] <stashbot>	 T337942: Display client hint data - https://phabricator.wikimedia.org/T337942
[20:13:31] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: Dreamy Jazz)
[20:14:45] <wikibugs>	 (Merged) jenkins-bot: clienthints: Enable purging of data on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: Dreamy Jazz)
[20:15:00] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:958025|clienthints: Enable purging of data on all wikis (T257893)]]
[20:15:07] <stashbot>	 T257893: [EPIC] Support User-Agent Client Hints header in CheckUser - https://phabricator.wikimedia.org/T257893
[20:15:13] <Dreamy_Jazz>	 Thanks. I'll not really be able to test this one either as it relies on jobs that are queued up and I'm not sure that those jobs are sent to mwdebug servers?
[20:15:28] <Dreamy_Jazz>	 Plus it's a random chance as to whether the job is queued.
[20:15:59] <cjming>	 Dreamy_Jazz: sounds good -- your 1st patch should be live and i'll sync your 2nd patch shortly
[20:16:02] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage
[20:16:05] <Dreamy_Jazz>	 Great.
[20:16:25] <cjming>	 Kizule: are you here for your patch?
[20:16:27] <logmsgbot>	 !log cjming@deploy1002 cjming and dreamyjazz: Backport for [[gerrit:958025|clienthints: Enable purging of data on all wikis (T257893)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:16:38] <logmsgbot>	 !log cjming@deploy1002 cjming and dreamyjazz: Continuing with sync
[20:17:34] <cjming>	 Sergi0: i'll proceed with yours next
[20:17:43] <sergi0>	 great
[20:23:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:23:44] <wikibugs>	 (Merged) jenkins-bot: Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: Urbanecm)
[20:24:10] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:24:25] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:958025|clienthints: Enable purging of data on all wikis (T257893)]] (duration: 09m 24s)
[20:24:28] <stashbot>	 T257893: [EPIC] Support User-Agent Client Hints header in CheckUser - https://phabricator.wikimedia.org/T257893
[20:24:56] <logmsgbot>	 !log cjming@deploy1002 Started scap: Backport for [[gerrit:957871|Link recommendations: prevent too large offsets in cirrus queries (T345713)]]
[20:24:59] <stashbot>	 T345713: fixLinkRecommendationData script yields cirrussearch-offset-too-large - https://phabricator.wikimedia.org/T345713
[20:25:05] <cjming>	 Dreamy_Jazz: 2nd patch is live
[20:25:21] <Dreamy_Jazz>	 Thanks!
[20:25:29] <cjming>	 yw!
[20:25:48] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (dr0ptp4kt)
[20:26:26] <logmsgbot>	 !log cjming@deploy1002 urbanecm and cjming: Backport for [[gerrit:957871|Link recommendations: prevent too large offsets in cirrus queries (T345713)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:26:31] <cjming>	 sergi0: is your patch testable?
[20:26:59] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[20:27:00] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:27:07] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1009.eqiad.wmnet with OS bullseye
[20:27:12] <sergi0>	 yes, I'll try to run the script once in the debug server, we don't need to wait for it to finish
[20:27:14] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye completed: -...
[20:28:08] <sergi0>	 (in dry mode)
[20:28:23] <cjming>	 sergi0: i'll wait for your greenlight to sync
[20:28:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:28:53] <wikibugs>	 (PS1) Dr0ptp4kt: dr0ptp4kt WDQS, Search, Analytics access [puppet] - https://gerrit.wikimedia.org/r/958568 (https://phabricator.wikimedia.org/T346694)
[20:29:26] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:29:47] <sergi0>	 seems ok on my end
[20:29:52] <cjming>	 great - syncing
[20:29:57] <logmsgbot>	 !log cjming@deploy1002 urbanecm and cjming: Continuing with sync
[20:30:26] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:30:32] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1008.eqiad.wmnet with OS bullseye
[20:30:38] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye completed: -...
[20:31:41] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jhancock.wm)
[20:32:34] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (Jhancock.wm) Open→Resolved @Btullis completed
[20:36:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:36:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:36:36] <logmsgbot>	 !log cjming@deploy1002 Finished scap: Backport for [[gerrit:957871|Link recommendations: prevent too large offsets in cirrus queries (T345713)]] (duration: 11m 40s)
[20:36:40] <stashbot>	 T345713: fixLinkRecommendationData script yields cirrussearch-offset-too-large - https://phabricator.wikimedia.org/T345713
[20:36:47] <cjming>	 sergi0: should be live!
[20:37:01] <sergi0>	 cool. Thank you!
[20:37:12] <cjming>	 np!
[20:37:36] <cjming>	 i'll keep the window open for a few more minutes in case Kizule shows up
[20:38:39] <wikibugs>	 (PS3) Clare Ming: Enable WikiLove on arwikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391) (owner: Zoranzoki21)
[20:41:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:41:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[20:42:18] <wikibugs>	 (PS4) C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: Sbailey)
[20:44:07] <wikibugs>	 (CR) Dr0ptp4kt: [C: +1] wikireplicas: add user_is_temp column to user view [puppet] - https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: Milimetric)
[20:49:18] <cjming>	 !log end of UTC late backport window
[20:49:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:51:45] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:52:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (Ahoelzl) Regarding public keys: Both are now published on the office wiki: https://office.wikimedia.org/wiki/User:AHoelzl-WMF  I don't seem to have an AHoelzl-WMF account:...
[20:53:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:55:16] <wikibugs>	 (PS5) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[20:56:28] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[20:59:15] <wikibugs>	 (PS6) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T2100). Please do the needful.
[21:00:27] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[21:13:53] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1003.eqiad.wmnet
[21:15:23] <wikibugs>	 (PS1) Ryan Kemper: wdqs: decom old canary wdqs1003 [puppet] - https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198)
[21:15:35] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:15:58] <wikibugs>	 (PS2) Ryan Kemper: wdqs: decom old canary wdqs1003 [puppet] - https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198)
[21:16:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:17:35] <wikibugs>	 (PS3) Ryan Kemper: wdqs: decom wdqs100[3,4] [puppet] - https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198)
[21:17:49] <wikibugs>	 (PS4) Ryan Kemper: wdqs: decom wdqs100[3,4] [puppet] - https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198)
[21:18:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:19:51] <maryum>	 !log Deployed patch for T344359
[21:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:20:48] <wikibugs>	 ops-eqiad, decommission-hardware: decommission wdqs100[3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T346699 (RKemper)
[21:22:23] <wikibugs>	 (CR) Bking: [C: +1] wdqs: decom wdqs100[3,4] [puppet] - https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) (owner: Ryan Kemper)
[21:22:29] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] wdqs: decom wdqs100[3,4] [puppet] - https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) (owner: Ryan Kemper)
[21:23:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:29:09] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:30:51] <wikibugs>	 SRE, serviceops, CommRel-Specialists-Support (Jul-Sep-2023), Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (UOzurumba)
[21:36:55] <wikibugs>	 (CR) Btullis: Increase the kafka-jumbo maximum message size to 10 MB (1 comment) [puppet] - https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: Btullis)
[21:40:10] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[21:45:56] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[21:47:21] <wikibugs>	 (PS7) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[21:48:33] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[21:48:49] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:49:03] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:49:17] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[21:49:17] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:49:18] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs1003.eqiad.wmnet
[21:50:13] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:50:23] <wikibugs>	 (PS8) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[21:50:27] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:51:19] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1004.eqiad.wmnet
[21:51:35] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[21:54:47] <wikibugs>	 (PS9) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[21:56:00] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[21:58:53] <wikibugs>	 (PS10) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[21:59:15] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox
[22:00:40] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[22:01:13] <wikibugs>	 (PS11) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[22:02:25] <wikibugs>	 (CR) CI reject: [V: -1] prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: Bking)
[22:06:41] <wikibugs>	 (PS12) Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792)
[22:07:54] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[22:08:58] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001"
[22:08:58] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:08:59] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs1004.eqiad.wmnet
[22:31:31] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:46:31] <wikibugs>	 SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm) Open→Resolved
[22:48:27] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:50:13] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:52:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:54:57] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:17] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[22:55:41] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.329 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:28:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:30:23] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:31:19] <icinga-wm>	 PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:32:53] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:34:17] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:37:35] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:38:33] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:45:55] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.253 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:46:43] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:52:55] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:54:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:59:07] <icinga-wm>	 RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state