[00:00:24] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:00:33] (DatasourceError) firing: Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:01:50] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:05:48] (DatasourceError) firing: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:10:48] (DatasourceError) resolved: (2) Nonwrite HTTP requests with primary DB writes alert - https://grafana.wikimedia.org/alerting/grafana/4p0FIj1Vkz/view - https://wikitech.wikimedia.org/wiki/Monitoring/DatasourceError - https://alerts.wikimedia.org/?q=alertname%3DDatasourceError
[00:16:08] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:16:12] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[00:38:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957813
[00:38:30] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) -
https://gerrit.wikimedia.org/r/957813 (owner: TrainBranchBot)
[00:42:18] RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:42:50] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:53:27] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/957813 (owner: TrainBranchBot)
[01:15:02] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:42] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:49:18] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:54:18] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (DELETE certificaterequests) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[01:55:34] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:58:24] RECOVERY - restbase endpoints health on
restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:01:20] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:02:46] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:06:31] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:19:37] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:28:44] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:30:08] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:34:37] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:26] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL:
/en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:38:52] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:57:12] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:58:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:58:38] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:03:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[03:07:14] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:08:38] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:25:23] SRE, Cloud-VPS:
cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (Andrew) Background: Each host has three dns servers, mdns (which is managed by openstack designate) pdns (auth for the outside world) and pdns-recursor (for VM reques...
[03:54:04] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[03:56:54] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:46:34] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:49:24] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1
[04:56:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:04:44] PROBLEM
- restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:06:10] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:15:56] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:17:20] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:26:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:48:50] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:50:16] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:54:33] (PS1) Ilias Sarantopoulos: alertmanager: create ml team alerts [puppet] - https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151)
[05:59:26] PROBLEM - restbase
endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:00:50] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:15:01] (PS3) Ilias Sarantopoulos: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445)
[06:24:19] (PS4) Elukey: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[06:27:07] (PS5) Elukey: ml-services: increase memory for eswiki [deployment-charts] - https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[06:36:31] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:43:01] SRE, Infrastructure-Foundations: Identity Management System for Wikimedia developer accounts - https://phabricator.wikimedia.org/T315867 (SLyngshede-WMF) Open→Invalid Project is in development, see: https://phabricator.wikimedia.org/T189531
[06:53:41] (PS1) Muehlenhoff: Switch sretest1002 to nftables [puppet] - https://gerrit.wikimedia.org/r/958389
[06:57:47] (PS1) Ayounsi: dns: remove mentions of knams [dns] - https://gerrit.wikimedia.org/r/958390 (https://phabricator.wikimedia.org/T344579)
[06:59:20] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957918
(https://phabricator.wikimedia.org/T336041) (owner: Brouberol)
[07:00:05] Amir1, Urbanecm, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T0700).
[07:00:05] Aca: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:10] * Aca waves
[07:00:51] good morning
[07:01:37] Good morning, taavi! Are you deploying today?
[07:01:40] (CR) TrainBranchBot: [C: +2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/957873 (https://phabricator.wikimedia.org/T346589) (owner: Acamicamacaraca)
[07:01:43] yes
[07:02:02] okie, nice, I'm around
[07:02:30] (Merged) jenkins-bot: robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/957873 (https://phabricator.wikimedia.org/T346589) (owner: Acamicamacaraca)
[07:03:56] !log taavi@deploy1002 Started scap: Backport for [[gerrit:957873|robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki (T346589)]]
[07:04:03] T346589: Disable indexing user (talk) pages and draft (talk) pages on shwiki - https://phabricator.wikimedia.org/T346589
[07:04:54] (CR) Majavah: "Please remember that all patches merged to this repository must be pulled down to deploy1002 or the next deployer will get confused due to" [mediawiki-config] - https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[07:09:39] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/957940 (owner: Slyngshede)
[07:13:20] !log taavi@deploy1002 aleksandar and taavi: Backport for [[gerrit:957873|robots.txt: Disable indexing user (talk) pages and draft (talk)
pages on shwiki (T346589)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:13:23] T346589: Disable indexing user (talk) pages and draft (talk) pages on shwiki - https://phabricator.wikimedia.org/T346589
[07:13:29] checking it now via Debug
[07:15:56] !log installing clamav security updates
[07:15:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:16:26] Seems fine. According to page data, drafts and userpages now have indexing disallowed, which is expected.
[07:17:55] thanks. syncing
[07:17:57] !log taavi@deploy1002 aleksandar and taavi: Continuing with sync
[07:18:21] (PS1) Majavah: typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392
[07:21:42] (CR) Ayounsi: [C: +1] typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392 (owner: Majavah)
[07:26:21] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:957873|robots.txt: Disable indexing user (talk) pages and draft (talk) pages on shwiki (T346589)]] (duration: 22m 24s)
[07:26:24] T346589: Disable indexing user (talk) pages and draft (talk) pages on shwiki - https://phabricator.wikimedia.org/T346589
[07:26:27] Aca: your patch is live. it might take a while for search engines to notice and remove any current pages from their indexes, but there's unfortunately not much we can do about that
[07:26:47] Understandable! Thank you!
[07:26:49] (CR) Majavah: [C: +2] typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392 (owner: Majavah)
[07:27:29] (Merged) jenkins-bot: typos: remove knams/pmtpa references [mediawiki-config] - https://gerrit.wikimedia.org/r/958392 (owner: Majavah)
[07:30:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:30:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:32:26] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:33:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:34:30] (CR) Jelto: [C: +2] miscweb: add static-codereview to wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[07:35:31] (Merged) jenkins-bot: miscweb: add static-codereview to wikikube [deployment-charts] - https://gerrit.wikimedia.org/r/957681 (https://phabricator.wikimedia.org/T346309) (owner: Jelto)
[07:37:59] SRE, Infrastructure-Foundations, netbox, netops: Netbox Juniper report - https://phabricator.wikimedia.org/T306238 (SLyngshede-WMF) Rereading the answer for Juniper: > For OIDC we’ll need your IDToken which would look like below or the IDP Issuer URL (This URL must be publicly accessible). > S...
[07:38:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:44:03] taavi: I want to rebase a change in deploy1002?
is that fine
[07:44:21] Amir1: yes, I'm done deploying
[07:44:37] (CR) Ladsgroup: [C: +2] Enable native MathML on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/958054 (https://phabricator.wikimedia.org/T346584) (owner: Physikerwelt)
[07:44:41] awesome
[07:46:39] (Merged) jenkins-bot: Enable native MathML on beta [mediawiki-config] - https://gerrit.wikimedia.org/r/958054 (https://phabricator.wikimedia.org/T346584) (owner: Physikerwelt)
[07:51:26] (CR) Klausman: alertmanager: create ml team alerts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: Ilias Sarantopoulos)
[07:51:53] (CR) Klausman: [C: +1] profile::thanos: add increase-based rec rules for Istio [puppet] - https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[07:52:24] (CR) Klausman: [C: +1] services: disable Changeprop's ORES Cache stream [deployment-charts] - https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: Elukey)
[07:57:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[07:57:59] SRE-tools, Infrastructure-Foundations, Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (Volans) Yes and no. The wmflib code could be improved to distinguish between a permission error and any other error and raise two differen...
[07:58:06] !log running db checksum run in s3 eqiad replicas (T207253)
[07:58:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:58:12] T207253: Automatically compare a few tables per section between hosts and DC - https://phabricator.wikimedia.org/T207253
[08:01:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.281 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:02:00] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.077 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:14:05] (CR) Abijeet Patro: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: Abijeet Patro)
[08:14:44] (CR) Kosta Harlan: [C: +1] clienthints: Enable purging of data on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: Dreamy Jazz)
[08:15:09] (CR) Kosta Harlan: [C: +1] clienthints: Pin wgCheckUserDisplayClientHints to false [mediawiki-config] - https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: Dreamy Jazz)
[08:18:52] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:19:03] (CR) Filippo Giunchedi: [C: +2] librenms: refactor ensure_packages [puppet] - https://gerrit.wikimedia.org/r/957848 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi)
[08:19:11] (CR) Filippo Giunchedi: [C: +2] librenms: use timer name in journal [puppet] - https://gerrit.wikimedia.org/r/957847 (https://phabricator.wikimedia.org/T344136) (owner:
Filippo Giunchedi)
[08:19:14] (CR) Filippo Giunchedi: [C: +2] librenms: fix permissions on logs and 'lnms' [puppet] - https://gerrit.wikimedia.org/r/957846 (https://phabricator.wikimedia.org/T344136) (owner: Filippo Giunchedi)
[08:20:16] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:24:20] PROBLEM - Check systemd state on stat1005 is CRITICAL: CRITICAL - degraded: The following units failed: session-c1679.scope,user@113.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:04] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[08:25:17] (CR) Muehlenhoff: [C: +2] Switch sretest1002 to nftables [puppet] - https://gerrit.wikimedia.org/r/958389 (owner: Muehlenhoff)
[08:26:27] (PS1) Ilias Sarantopoulos: ml-services: update revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/958398 (https://phabricator.wikimedia.org/T346445)
[08:28:24] (CR) Vgutierrez: varnishkafka: logrotate should use systemctl to reload rsyslog (2 comments) [puppet] - https://gerrit.wikimedia.org/r/957995 (owner: Fabfur)
[08:28:45] (CR) Filippo Giunchedi: [C: +1] "Fair enough re: manual removal, LGTM" [puppet] - https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[08:29:57] (PS1) Ilias Sarantopoulos: ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399
[08:31:14] (CR) Filippo Giunchedi: [C: +2] prometheus: support setting owner/group in assemble-config [puppet] - https://gerrit.wikimedia.org/r/957850 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:31:17] (CR) Filippo Giunchedi: [C: +2] prometheus: snmp-exporter support for assemble-config [puppet] -
https://gerrit.wikimedia.org/r/957851 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:31:22] (CR) Filippo Giunchedi: [C: +2] prometheus: use assemble-config for snmp-exporter [puppet] - https://gerrit.wikimedia.org/r/957852 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:32:46] (PS1) Muehlenhoff: Switch ganeti-test to nftables [puppet] - https://gerrit.wikimedia.org/r/958400
[08:33:52] (CR) Fabfur: [V: +1 C: -1] varnishkafka: logrotate should use systemctl to reload rsyslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957995 (owner: Fabfur)
[08:35:02] (PS1) Filippo Giunchedi: prometheus: fix old reference to prometheus-snmp-exporter-config [puppet] - https://gerrit.wikimedia.org/r/958401 (https://phabricator.wikimedia.org/T346335)
[08:35:57] (CR) Filippo Giunchedi: [C: +2] prometheus: fix old reference to prometheus-snmp-exporter-config [puppet] - https://gerrit.wikimedia.org/r/958401 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:36:47] (CR) Sergio Gimeno: [C: +1] "lgtm. Curiosity, the access token for beta is set in private/PrivateSettings.php, how is that file handled? Are there any docs around?" [mediawiki-config] - https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: Urbanecm)
[08:40:39] (CR) Ilias Sarantopoulos: [C: +2] ml-services: update revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/958398 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[08:41:17] (PS2) Slyngshede: gerrit: Link account creation to IDM. [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226)
[08:41:25] (CR) Slyngshede: gerrit: Link account creation to IDM.
(3 comments) [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:41:58] (Merged) jenkins-bot: ml-services: update revscoring isvcs [deployment-charts] - https://gerrit.wikimedia.org/r/958398 (https://phabricator.wikimedia.org/T346445) (owner: Ilias Sarantopoulos)
[08:42:16] (CR) Hashar: [C: +1] "Excellent :)" [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:43:13] (CR) Slyngshede: [C: +2] gerrit: Link account creation to IDM. [puppet] - https://gerrit.wikimedia.org/r/953967 (https://phabricator.wikimedia.org/T345226) (owner: Slyngshede)
[08:43:32] (PS2) Fabfur: varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602)
[08:43:42] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/958400 (owner: Muehlenhoff)
[08:43:58] (CR) CI reject: [V: -1] varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[08:44:01] (PS1) Filippo Giunchedi: prometheus: let assemble-config write snmp.yml [puppet] - https://gerrit.wikimedia.org/r/958402 (https://phabricator.wikimedia.org/T346335)
[08:45:09] (CR) Filippo Giunchedi: [C: +2] prometheus: let assemble-config write snmp.yml [puppet] - https://gerrit.wikimedia.org/r/958402 (https://phabricator.wikimedia.org/T346335) (owner: Filippo Giunchedi)
[08:45:51] (CR) Fabfur: [C: -1] varnishkafka: logrotate should use systemctl to reload rsyslog (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[08:46:35] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply
[08:46:56] (PS3) Fabfur:
varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602)
[08:47:50] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[08:48:00] SRE, Traffic, Patch-For-Review: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 (Vgutierrez)
[08:48:06] (CR) Fabfur: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: Fabfur)
[08:48:08] SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (Vgutierrez)
[08:50:27] (PS1) Filippo Giunchedi: benthos: more informative processor labels for webrequest [puppet] - https://gerrit.wikimedia.org/r/958403 (https://phabricator.wikimedia.org/T346140)
[08:50:29] (PS1) JMeybohm: kubernetes::master: Switch to PKI for SA signing [puppet] - https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826)
[08:51:07] SRE, GrowthExperiments-Homepage, GrowthExperiments-ImpactModule, Growth-Team (Current Sprint), Performance Issue: RefreshUserImpactJob consumes too many file descriptors - https://phabricator.wikimedia.org/T344428 (Urbanecm_WMF) Tagging #sre for assistance with this issue, as it is definitely...
[08:53:08] (CR) Vgutierrez: [C: -1] aptrepo: Add Bookworm HAProxy third party repos (1 comment) [puppet] - https://gerrit.wikimedia.org/r/957766 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[08:55:41] (PS2) Ilias Sarantopoulos: ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399
[08:56:29] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[08:56:54] (CR) Ayounsi: [C: +2] dns: remove mentions of knams [dns] - https://gerrit.wikimedia.org/r/958390 (https://phabricator.wikimedia.org/T344579) (owner: Ayounsi)
[08:57:51] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:58:59] (CR) Elukey: [C: +1] ml-services: fix reverted values [deployment-charts] - https://gerrit.wikimedia.org/r/958399 (owner: Ilias Sarantopoulos)
[08:59:17] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:02:32] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[09:02:54] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:03:05] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-articletopic' for release 'main' .
[09:03:15] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:03:22] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-drafttopic' for release 'main' .
[09:03:31] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[09:03:40] jouncebot: now
[09:03:40] No deployments scheduled for the next 0 hour(s) and 56 minute(s)
[09:03:42] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' .
[09:03:47] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:03:51] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:03:56] I am going to merge a change for Flow which only affects tests ( https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Flow/+/957872 ) [09:04:44] (03PS1) 10JMeybohm: kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) [09:04:46] (03PS1) 10JMeybohm: kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) [09:05:02] (03CR) 10Hashar: [C: 03+2] tests: Do not assume UTSysop exists [extensions/Flow] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957872 (https://phabricator.wikimedia.org/T346253) (owner: 10Urbanecm) [09:05:18] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [09:06:09] (03CR) 10Elukey: [C: 03+2] profile::thanos: add increase-based rec rules for Istio [puppet] - 10https://gerrit.wikimedia.org/r/956841 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [09:06:59] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:07:10] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43330/console" [puppet] - 10https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:07:29] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43331/console" [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:07:49] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43332/console" [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:09:52] (03Merged) 10jenkins-bot: tests: Do not assume UTSysop exists [extensions/Flow] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957872 (https://phabricator.wikimedia.org/T346253) (owner: 10Urbanecm) [09:12:39] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: fix reverted values [deployment-charts] - 10https://gerrit.wikimedia.org/r/958399 (owner: 10Ilias Sarantopoulos) [09:13:26] (03Merged) 10jenkins-bot: ml-services: fix reverted values [deployment-charts] - 10https://gerrit.wikimedia.org/r/958399 (owner: 10Ilias Sarantopoulos) [09:14:57] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:15:05] (03PS1) 10Ilias Sarantopoulos: httpbb: fix ml-staging eswikiquote [puppet] - 10https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) [09:15:41] (03CR) 10Stevemunene: [C: 03+2] datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [09:16:23] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:16:28] (03CR) 10Giuseppe Lavagetto: [C: 03+1] kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:16:42] (03Merged) 10jenkins-bot: datahub: add oidc production settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/954132 
(https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [09:18:32] (03CR) 10KartikMistry: [C: 03+1] Enable MinT translation service on Meta-Wiki - rollout #5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [09:21:18] (03CR) 10Elukey: [C: 03+1] kubernetes::master: Switch to PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:23:57] (03CR) 10Elukey: [C: 03+1] kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:24:29] (03CR) 10Elukey: [C: 03+1] kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:24:57] (03CR) 10Ayounsi: [C: 03+1] Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - 10https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) (owner: 10Cathal Mooney) [09:25:09] !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [09:25:10] !log hashar@deploy1002 Started scap: Backport for [[gerrit:957872|tests: Do not assume UTSysop exists (T346253)]] [09:25:14] T346253: CannotCreateActorException: Cannot create an actor for a usable name that is not an existing user: user_name="U" - https://phabricator.wikimedia.org/T346253 [09:25:37] (03CR) 10Elukey: httpbb: fix ml-staging eswikiquote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [09:25:40] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:26:36] (03PS2) 10Ilias Sarantopoulos: httpbb: fix ml-staging eswikiquote [puppet] - 10https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) [09:26:43] !log hashar@deploy1002 hashar and urbanecm: Backport for [[gerrit:957872|tests: Do not assume UTSysop exists (T346253)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [09:27:20] (03CR) 10Ilias Sarantopoulos: httpbb: fix ml-staging eswikiquote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [09:27:22] !log hashar@deploy1002 hashar and urbanecm: Continuing with sync [09:28:36] !log set max-repeaters for cr3-eqsin in librenms - T346606 [09:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:40] T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 [09:28:43] !log set max-repeaters to 20 for cr3-eqsin in librenms - T346606 [09:28:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:42] (03CR) 10Cathal Mooney: [C: 03+2] Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - 10https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) (owner: 10Cathal Mooney) [09:30:17] (03Merged) 10jenkins-bot: Explicitly set hash mode type on QFX5100 devices for ECMP [homer/public] - 10https://gerrit.wikimedia.org/r/957925 (https://phabricator.wikimedia.org/T339852) (owner: 10Cathal Mooney) [09:30:17] PROBLEM - Wikitech-static main page has content on wikitech-static.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [09:30:52] (03CR) 10Alexandros Kosiaris: [C: 03+1] "\ο/" [deployment-charts] - 10https://gerrit.wikimedia.org/r/957687 
(https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [09:31:59] !log disabled puppet on cp4052 for T346602 [09:32:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:32:12] T346602: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 [09:34:17] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:957872|tests: Do not assume UTSysop exists (T346253)]] (duration: 09m 06s) [09:34:17] RECOVERY - Wikitech-static main page has content on wikitech-static.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 27458 bytes in 0.235 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [09:34:21] T346253: CannotCreateActorException: Cannot create an actor for a usable name that is not an existing user: user_name="U" - https://phabricator.wikimedia.org/T346253 [09:38:53] !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [09:39:56] !log enabled puppet on cp4052 for T346602 [09:39:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:59] T346602: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 [09:40:03] !log disabled puppet on cp4050 for T346602 [09:40:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:17] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:42:17] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[09:42:54] (03CR) 10Ilias Sarantopoulos: [C: 03+1] ":D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [09:43:10] (03CR) 10Elukey: [C: 03+2] services: disable Changeprop's ORES Cache stream [deployment-charts] - 10https://gerrit.wikimedia.org/r/957687 (https://phabricator.wikimedia.org/T342116) (owner: 10Elukey) [09:43:26] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [09:43:27] jouncebot: next [09:43:27] In 0 hour(s) and 16 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1000) [09:43:45] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:43:45] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:44:15] !log enabled puppet on cp4050 for T346602 [09:44:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:45:09] (03PS1) 10Arnaudb: icinga: add my arnaudb [puppet] - 10https://gerrit.wikimedia.org/r/957815 [09:45:59] 10SRE, 10Traffic, 10Patch-For-Review: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 (10Fabfur) Cannot test the actual change with PCC but tested on two different hosts (cp4050 && cp4052) to check behavior. The new logrotate configuration actually seems to rotate correc... 
[09:46:05] (03PS2) 10Arnaudb: icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 [09:46:37] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:46:44] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [09:46:57] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [09:47:31] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:48:32] 10SRE, 10Data-Persistence, 10observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (10jcrespo) [09:48:42] (03CR) 10Fabfur: "This has been tested on 2 different hosts in production (cp4050 and cp4052) and the behavior is the expected one." 
[puppet] - 10https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: 10Fabfur) [09:48:57] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:49:01] (03PS3) 10Arnaudb: icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) [09:49:33] (03CR) 10Fabfur: [C: 03+2] varnishkafka: logrotate should use systemctl to reload rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/957995 (https://phabricator.wikimedia.org/T346602) (owner: 10Fabfur) [09:49:37] 10SRE, 10Data-Persistence, 10observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (10jcrespo) a:03ABran-WMF [09:49:42] (03CR) 10CI reject: [V: 04-1] icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [09:49:42] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [09:49:46] 10SRE, 10Data-Persistence, 10observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (10jcrespo) p:05Triage→03High [09:49:48] (03CR) 10Btullis: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:49:54] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [09:49:57] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly [09:50:09] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [09:50:27] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) [09:50:48] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [09:50:48] !log kamila@cumin1001 START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite [09:50:50] !log kamila@cumin1001 END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) [09:51:48] kamila_: o/ are you testing? Wondering if it is ok for me to continue deploying changeprop or not [09:52:15] elukey: ah, sorry, I'm done now [09:52:47] (did you also want to make use of the space between deployment windows? 
:D) [09:53:13] :D [09:53:17] okok proceeding [09:53:57] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:53:59] PROBLEM - Wikitech and wt-static content in sync on wikitech-static.wikimedia.org is CRITICAL: wikitech-static CRIT - wikitech and wikitech-static out of sync (207534s 200000s) https://wikitech.wikimedia.org/wiki/Wikitech-static [09:54:39] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: sync [09:54:50] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync [09:55:01] (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1010.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957918 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [09:55:23] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:56:02] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [09:56:58] (03CR) 10Vgutierrez: [C: 04-1] "ticket-test.wm.o isn't a valid SAN on the backend TLS certificate" [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [09:58:31] (03PS1) 10Arturo Borrero Gonzalez: dbutils: introduce statement define [puppet] - 10https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603) [09:59:04] (03PS2) 10Arturo Borrero Gonzalez: dbutils: introduce statement define [puppet] - 10https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603) [09:59:31] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603) (owner: 10Arturo Borrero 
Gonzalez) [09:59:33] (03CR) 10Btullis: [C: 03+1] "Adding jbond and Muehlenhoff for visibility." [puppet] - 10https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [09:59:53] !log remove ores-cache stream from changeprop (side effects - higher ORES client latencies, no mediawiki.revision-score event stream published) - https://phabricator.wikimedia.org/T342116 [09:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1000) [10:03:06] (03CR) 10Elukey: [C: 03+2] httpbb: fix ml-staging eswikiquote (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958429 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [10:03:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] dbutils: introduce statement define [puppet] - 10https://gerrit.wikimedia.org/r/958432 (https://phabricator.wikimedia.org/T346603) (owner: 10Arturo Borrero Gonzalez) [10:05:58] (03PS1) 10Jelto: trafficserver: switch static-codereview.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) [10:06:38] (03CR) 10Elukey: [C: 03+1] benthos: more informative processor labels for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/958403 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [10:07:24] (03PS3) 10Kamila Součková: wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) [10:07:43] (03CR) 10Filippo Giunchedi: [C: 03+2] benthos: more informative processor labels for webrequest [puppet] - 10https://gerrit.wikimedia.org/r/958403 (https://phabricator.wikimedia.org/T346140) (owner: 10Filippo Giunchedi) [10:10:34] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/954009 
(https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:15:01] (03CR) 10Vgutierrez: [C: 04-1] "please check the TLS material used in the backend side:" [puppet] - 10https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [10:15:17] (03PS2) 10Ilias Sarantopoulos: Lower ores.wikimedia.org's TTL to 5M [dns] - 10https://gerrit.wikimedia.org/r/957689 (owner: 10Elukey) [10:16:26] (03PS6) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) [10:16:28] (03PS6) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [10:16:30] (03PS6) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [10:16:32] (03PS6) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) [10:16:34] (03PS6) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [10:16:36] (03PS1) 10Brouberol: Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) [10:17:52] (03CR) 10Vgutierrez: [C: 03+1] Lower ores.wikimedia.org's TTL to 5M (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/957689 (owner: 10Elukey) [10:19:52] (03CR) 10Brouberol: "I attempted to fix the node regular expression by getting rid of the `[01-10]` range that didn't seem to work, and rebased all other chang" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [10:19:58] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958436 
(https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [10:20:11] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:20:45] (03CR) 10Stevemunene: [C: 03+2] idp: add datahub as oidc service [puppet] - 10https://gerrit.wikimedia.org/r/954009 (https://phabricator.wikimedia.org/T305874) (owner: 10Stevemunene) [10:28:02] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on P{maps200[5,6].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad) [10:28:45] (03PS2) 10Brouberol: Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) [10:28:47] (03PS7) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) [10:28:49] (03PS7) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [10:28:51] (03PS7) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [10:28:53] (03PS7) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) [10:28:55] (03PS7) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [10:29:07] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:21] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958436 
(https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [10:29:56] (03CR) 10CI reject: [V: 04-1] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [10:33:02] (03CR) 10Kamila Součková: [C: 04-2] "adding -2 for now to avoid accidental merge" [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [10:33:17] (03CR) 10Kamila Součková: "adding -2 for now to avoid accidental merge" [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [10:33:53] (03CR) 10Kamila Součková: [C: 04-2] "adding -2 for now to avoid accidental merge (for real this time :D)" [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [10:33:58] !log set max-repeaters to 20 for cr3-eqsin using "force save" - T346606 [10:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:02] T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 [10:34:04] (03CR) 10Btullis: [C: 03+1] Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [10:36:10] (03CR) 10Jelto: "looping in jayme for some more ingress certificate insights." 
[puppet] - 10https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [10:36:31] (JobUnavailable) firing: (4) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:36:38] (03CR) 10Filippo Giunchedi: "CI isn't happy about the commit message, config change LGTM though" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [10:40:29] !log stevemunene@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [10:41:12] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:41:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:20] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:42:25] 10SRE, 10Cloud-VPS, 10Toolforge: Some of my tools (eg wikidata-todo) just start throwing 504 errors - https://phabricator.wikimedia.org/T346126 (10fnegri) @M2k_dewiki the Kubernetes pod was stuck, I restarted it manually with `webservice stop` followed by `webservice start`, and https://templatetransclusionc... 
[10:42:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on P{maps200[5,6].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad) [10:42:28] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [10:42:55] (03CR) 10Vgutierrez: [C: 03+1] trafficserver: switch static-codereview.wikimedia.org to wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [10:44:33] !log stevemunene@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [10:44:54] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:44:54] !log stevemunene@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [10:46:37] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on P{maps200[7,8].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad) [10:46:48] (03PS3) 10Brouberol: Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) [10:46:50] (03PS8) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) [10:46:52] (03PS8) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [10:46:54] (03PS8) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [10:46:56] (03PS8) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 
(https://phabricator.wikimedia.org/T336041) [10:46:58] (03PS8) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [10:47:06] 10SRE, 10Traffic: VarnishKafka logrotate fails on bookworm - https://phabricator.wikimedia.org/T346602 (10Vgutierrez) 05Open→03Resolved a:03Fabfur [10:47:14] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10Vgutierrez) [10:47:48] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Aklapper) >>! In T335879#9173531, @Volans wrote: > The wmflib code could be improved to distinguish between a permission error and any oth... [10:48:23] !log stevemunene@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [10:49:18] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [10:59:12] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Volans) @Aklapper What I meant is that there is no way to distinguish between the "no access" error and any other error that could be a mi... 
[11:01:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on P{maps200[7,8].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad) [11:05:45] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on P{maps201[0].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad) [11:13:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on P{maps201[0].codfw.wmnet} and (A:maps-replica or A:maps-replica-codfw or A:maps-replica-eqiad) [11:14:48] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling reboot on A:maps-replica-eqiad [11:15:46] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Aklapper) >>! In T335879#9174123, @Volans wrote: > It's just the message that differ, that is something wmflib should not rely on because... 
[11:16:49] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert) [11:23:21] (03CR) 10Muehlenhoff: [C: 03+2] Move Ganeti to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/957865 (owner: 10Muehlenhoff) [11:30:50] (03PS2) 10Muehlenhoff: Switch ganeti-test to nftables [puppet] - 10https://gerrit.wikimedia.org/r/958400 [11:32:48] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:34:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958400 (owner: 10Muehlenhoff) [11:34:14] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:37:08] !log aborrero@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1005.wikimedia.org [11:42:18] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:43:00] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] kubernetes::master: Switch to PKI for SA signing [puppet] - 10https://gerrit.wikimedia.org/r/958404 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:43:08] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:44:32] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:44:41] !log removed cergen certs from the list of trusted service account token signers on all kubernetes clusters - T329826 [11:44:42] !log aborrero@cumin1001 START - Cookbook sre.dns.netbox [11:44:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:44] T329826: Kubernetes v1.23 use PKI for service-account signing (instead of cergen) - https://phabricator.wikimedia.org/T329826 [11:45:09] !log aborrero@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1005 - aborrero@cumin1001" [11:45:10] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:46:00] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:46:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudservices1005 - aborrero@cumin1001" [11:46:02] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:11] !log aborrero@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:46:12] !log aborrero@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudservices1005.wikimedia.org [11:47:05] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by aborrero@cumin1001 for hosts: `cloudservices1005.wikimedia.org` - cloudservices10... 
[11:50:35] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) [11:53:30] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1140.eqiad.wmnet with OS bullseye [11:54:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling reboot on A:maps-replica-eqiad [11:54:38] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST revisions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [11:56:22] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43333/console" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [11:56:33] (03PS7) 10Stevemunene: admin: Create analytics-wmde system user and airflow admin group [puppet] - 10https://gerrit.wikimedia.org/r/949001 (https://phabricator.wikimedia.org/T340648) [11:58:03] (03PS1) 10Kamila Součková: db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) [11:58:19] (03CR) 10Jelto: [C: 03+2] trafficserver: switch static-codereview.wikimedia.org to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958433 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [11:58:31] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [11:59:38] (KubernetesAPILatency) resolved: (8) High Kubernetes API latency (LIST clusterissuers) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:00:08] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10Vgutierrez) Thanks!, still blocked on @thcipriani for deployment group membership [12:00:58] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43334/console" [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [12:02:08] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Ensure that logs are created with correct permissions. [puppet] - 10https://gerrit.wikimedia.org/r/957940 (owner: 10Slyngshede) [12:02:38] (03PS1) 10Giuseppe Lavagetto: kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463 [12:02:40] (03CR) 10Marostegui: [C: 03+1] db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:03:12] (03CR) 10Kamila Součková: [C: 04-2] "adding -2 for now to avoid accidental merge" [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:03:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [12:03:36] (03PS1) 10Marostegui: wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) [12:04:02] (03CR) 10Marostegui: [C: 04-2] "To be pushed after the switch" [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui) [12:05:53] (03CR) 10Marostegui: [C: 03+1] "Kamila the reason why es1, es2 and es3 aren't in Orchestrator is because they are standalone hosts and orchestrator doesn't support that (" [dns] - 
10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:06:38] (03PS1) 10Arturo Borrero Gonzalez: openstack: drop references to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) [12:06:41] (03PS4) 10Arnaudb: icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) [12:07:09] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1140.eqiad.wmnet with reason: host reimage [12:07:38] (03CR) 10Clément Goubert: [C: 03+1] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [12:07:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: drop references to cloudcontrol1005 [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:08:03] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1141.eqiad.wmnet with OS bullseye [12:08:15] (03CR) 10Majavah: openstack: drop references to cloudcontrol1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:08:19] (03CR) 10Kamila Součková: [C: 04-2] db: Switch DNS master alias to codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:10:11] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1140.eqiad.wmnet with reason: host reimage [12:13:05] (03PS1) 10Arturo Borrero Gonzalez: openstack: remove overrides for designate_hosts [puppet] - 10https://gerrit.wikimedia.org/r/958467 (https://phabricator.wikimedia.org/T346042) [12:13:32] (03CR) 10Majavah: [C: 03+1] openstack: 
remove overrides for designate_hosts [puppet] - 10https://gerrit.wikimedia.org/r/958467 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:13:34] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: drop references to cloudcontrol1005 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958465 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:13:45] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: remove overrides for designate_hosts [puppet] - 10https://gerrit.wikimedia.org/r/958467 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [12:18:03] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:19:02] (03PS2) 10Clément Goubert: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [12:19:11] (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [12:20:47] (03CR) 10Marostegui: [C: 03+1] "Those es1-es3 aren't correct, but we don't really use them that much as "master" as they are all masters really." [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:21:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [12:21:30] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from a laptop - https://phabricator.wikimedia.org/T343336 (10fnegri) 05Open→03Resolved a:03fnegri This was fixed by @taavi in https://gerrit.wikimedia.org/r/c... 
[12:21:36] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1141.eqiad.wmnet with reason: host reimage [12:23:42] !log installing libwebp security updates on bullseye [12:23:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:23:54] (03CR) 10JMeybohm: [C: 04-1] profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890) (owner: 10Elukey) [12:24:00] !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1140.eqiad.wmnet with OS bullseye [12:24:34] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1141.eqiad.wmnet with reason: host reimage [12:26:41] 10SRE-tools, 10Cloud-VPS, 10Infrastructure-Foundations, 10Spicerack, 10cloud-services-team: spicerack: sal_logger does not work when running from CloudVPS instances - https://phabricator.wikimedia.org/T343335 (10fnegri) 05Open→03Resolved a:03fnegri Similarly to T343336, this was also fixed by @taav... 
[12:27:15] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: furud.codfw.wmnet [12:27:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: furud.codfw.wmnet [12:28:40] (03CR) 10Jcrespo: "Let me know what do you think for an amend 😊" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [12:29:31] (03PS2) 10Kamila Součková: db: Switch DNS master alias to codfw [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) [12:29:50] (03CR) 10Kamila Součková: db: Switch DNS master alias to codfw (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:30:39] (03CR) 10Clément Goubert: [C: 03+1] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [12:32:44] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wdqs-all [12:33:16] (03CR) 10Peter Fischer: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [12:34:12] (03CR) 10Peter Fischer: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [12:36:32] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:36:38] (03CR) 10Fabfur: [V: 03+2 C: 03+2] add support for unix sockets [software/purged] - 10https://gerrit.wikimedia.org/r/957362 (owner: 10Fabfur) [12:36:50] (03CR) 10Brouberol: "Seems like pcc can't run on the 
kafka-jumbo hosts given that the previous change request broke the node -> role assignment." [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [12:37:46] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [12:37:58] (03PS1) 10Kamila Součková: wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) [12:38:39] (03CR) 10Kamila Součková: [C: 04-2] "adding -2 for now to avoid accidental merge" [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:40:46] (03CR) 10Kamila Součková: [C: 04-2] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:41:03] (03CR) 10Marostegui: [C: 03+1] "\o/" [dns] - 10https://gerrit.wikimedia.org/r/958462 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [12:41:22] (03PS1) 10JMeybohm: chromium-render: Update to use certmanager [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) [12:42:10] (03PS2) 10JMeybohm: Update chromium-render to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958473 (https://phabricator.wikimedia.org/T300033) [12:43:07] (03PS1) 10Jelto: miscweb/microsites: move monitoring of static-codereview to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/958474 (https://phabricator.wikimedia.org/T346309) [12:43:09] (03PS1) 10Jelto: miscweb/microsites: remove static-codereview resources [puppet] - 
10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) [12:44:48] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) Thanks for the context @Andrew, I was thinking it was something like that thanks for filling in the gaps. I guess the big question I have is there any way to... [12:46:42] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) hey @Jclark-ctr (or @VRiley-WMF) this server should be ready to be re-racked into rack `D5`. [12:47:51] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1141.eqiad.wmnet with OS bullseye [12:48:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wdqs-all [12:48:17] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [12:48:33] 10SRE, 10Cloud-VPS: cloudservices1006 using 10. address to send DNS NOTIFYs to cloudservices1005 - https://phabricator.wikimedia.org/T346385 (10cmooney) >>! In T346385#9173361, @Andrew wrote: > That 10. address is in the current pool config. It's probably wrong, but also everything is changing constantly so I'... 
[12:52:42] (03CR) 10Gmodena: cirrus: add the mediawiki.cirrussearch.page_rerender stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [12:54:07] (03PS4) 10Anzx: add extranamespacenames for kannada-kn language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958050 (https://phabricator.wikimedia.org/T346583) [12:54:09] (03PS1) 10JMeybohm: Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) [12:54:31] (03PS2) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) [12:54:44] (03CR) 10CI reject: [V: 04-1] Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [12:55:56] (03PS2) 10JMeybohm: Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1300). [13:00:05] cormacparle and aanzx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:20] * cormacparle waves [13:01:51] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [13:01:55] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:02:17] !log set max-repeaters to 30 for cr3-eqsin in librenms - T346606 [13:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:02:29] T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 [13:02:50] hey. I can deploy [13:02:55] (03PS5) 10Majavah: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle) [13:02:55] o/ [13:03:06] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [13:03:11] (03PS1) 10Fabfur: add simple Makefile [software/purged] - 10https://gerrit.wikimedia.org/r/958477 [13:03:15] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover: list new primary DC servers first in debug.json - https://phabricator.wikimedia.org/T346472 (10kamila) [13:03:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle) [13:04:06] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:04:14] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [13:04:25] (03Merged) 10jenkins-bot: Disable UploadWizard CTA for MachineVision [mediawiki-config] - 10https://gerrit.wikimedia.org/r/955967 (https://phabricator.wikimedia.org/T345187) (owner: 10Cparle) [13:04:27] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [13:04:42] !log taavi@deploy1002 Started scap: 
Backport for [[gerrit:955967|Disable UploadWizard CTA for MachineVision (T345187)]] [13:04:53] T345187: [Spike] Figure out what's involved in turning MachineVision off - https://phabricator.wikimedia.org/T345187 [13:06:07] !log taavi@deploy1002 taavi and cparle: Backport for [[gerrit:955967|Disable UploadWizard CTA for MachineVision (T345187)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:06:16] cormacparle: please test your patch [13:06:23] (03CR) 10Clément Goubert: [C: 03+1] debug.json: List primary DC servers first [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957996 (https://phabricator.wikimedia.org/T346472) (owner: 10Kamila Součková) [13:06:24] 👍 [13:07:38] (03PS1) 10Muehlenhoff: sre.maps.reboot: Retire legacy cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) [13:08:37] (03CR) 10Btullis: [C: 03+1] Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:09:06] taavi: seems fine, thank you [13:09:11] thanks, syncing [13:09:13] !log taavi@deploy1002 taavi and cparle: Continuing with sync [13:09:17] (03CR) 10Volans: [C: 03+1] "LGTM, kind reminder to follow after merging:" [cookbooks] - 10https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff) [13:10:06] (03CR) 10Brouberol: [C: 03+2] Fix kafka-jumbo node regular expression [puppet] - 10https://gerrit.wikimedia.org/r/958436 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:10:43] aanzx: hey. backports (changes to a mediawiki/* repository) generally should have a +2 before being scheduled for deployment (and preferably go out via the train, unless they're particularly urgent). 
I'm not familiar with the languages system, but I added some people as reviewers [13:11:18] !log depool cp4052 for bookworm testing - T342154 [13:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:30] T342154: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 [13:12:28] aanzx: for the last patch, why is wmgUseWikidataPageBanner being changed? [13:12:35] taavi: ok then https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/958050 patch can also be scheduled for deployment later [13:15:59] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:955967|Disable UploadWizard CTA for MachineVision (T345187)]] (duration: 11m 16s) [13:16:02] T345187: [Spike] Figure out what's involved in turning MachineVision off - https://phabricator.wikimedia.org/T345187 [13:16:15] thanks taavi ! [13:16:53] taavi: for enabling minervasite notice wmgUseWikidataPageBanner is not required? [13:18:18] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:19:42] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:22:36] (03PS3) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) [13:22:42] aanzx: I don't know. from the comment it seems like WikidataPageBanner requires $wgMinervaEnableSiteNotice to be true, but I don't know about the opposite. 
it'd be a good idea to ask someone who knows [13:22:44] (03CR) 10CI reject: [V: 04-1] Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx) [13:24:17] taavi: ok i will schedule this patch also for later [13:24:24] thanks! [13:24:36] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx rolling restart_daemons on A:wcqs-public [13:25:28] (03CR) 10Jelto: [C: 03+1] "lgtm" [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [13:25:52] (03Abandoned) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx) [13:26:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx (exit_code=0) rolling restart_daemons on A:wcqs-public [13:29:43] (03PS1) 10JMeybohm: Update developer-portal to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958479 (https://phabricator.wikimedia.org/T300033) [13:29:45] (03CR) 10Elukey: alertmanager: create ml team alerts (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [13:29:59] (03CR) 10Nikerabbit: [C: 03+1] Enable MinT translation service on Meta-Wiki - rollout #5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958406 (https://phabricator.wikimedia.org/T341445) (owner: 10Abijeet Patro) [13:31:54] (03CR) 10Jelto: [C: 03+1] "looks good. I'm also not sure about the additional SANs. I'd also think the additional SANs are terminated by the ingress. But I can test " [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [13:37:51] Hi, is UTC afternoon backport window still in progress? 
[13:38:53] !log force-set max-repeaters to 20 for cr2-eqsin and cr3-eqsin - T346606 [13:38:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:57] (03CR) 10Herron: [V: 03+1] "are we still in a holding pattern on this one?" [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron) [13:38:57] T346606: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 [13:39:06] (03PS1) 10Muehlenhoff: Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) [13:39:30] (03CR) 10CI reject: [V: 04-1] Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:42:50] Nevermind, I'll add my patch for next one. [13:43:58] (03PS2) 10Muehlenhoff: Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) [13:44:21] (03CR) 10Filippo Giunchedi: [C: 03+1] "No we can go ahead" [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron) [13:45:54] (03PS2) 10Zoranzoki21: Enable WikiLove on arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391) [13:46:14] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm [13:47:13] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2001.codfw.wmnet [13:48:11] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [13:48:38] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [13:51:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2001.codfw.wmnet [13:56:55] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [13:57:30] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr) @aborrero. I have moved server physically and in netbox. i did not delete any interfaces out of netbox new Cableid. 20220117 port 4... [13:57:36] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2002.codfw.wmnet [14:00:22] 10SRE, 10Infrastructure-Foundations, 10netops: scrape ripe atlas data for a few anchors at other large networks - https://phabricator.wikimedia.org/T252890 (10CDanis) 05Open→03Declined >>! In T252890#9165519, @ayounsi wrote: > @CDanis Is that still needed now that we have NEL? It would be interesting t... 
[14:00:29] (03CR) 10BBlack: [C: 03+2] wikireplicas: restore pybal monitoring [puppet] - 10https://gerrit.wikimedia.org/r/924508 (https://phabricator.wikimedia.org/T337446) (owner: 10BBlack) [14:00:38] (03PS1) 10JMeybohm: Copy mesh.certificate_1.0.0 to 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/958482 (https://phabricator.wikimedia.org/T300033) [14:00:40] (03PS1) 10JMeybohm: mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) [14:00:42] (03PS1) 10JMeybohm: Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) [14:01:31] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2002.codfw.wmnet [14:01:37] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudbackup1001-dev.eqiad.wmnet with reason: host reimage [14:01:47] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-etcd2003.codfw.wmnet [14:02:52] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T346450 (10Jhancock.wm) a:03Jhancock.wm [14:04:17] !log lvs1020, lvs1018: restarting pybal to re-enable healthchecks for wikireplicas ( T337446 -> https://gerrit.wikimedia.org/r/924508 ) [14:04:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:28] T337446: Rebuild sanitarium hosts - https://phabricator.wikimedia.org/T337446 [14:05:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-etcd2003.codfw.wmnet [14:06:05] 10ops-eqiad: PowerSupplyFailure - https://phabricator.wikimedia.org/T346387 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr Reseated powersupply cleared fault [14:06:31] (JobUnavailable) firing: (5) Reduced availability for job jmx_puppetdb in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:26] 10SRE, 10ops-eqiad, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10Jclark-ctr) [14:08:29] (03CR) 10Jelto: [C: 03+2] Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:09:06] (03PS1) 10Btullis: Add an nginx rule to block scripts from repositorygroup paths [puppet] - 10https://gerrit.wikimedia.org/r/958486 (https://phabricator.wikimedia.org/T318962) [14:09:22] (03Merged) 10jenkins-bot: Update miscweb to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:09:39] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:12:33] (03PS3) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T339890) [14:13:15] (03PS4) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) [14:13:18] (03CR) 10DCausse: rdf-streaming-updater: start adding per-env ZK path root (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [14:13:20] !log jelto@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [14:13:34] (03CR) 10Elukey: 
profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [14:15:02] (03PS3) 10Elukey: modules: copy configuration 1.4.1 to 1.5.0 for mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956440 (https://phabricator.wikimedia.org/T346638) [14:15:13] (03PS4) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638) [14:15:19] !log jelto@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [14:15:28] (03PS5) 10Elukey: modules: add configuration 1.5.0 to mesh [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638) [14:16:31] (JobUnavailable) firing: (6) Reduced availability for job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:17:55] (03PS1) 10Giuseppe Lavagetto: wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709) [14:17:57] (03PS1) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709) [14:18:21] (03PS1) 10Giuseppe Lavagetto: Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) [14:18:34] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudbackup1001-dev.eqiad.wmnet with OS bookworm [14:18:45] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.codfw.wmnet with OS bookworm [14:19:37] (JobUnavailable) firing: (6) Reduced availability for 
job jmx_puppetdb in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:20] (03CR) 10Jelto: [C: 03+2] "deploy in staging looks good, proceeding with codfw and eqiad" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958476 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [14:20:52] (03PS2) 10Fabfur: add simple Makefile and README [software/purged] - 10https://gerrit.wikimedia.org/r/958477 [14:21:33] (03PS3) 10Fabfur: add simple Makefile and README [software/purged] - 10https://gerrit.wikimedia.org/r/958477 [14:21:41] !log jelto@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [14:22:14] PROBLEM - BFD status on cloudsw1-b1-codfw.mgmt is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:23:20] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:24:44] !log jelto@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [14:26:31] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:26:44] !log jelto@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [14:27:40] 10ops-codfw: InterfaceSpeedError - https://phabricator.wikimedia.org/T346450 (10Jhancock.wm) 05Open→03Resolved [14:27:59] (03PS1) 10Jclark-ctr: add dbstore1008-1009 to site.pp [puppet] - 10https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) [14:28:46] (03CR) 10Jclark-ctr: [C: 03+2] add dbstore1008-1009 to site.pp [puppet] - 
10https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) (owner: 10Jclark-ctr) [14:29:04] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1142.eqiad.wmnet with OS bullseye [14:29:58] !log jelto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [14:32:17] (03PS5) 10Arnaudb: icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) [14:32:51] !log use certmanager instead of certgen in miscweb namespace - T300033 [14:32:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:54] T300033: Use cert-manager for service-proxy certificate creation - https://phabricator.wikimedia.org/T300033 [14:34:13] (03CR) 10Jcrespo: icinga: add arnaudb to userlist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [14:38:30] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage [14:39:44] (03CR) 10Arnaudb: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [14:41:14] 10SRE, 10Phabricator, 10Security-Team, 10SecTeam-Processed, and 2 others: Require 2FA for members of acl*sre-team - https://phabricator.wikimedia.org/T328746 (10sbassett) 05In progress→03Resolved a:03Reedy >>! In T328746#9171419, @RLazarus wrote: > I don't have edit access to #acl_security. Thanks f... 
[14:41:36] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.codfw.wmnet with reason: host reimage [14:41:58] 10SRE, 10Phabricator, 10Security-Team, 10SecTeam-Processed, and 2 others: Require 2FA for members of acl*sre-team - https://phabricator.wikimedia.org/T328746 (10sbassett) [14:42:44] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Nat Hillard - https://phabricator.wikimedia.org/T342588 (10Milimetric) approved (sorry this slipped through) [14:42:44] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage [14:44:05] (03CR) 10Filippo Giunchedi: [C: 03+1] "Ship it!" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [14:44:52] (03CR) 10Jcrespo: [C: 03+1] icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [14:45:35] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [14:45:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1036.eqiad.wmnet with OS bullseye [14:45:42] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1036.eqiad.wmnet with OS bullseye [14:45:46] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1142.eqiad.wmnet with reason: host reimage [14:45:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host 
kubernetes1036.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... [14:45:54] (03CR) 10Arnaudb: [C: 03+2] icinga: add arnaudb to userlist [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [14:47:56] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1143.eqiad.wmnet with OS bullseye [14:52:39] (03PS2) 10Ilias Sarantopoulos: alertmanager: create ml team alerts [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) [14:54:18] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [14:54:25] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye [14:54:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes1038.eqiad.wmnet with OS bullseye [14:54:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1001 for host kubernetes1038.eqiad.wmnet with OS bullseye executed with errors: - kubernetes10... 
[14:55:02] (03CR) 10Ilias Sarantopoulos: alertmanager: create ml team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [14:57:02] (03PS1) 10Brouberol: [eventgate-analytics-external] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) [14:57:04] (03PS1) 10Brouberol: [eventgate-analytics] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958497 (https://phabricator.wikimedia.org/T33604) [14:57:06] (03PS1) 10Brouberol: [eventstream-internal] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958498 (https://phabricator.wikimedia.org/T336041) [14:57:08] (03PS1) 10Brouberol: [mw-page-content-change-enrich] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958499 (https://phabricator.wikimedia.org/T336041) [14:58:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:58] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API not returning latest page when queried title is a redirect - https://phabricator.wikimedia.org/T346579 (10akosiaris) I 'll admit I am a bit stumped here. This is clearly not the CDN's fault as RESTBase exhibits the same behavior while also violating wh... 
[15:01:23] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage [15:03:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:04:20] (03PS3) 10AOkoth: vrts: add ticket-test on wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/957747 (https://phabricator.wikimedia.org/T340027) [15:04:27] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1143.eqiad.wmnet with reason: host reimage [15:04:38] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 11, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:04:44] RECOVERY - BFD status on cloudsw1-b1-codfw.mgmt is OK: UP: 5 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:11:21] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1142.eqiad.wmnet with OS bullseye [15:13:56] !log upload swift_2.26.0-10+deb11u1+wmf1_amd64.changes to apt1001 [15:13:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:56] !log depool ms-fe2009 to install new swift packages [15:18:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:56] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift-container-stats_mw-media.service,swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:23:19] (03CR) 10Btullis: [C: 03+2] Add an nginx rule to block scripts from repositorygroup paths [puppet] - 
10https://gerrit.wikimedia.org/r/958486 (https://phabricator.wikimedia.org/T318962) (owner: 10Btullis) [15:24:47] (03PS9) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) [15:24:49] (03PS9) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [15:24:51] (03PS9) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [15:24:53] (03PS9) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) [15:24:55] (03PS9) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [15:25:21] (03CR) 10Jcrespo: [C: 03+1] "@Filippo (godog): when testing the change, we can see the new user on the config file, but I think we were bitten (again) by https://phabr" [puppet] - 10https://gerrit.wikimedia.org/r/957815 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [15:25:48] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:25:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1143.eqiad.wmnet with OS bullseye [15:26:41] !log repool ms-fe2009 with new swift packages [15:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:14] !log install new swift packages on ms-be2044 [15:27:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:35] (03CR) 10Jcrespo: [C: 03+1] icinga: add arnaudb to userlist (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957815 
(https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [15:28:42] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host kubernetes1036 [15:29:35] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1038.eqiad.wmnet with OS bullseye [15:29:39] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1047.eqiad.wmnet with OS bullseye [15:29:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye [15:29:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye [15:29:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host kubernetes1036 [15:30:05] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1530). 
[15:30:45] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage [15:30:47] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage [15:31:12] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958512 (https://phabricator.wikimedia.org/T128546) [15:31:25] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958512 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:31:56] (03CR) 10SBassett: Allow FundraiseUp scripts in Donatewiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [15:32:15] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958512 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [15:32:52] (03PS2) 10Brouberol: [eventgate-analytics] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958497 (https://phabricator.wikimedia.org/T336041) [15:32:54] (03PS2) 10Brouberol: [eventstream-internal] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958498 (https://phabricator.wikimedia.org/T336041) [15:32:56] (03PS2) 10Brouberol: [mw-page-content-change-enrich] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958499 (https://phabricator.wikimedia.org/T336041) [15:34:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1038.eqiad.wmnet with reason: host reimage [15:36:52] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1047.eqiad.wmnet with reason: host reimage [15:40:03] (KubernetesAPILatency) firing: High Kubernetes API latency 
(LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:43:34] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1144.eqiad.wmnet with OS bullseye [15:44:29] !log jdrewniak@deploy1002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:958512| Bumping portals to master (T128546)]] (duration: 08m 45s) [15:44:32] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:45:03] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:45:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10VRiley-WMF) pc1016 - C 6. U 31. 
port 30 CableID 3252 [15:46:10] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:49:42] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:03] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (POST pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:51:21] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:51:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes1036.eqiad.wmnet with OS bullseye [15:51:56] (03CR) 10Alexandros Kosiaris: [C: 03+1] profile::mediawiki::common: set default histogram buckets [puppet] - 10https://gerrit.wikimedia.org/r/954114 (https://phabricator.wikimedia.org/T344751) (owner: 10Herron) [15:52:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye [15:53:00] !log jdrewniak@deploy1002 Synchronized portals: Wikimedia Portals Update: [[gerrit:958512| Bumping portals to master (T128546)]] (duration: 08m 31s) [15:53:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:53:01] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
kubernetes1038.eqiad.wmnet with OS bullseye [15:53:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage [15:53:04] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [15:53:11] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1038.eqiad.wmnet with OS bullseye completed: - kubernetes1038 (**PAS... [15:53:29] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:55:03] (KubernetesAPILatency) resolved: (4) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:55:27] (03CR) 10Btullis: [C: 03+1] [eventgate-analytics-external] Add kafka-jumbo1010 to config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [15:56:14] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes1036.eqiad.wmnet with reason: host reimage [15:57:01] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage [15:57:04] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:57:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes1047.eqiad.wmnet with OS bullseye [15:57:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install 
kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1047.eqiad.wmnet with OS bullseye completed: - kubernetes1047 (**PAS... [15:58:27] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:59:29] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.284 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:59:52] 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (10aborrero) [15:59:59] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba) [16:00:31] 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (10aborrero) [16:01:10] 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (10aborrero) [16:01:20] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10aborrero) [16:01:37] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1144.eqiad.wmnet with reason: host reimage [16:02:09] 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches) - https://phabricator.wikimedia.org/T346661 (10aborrero) [16:02:35] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage 
for host an-worker1145.eqiad.wmnet with OS bullseye [16:03:16] 10SRE, 10ops-codfw, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol200[6-8]-dev, cloudnet200[7-8]-dev - https://phabricator.wikimedia.org/T342456 (10aborrero) [16:05:53] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:09:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10aborrero) [16:10:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10aborrero) [16:11:24] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:11:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10aborrero) [16:12:13] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10aborrero) [16:12:59] !log jnuche@deploy1002 Installing scap version "4.61.1" for 601 hosts [16:13:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [16:13:33] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host 
kubernetes1036.eqiad.wmnet with OS bullseye [16:13:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kubernetes1036.eqiad.wmnet with OS bullseye completed: - kubernetes1036 (**PAS... [16:14:07] !log jnuche@deploy1002 Installation of scap version "4.61.1" completed for 601 hosts [16:14:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10User-aborrero, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10] cloudnet100[7-8] - https://phabricator.wikimedia.org/T342455 (10aborrero) These servers are going to be part of the `eqiad2dev` deployment, and should get the `-dev`prefix on them,... [16:15:32] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage [16:16:01] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (10Jhancock.wm) [16:17:55] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1145.eqiad.wmnet with reason: host reimage [16:23:24] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbstore1008'] [16:23:28] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbstore1009'] [16:23:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['dbstore1009'] [16:24:19] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['dbstore1009'] [16:25:03] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1144.eqiad.wmnet with OS bullseye [16:28:11] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully 
operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbstore1008'] [16:30:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['dbstore1009'] [16:39:23] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:27] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1145.eqiad.wmnet with OS bullseye [16:42:09] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10 [16:43:04] (03CR) 10RobH: [C: 03+1] "I'm not sure if that role is going to work as its not specifically defined in the manifests for insetup role but we can give it a shot and" [puppet] - 10https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) (owner: 10Jclark-ctr) [16:46:11] (03CR) 10Volans: "post-merge -1, this role doesn't exists, the reimage will fail because puppet will fail" [puppet] - 10https://gerrit.wikimedia.org/r/958491 (https://phabricator.wikimedia.org/T342862) (owner: 10Jclark-ctr) [16:52:01] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 
(expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:27] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1700) [17:00:05] ryankemper: OwO what's this, a deployment window?? Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T1700). nyaa~ [17:18:25] 10SRE, 10Privacy Engineering, 10Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409 (10ssingh) a:05ssingh→03None [17:18:36] 10SRE, 10Privacy Engineering, 10Traffic: Create and document Wikidough's privacy policy - https://phabricator.wikimedia.org/T275409 (10ssingh) a:03ssingh [17:21:06] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) a:05kostajh→03None [17:21:58] 10SRE, 10Continuous-Integration-Infrastructure, 10Release-Engineering-Team (Yak Shaving 🐃🪒): Have linters/tests results show up as comments in files on gerrit - https://phabricator.wikimedia.org/T209149 (10kostajh) https://wikitech.wikimedia.org/wiki/Tool:Fix_Suggester_Bot got some of the way there, but I do... 
[17:23:17] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Data-Platform-SRE, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10akosiaris) a:05akosiaris→03None [17:32:19] (03PS2) 10Brouberol: Add kafka-jumbo1010.eqiad.wmnet to apps config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) [17:33:16] (03PS1) 10RobH: dbstore insetup role adjustment [puppet] - 10https://gerrit.wikimedia.org/r/958531 (https://phabricator.wikimedia.org/T342862) [17:33:37] (03CR) 10RobH: [C: 03+2] dbstore insetup role adjustment [puppet] - 10https://gerrit.wikimedia.org/r/958531 (https://phabricator.wikimedia.org/T342862) (owner: 10RobH) [17:40:07] 10SRE, 10MediaWiki-Documentation, 10serviceops-radar, 10Documentation, and 2 others: Repair "svn.wikimedia.org/doc/" redirect for doc.wikimedia.org - https://phabricator.wikimedia.org/T109950 (10Dereckson) Who are the ones responsible for this review? [17:46:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:46:55] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [17:49:01] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.317 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:51:08] (03CR) 10Eevans: [C: 03+1] sre.maps.reboot: Retire legacy cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/958478 (https://phabricator.wikimedia.org/T317855) (owner: 10Muehlenhoff) [17:55:23] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:55:49] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:56:47] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:57:13] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:58:05] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:59:29] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:12:57] 10SRE-Sprint-Week-Sustainability-March2023, 10Deployments, 10serviceops-radar, 10Release-Engineering-Team (Radar), and 2 others: Remove provisioning for 
'mwscript', 'foreachwikiindblist' etc from deployment host - https://phabricator.wikimedia.org/T253822 (10dancy) a:05dancy→03None [18:29:37] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:30:58] (03CR) 10Urbanecm: beta: Do not reference image-suggestion-api.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: 10Urbanecm) [18:33:56] (03CR) 10Eevans: [C: 03+2] Revert "install: Use from-scratch partman recipe for restbase1030" [puppet] - 10https://gerrit.wikimedia.org/r/956063 (owner: 10Eevans) [18:40:01] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:42:23] (03PS1) 10Milimetric: wikireplicas: add user_is_temp column to user view [puppet] - 10https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) [18:42:33] (03CR) 10Eevans: [C: 03+2] cassandra: remove cassandra/twcs deployment [puppet] - 10https://gerrit.wikimedia.org/r/955412 (https://phabricator.wikimedia.org/T341732) (owner: 10Eevans) [18:45:12] 10SRE, 10serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10RLazarus) a:05RLazarus→03None [18:45:18] 10SRE, 10serviceops: Run httpbb periodically - https://phabricator.wikimedia.org/T289202 (10RLazarus) a:03RLazarus [18:45:34] 10SRE, 10Wikimedia-Apache-configuration, 10serviceops: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (10RLazarus) a:05RLazarus→03None [18:45:44] 10SRE, 10Wikimedia-Apache-configuration, 10serviceops: Investigate and restore K.A.Z httpbb test - https://phabricator.wikimedia.org/T289022 (10RLazarus) a:03RLazarus 
[19:10:02] (03PS2) 10AOkoth: ats: add ticket-test [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) [19:10:04] (03PS1) 10AOkoth: vrts: vrts1002 change global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/958565 (https://phabricator.wikimedia.org/T340027) [19:14:10] (03CR) 10Brouberol: [C: 03+2] Add kafka-jumbo1010.eqiad.wmnet to apps config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [19:14:15] (03PS2) 10AOkoth: vrts: vrts1002 change global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/958565 (https://phabricator.wikimedia.org/T340027) [19:15:03] (03Merged) 10jenkins-bot: Add kafka-jumbo1010.eqiad.wmnet to apps config [deployment-charts] - 10https://gerrit.wikimedia.org/r/958496 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [19:22:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dbstore1008.eqiad.wmnet with OS bullseye [19:22:02] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host dbstore1009.eqiad.wmnet with OS bullseye [19:22:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye [19:22:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye [19:34:06] (03PS6) 10Ilias Sarantopoulos: ml-services: increase memory for eswiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) [19:38:34] (03CR) 10Ilias Sarantopoulos: [C: 03+2] 
ml-services: increase memory for eswiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [19:39:40] (03PS7) 10Ilias Sarantopoulos: ml-services: increase memory for eswiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) [19:41:18] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: increase memory for eswiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [19:42:21] (03Merged) 10jenkins-bot: ml-services: increase memory for eswiki [deployment-charts] - 10https://gerrit.wikimedia.org/r/958052 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [19:43:49] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [19:45:27] RECOVERY - Check systemd state on stat1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:46:32] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . 
[19:50:48] 10SRE: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10dr0ptp4kt) [19:51:38] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:56:16] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba) [19:56:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH inferenceservices) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:57:51] (03CR) 10Sergio Gimeno: [C: 03+1] Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: 10Urbanecm) [20:00:06] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: (Dis)respected human, time to deploy UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T2000). Please do the needful. [20:00:06] Dreamy_Jazz, Kizule, and Sergi0: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:17] (03CR) 10Bking: "@btullis @ryankemper are y'all OK with this approach? 
I've used the data-engineering team as a contact until we figure out contacts in T34" [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:00:31] hi [20:00:41] \o [20:02:06] hi i can deploy [20:02:11] :D [20:02:46] (03PS3) 10Clare Ming: clienthints: Pin wgCheckUserDisplayClientHints to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: 10Dreamy Jazz) [20:03:27] (03CR) 10Clare Ming: [C: 03+2] Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: 10Urbanecm) [20:03:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: 10Dreamy Jazz) [20:04:50] (03Merged) 10jenkins-bot: clienthints: Pin wgCheckUserDisplayClientHints to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958024 (https://phabricator.wikimedia.org/T337942) (owner: 10Dreamy Jazz) [20:05:05] !log cjming@deploy1002 Started scap: Backport for [[gerrit:958024|clienthints: Pin wgCheckUserDisplayClientHints to false (T337942)]] [20:05:11] T337942: Display client hint data - https://phabricator.wikimedia.org/T337942 [20:05:18] (03CR) 10Btullis: [C: 04-1] "Sorry I haven't had much of a chance to address this with you yet, but I don't think it's ready yet." [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:05:47] (03PS2) 10Clare Ming: clienthints: Enable purging of data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: 10Dreamy Jazz) [20:06:07] Thanks. 
I won't be able to test this one as this config does not exist but will do once a patch that depends on this config change is merged. [20:06:13] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage [20:06:14] As such there isn't anything to test [20:06:29] Dreamy_Jazz: roger that - i'll go ahead and sync then [20:06:32] !log cjming@deploy1002 cjming and dreamyjazz: Backport for [[gerrit:958024|clienthints: Pin wgCheckUserDisplayClientHints to false (T337942)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:06:38] !log cjming@deploy1002 cjming and dreamyjazz: Continuing with sync [20:09:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1009.eqiad.wmnet with reason: host reimage [20:12:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage [20:13:23] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:958024|clienthints: Pin wgCheckUserDisplayClientHints to false (T337942)]] (duration: 08m 18s) [20:13:27] T337942: Display client hint data - https://phabricator.wikimedia.org/T337942 [20:13:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: 10Dreamy Jazz) [20:14:45] (03Merged) 10jenkins-bot: clienthints: Enable purging of data on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958025 (https://phabricator.wikimedia.org/T257893) (owner: 10Dreamy Jazz) [20:15:00] !log cjming@deploy1002 Started scap: Backport for [[gerrit:958025|clienthints: Enable purging of data on all wikis (T257893)]] [20:15:07] T257893: [EPIC] Support User-Agent Client 
Hints header in CheckUser - https://phabricator.wikimedia.org/T257893 [20:15:13] Thanks. I'll not really be able to test this one either as it relies on jobs that are queued up and I'm not sure that those jobs are sent to mwdebug servers? [20:15:28] Plus it's a random chance as to whether the job is queued. [20:15:59] Dreamy_Jazz: sounds good -- your 1st patch should be live and i'll sync your 2nd patch shortly [20:16:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1008.eqiad.wmnet with reason: host reimage [20:16:05] Great. [20:16:25] Kizule: are you here for your patch? [20:16:27] !log cjming@deploy1002 cjming and dreamyjazz: Backport for [[gerrit:958025|clienthints: Enable purging of data on all wikis (T257893)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:16:38] !log cjming@deploy1002 cjming and dreamyjazz: Continuing with sync [20:17:34] Sergi0: i'll proceed with yours next [20:17:43] great [20:23:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:23:44] (03Merged) 10jenkins-bot: Link recommendations: prevent too large offsets in cirrus queries [extensions/GrowthExperiments] (wmf/1.41.0-wmf.26) - 10https://gerrit.wikimedia.org/r/957871 (https://phabricator.wikimedia.org/T345713) (owner: 10Urbanecm) [20:24:10] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:24:25] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:958025|clienthints: Enable purging of data on all wikis 
(T257893)]] (duration: 09m 24s) [20:24:28] T257893: [EPIC] Support User-Agent Client Hints header in CheckUser - https://phabricator.wikimedia.org/T257893 [20:24:56] !log cjming@deploy1002 Started scap: Backport for [[gerrit:957871|Link recommendations: prevent too large offsets in cirrus queries (T345713)]] [20:24:59] T345713: fixLinkRecommendationData script yields cirrussearch-offset-too-large - https://phabricator.wikimedia.org/T345713 [20:25:05] Dreamy_Jazz: 2nd patch is live [20:25:21] Thanks! [20:25:29] yw! [20:25:48] 10SRE, 10SRE-Access-Requests: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt) [20:26:26] !log cjming@deploy1002 urbanecm and cjming: Backport for [[gerrit:957871|Link recommendations: prevent too large offsets in cirrus queries (T345713)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [20:26:31] sergi0: is your patch testable? 
[20:26:59] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [20:27:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:27:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1009.eqiad.wmnet with OS bullseye [20:27:12] yes, I'll try to run the script once in the debug server, we don't need to wait for it to finish [20:27:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1009.eqiad.wmnet with OS bullseye completed: -... [20:28:08] (in dry mode) [20:28:23] sergi0: i'll wait for your greenlight to sync [20:28:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:28:53] (03PS1) 10Dr0ptp4kt: dr0ptp4kt WDQS, Search, Analytics access [puppet] - 10https://gerrit.wikimedia.org/r/958568 (https://phabricator.wikimedia.org/T346694) [20:29:26] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:29:47] seems ok on my end [20:29:52] great - syncing [20:29:57] !log cjming@deploy1002 urbanecm and cjming: Continuing with sync [20:30:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [20:30:32] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1008.eqiad.wmnet with OS bullseye [20:30:38] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host dbstore1008.eqiad.wmnet with OS bullseye completed: -... [20:31:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jhancock.wm) [20:32:34] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering, 10Data-Platform-SRE: Q1:rack/setup/install dbstore100[89] - https://phabricator.wikimedia.org/T342862 (10Jhancock.wm) 05Open→03Resolved @Btullis completed [20:36:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:36:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:36:36] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:957871|Link recommendations: prevent too large offsets in cirrus queries (T345713)]] (duration: 11m 40s) [20:36:40] T345713: fixLinkRecommendationData script yields cirrussearch-offset-too-large - https://phabricator.wikimedia.org/T345713 [20:36:47] sergi0: should be live! [20:37:01] cool. Thank you! [20:37:12] np! 
[20:37:36] i'll keep the window open for a few more minutes in case Kizule shows up [20:38:39] (03PS3) 10Clare Ming: Enable WikiLove on arwikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957842 (https://phabricator.wikimedia.org/T346391) (owner: 10Zoranzoki21) [20:41:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext (canary) at eqiad - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:41:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [20:42:18] (03PS4) 10C. Scott Ananian: Re-enable Extension:ParserMigration on labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/944978 (https://phabricator.wikimedia.org/T333179) (owner: 10Sbailey) [20:44:07] (03CR) 10Dr0ptp4kt: [C: 03+1] wikireplicas: add user_is_temp column to user view [puppet] - 10https://gerrit.wikimedia.org/r/958543 (https://phabricator.wikimedia.org/T346679) (owner: 10Milimetric) [20:49:18] !log end of UTC late backport window [20:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:45] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:52:12] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Ahoelzl) Regarding public keys: Both 
are now published on the office wiki: https://office.wikimedia.org/wiki/User:AHoelzl-WMF I don't seem to have an AHoelzl-WMF account:... [20:53:09] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:55:16] (03PS5) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [20:56:28] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [20:59:15] (03PS6) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [21:00:05] Reedy, sbassett, Maryum, and manfredi: (Dis)respected human, time to deploy Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230918T2100). Please do the needful. 
[21:00:27] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:13:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1003.eqiad.wmnet [21:15:23] (03PS1) 10Ryan Kemper: wdqs: decom old canary wdqs1003 [puppet] - 10https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) [21:15:35] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:15:58] (03PS2) 10Ryan Kemper: wdqs: decom old canary wdqs1003 [puppet] - 10https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) [21:16:59] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:17:35] (03PS3) 10Ryan Kemper: wdqs: decom wdqs100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) [21:17:49] (03PS4) 10Ryan Kemper: wdqs: decom wdqs100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) [21:18:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:19:51] !log Deployed patch for T344359 [21:19:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:20:48] 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T346699 (10RKemper) [21:22:23] (03CR) 10Bking: [C: 
03+1] wdqs: decom wdqs100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) (owner: 10Ryan Kemper) [21:22:29] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs: decom wdqs100[3,4] [puppet] - 10https://gerrit.wikimedia.org/r/958572 (https://phabricator.wikimedia.org/T344198) (owner: 10Ryan Kemper) [21:23:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:29:09] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:30:51] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jul-Sep-2023), 10Datacenter-Switchover: CommRel support for September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345265 (10UOzurumba) [21:36:55] (03CR) 10Btullis: Increase the kafka-jumbo maximum message size to 10 MB (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [21:40:10] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [21:45:56] !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [21:47:21] (03PS7) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [21:48:33] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:48:49] PROBLEM - 
restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:49:03] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:49:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [21:49:17] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:49:18] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wdqs1003.eqiad.wmnet [21:50:13] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:50:23] (03PS8) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [21:50:27] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:51:19] !log ryankemper@cumin1001 START - Cookbook sre.hosts.decommission for hosts wdqs1004.eqiad.wmnet [21:51:35] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:54:47] (03PS9) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 
10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [21:56:00] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [21:58:53] (03PS10) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [21:59:15] !log ryankemper@cumin1001 START - Cookbook sre.dns.netbox [22:00:40] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [22:01:13] (03PS11) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [22:02:25] (03CR) 10CI reject: [V: 04-1] prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) (owner: 10Bking) [22:06:41] (03PS12) 10Bking: prometheus-analytics: create alerts for new ZK cluster [alerts] - 10https://gerrit.wikimedia.org/r/945640 (https://phabricator.wikimedia.org/T341792) [22:07:54] !log ryankemper@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [22:08:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wdqs1004.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - ryankemper@cumin1001" [22:08:58] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [22:08:59] !log ryankemper@cumin1001 END (PASS) - Cookbook 
sre.hosts.decommission (exit_code=0) for hosts wdqs1004.eqiad.wmnet [22:31:31] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:46:31] SRE, ops-eqiad, DC-Ops, serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (Jhancock.wm) Open→Resolved [22:48:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:50:13] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:52:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:54:57] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:55:17] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [22:55:41] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.329 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:28:59] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:30:23] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:31:19] PROBLEM - Check systemd state on an-tool1005 is CRITICAL: CRITICAL - degraded: The following units failed: superset.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:32:53] PROBLEM - restbase endpoints health on restbase2019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:34:17] RECOVERY - restbase endpoints health on restbase2019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:37:35] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:38:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:45:55] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 7.253 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:46:43] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:52:55] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:54:19] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:59:07] RECOVERY - Check systemd state on an-tool1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state