[00:01:05] RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:10:31] (Wikidata Reliability Metrics - Median loading time alert) firing: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:24:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:24:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:25:19] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:30:31] (Wikidata Reliability Metrics - Median loading time alert) resolved: - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert [00:32:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:35:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:36:51] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:38:14] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/957818 [00:38:20] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/957818 (owner: 10TrainBranchBot) [00:51:49] RECOVERY - dump of s5 in codfw on backupmon1001 is OK: Last dump for s5 at codfw (db2101) taken on 2023-09-19 00:00:15 (61 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [00:52:53] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/957818 (owner: 10TrainBranchBot) [00:56:21] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [00:57:45] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:08:49] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346708 (10phaultfinder) [01:09:47] RECOVERY - dump of s5 in eqiad on backupmon1001 is OK: Last dump for s5 at eqiad (db1216) taken on 2023-09-19 00:00:03 (61 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:15:19] RECOVERY - Check systemd state on 
an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:40:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:46:31] (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:59] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:51:21] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0200) [02:04:37] (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:32] (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:06:52] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.27 [core] 
(wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/957819 (https://phabricator.wikimedia.org/T345888) [02:06:58] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.27 [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/957819 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [02:09:37] (JobUnavailable) firing: (9) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:14:37] (JobUnavailable) firing: (10) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:20:52] (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.27 [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/957819 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [02:21:32] (JobUnavailable) firing: (10) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:36:31] (JobUnavailable) firing: (10) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:03] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:41:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [02:42:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [02:50:15] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:52:47] PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:54:09] RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:56:31] (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0300) [03:01:25] (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958595 
(https://phabricator.wikimedia.org/T345888) [03:01:27] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958595 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [03:02:09] (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958595 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [03:02:43] !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.27 refs T345888 [03:02:47] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [03:04:37] (JobUnavailable) firing: (7) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:11:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:14:59] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:20:23] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads [03:25:44] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL [03:27:12] that broke a mediawiki train [03:27:25] see eg 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/927674 [03:30:45] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner [03:37:40] (03Abandoned) 10Anzx: add extranamespacenames for kannada-kn language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958050 (https://phabricator.wikimedia.org/T346583) (owner: 10Anzx) [04:01:47] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:03:11] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:03:49] !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.27 refs T345888 (duration: 61m 05s) [04:03:59] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [04:06:01] !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.25 (duration: 02m 10s) [04:08:43] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:10:05] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:14:29] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:28:31] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy eswiki in staging 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/958598 (https://phabricator.wikimedia.org/T346445) [04:31:18] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy eswiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958598 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [04:32:30] (03Merged) 10jenkins-bot: ml-services: deploy eswiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958598 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [04:35:29] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [04:39:07] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [04:40:29] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:15:41] "2023-09-19 05:07:47: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError"" [05:15:42] hmm [05:24:20] (03Restored) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx) [05:32:59] (03PS4) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) [05:43:07] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:43:15] PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:43:25] PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:43:31] PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:43:49] PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [05:45:40] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P52522 and previous config saved to /var/cache/conftool/dbconfig/20230919-054539-root.json [05:46:25] !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1134.eqiad.wmnet onto db1128.eqiad.wmnet [05:46:31] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:48:04] (03PS1) 10Marostegui: db1134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/958601 [05:48:33] (03CR) 10Marostegui: [C: 03+2] db1134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/958601 (owner: 10Marostegui) [05:48:44] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices2004-dev.codfw.wmnet with OS bookworm [05:54:27] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:54:55] RECOVERY - PyBal backends health check on lvs1019 is 
OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:55:35] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [05:55:37] RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:55:47] RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 21 Jul 2025 03:04:56 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [05:55:51] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [05:55:53] RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [05:56:31] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:00:06] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0600) [06:00:06] kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0600). 
[06:09:30] !log push new pfw policy - T346705 [06:09:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:17] (03CR) 10Ayounsi: [C: 03+1] Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [06:16:39] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:18:05] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:27:41] (03PS1) 10Andrea Denisse: prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) [06:29:39] (03CR) 10CI reject: [V: 04-1] prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [06:32:19] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:33:19] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [06:35:16] !log updating PCC facts [06:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:17] (03PS1) 10Ilias Sarantopoulos: ml-services: update kserve 0.11 in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958808 
(https://phabricator.wikimedia.org/T346445) [06:39:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:44:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [06:51:54] (03PS1) 10Giuseppe Lavagetto: Add the configuration for the new wikikube hosts in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) [06:52:17] (03PS2) 10Giuseppe Lavagetto: kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463 [06:52:19] (03PS2) 10Giuseppe Lavagetto: wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709) [06:52:21] (03PS2) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709) [06:52:23] (03PS1) 10Giuseppe Lavagetto: kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) [06:52:25] (03PS1) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) [06:54:51] (03PS2) 10KartikMistry: Disable Special:Contribute on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) [06:59:07] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update kserve 0.11 in ml staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/958808 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [06:59:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10MoritzMuehlenhoff) Also needs approval by @Gehel being the approver for elasticsearch-roots et al. [07:00:07] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (once approvals on the Phab tasks are in)" [puppet] - 10https://gerrit.wikimedia.org/r/958568 (https://phabricator.wikimedia.org/T346694) (owner: 10Dr0ptp4kt) [07:00:08] Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0700) [07:00:08] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:24] (03Merged) 10jenkins-bot: ml-services: update kserve 0.11 in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958808 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos) [07:02:14] * kart_ is here and will self deploy.. [07:02:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) (owner: 10KartikMistry) [07:03:16] (03Merged) 10jenkins-bot: Disable Special:Contribute on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) (owner: 10KartikMistry) [07:04:10] !log kartik@deploy1002 Started scap: Backport for [[gerrit:956078|Disable Special:Contribute on bnwiki (T345772)]] [07:04:14] T345772: Disable Special:Contribute on bnwiki - https://phabricator.wikimedia.org/T345772 [07:06:09] (03CR) 10Arnaudb: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/957820 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [07:06:11] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [07:06:22] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [07:07:23] (03CR) 10Hashar: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/927674/2301/" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [07:08:35] (03CR) 10Jelto: "Adding Jaime because we remove the backup of static-codereview files. The files live in Git now: https://gitlab.wikimedia.org/repos/sre/mi" [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [07:09:29] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10MoritzMuehlenhoff) >>! In T335879#9173531, @Volans wrote: > This leave us just with two options: > * catch the exception in the cookbooks... [07:11:58] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [07:15:11] scap seems stuck since last 5 minutes at: `07:06:02 K8s images build/push output redirected to /home/kartik/scap-image-build-and-push-log` or is it normal? 
[07:16:10] (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/output/927675/2317/" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [07:16:12] (03PS1) 10Ayounsi: LibreNMS report: remove MODEL_EXCLUDES filter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 [07:16:14] (03PS1) 10Ayounsi: LibreNMS report: add equivalent model strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) [07:16:16] (03PS1) 10Ayounsi: LibreNMS report: use black formating [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 [07:16:38] (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/927676/2318/" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar) [07:21:02] In any case, scap seems super slow today? [07:22:02] always on tuesday [07:22:12] kart_: it is busy synchronizing the new wmf version that got cut over night [07:22:23] ah. [07:22:47] I'm deploying config change and it took 20 minutes and not yet reached to debug servers :/ [07:24:40] MW deployment window seems late today? 
https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1800 [07:26:52] !log kartik@deploy1002 kartik: Backport for [[gerrit:956078|Disable Special:Contribute on bnwiki (T345772)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [07:26:58] T345772: Disable Special:Contribute on bnwiki - https://phabricator.wikimedia.org/T345772 [07:27:39] !log kartik@deploy1002 kartik: Continuing with sync [07:32:11] (03PS1) 10Marostegui: Revert "db1134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/958417 [07:33:13] (03PS4) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) [07:33:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:36:12] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43340/console" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [07:38:34] (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [07:39:10] (03PS5) 10Slyngshede: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) [07:39:18] (03CR) 10Klausman: [C: 03+1] alertmanager: create ml team alerts (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [07:40:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43341/console" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [07:40:47] (03CR) 10Slyngshede: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [07:42:59] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:956078|Disable Special:Contribute on bnwiki (T345772)]] (duration: 38m 49s) [07:43:03] T345772: Disable Special:Contribute on bnwiki - https://phabricator.wikimedia.org/T345772 [07:43:23] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:44:16] (03CR) 10Volans: [C: 03+1] "If tested that Q() behaves as expected looks good to me." 
[software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 (owner: 10Ayounsi) [07:44:47] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:51:07] !log installing libwep security updates on buster [07:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:12] !log installing libwebp security updates on buster [07:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:41] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:55:03] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:56:58] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS bullseye [07:59:51] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [08:00:25] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [08:02:05] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [08:02:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:02:36] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [08:04:46] !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1134.eqiad.wmnet onto db1128.eqiad.wmnet [08:05:02] !log restarting FPM on mw canaries to pick up libwebp updates [08:05:03] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:11] !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [08:05:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.429 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:05:28] !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [08:10:13] (03CR) 10Slyngshede: [C: 03+2] P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [08:10:15] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [08:10:36] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage [08:11:41] !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm2001.wikimedia.org with OS bookworm [08:11:51] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bookworm [08:12:52] (03PS2) 10JMeybohm: kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) [08:13:04] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage [08:16:45] 10SRE: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10Peachey88) [08:17:32] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [08:18:09] kart_: is that still syncing? 
I think scap/rsync/whatever has some issues indeed [08:18:20] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [08:18:22] at a quick glance it seems the rsync are taking way longer than usual, but I have to dig the logs [08:18:42] !log redeploying eventstream-analytics in codfw T336041 [08:18:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:45] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [08:18:53] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [08:19:44] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [08:20:34] !log redeploying eventstream-analytics-external in eqiad T336041 [08:20:36] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [08:20:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:16] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [08:21:41] !log redeploying eventstream-analytics-external in codfw T336041 [08:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:45] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [08:22:25] hashar: no. It is done. 
[08:22:35] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [08:22:57] PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:23:10] !log redeploying eventstreams-internal in codfw T336041 [08:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:24] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [08:23:33] scap cdb rebuild went from a steady 120 seconds median time to 180 seconds last week and 230 seconds this week [08:23:52] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [08:24:12] !log redeploying eventstreams-internal in eqiad T336041 [08:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:24:15] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [08:24:15] RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:24:19] T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041 [08:24:39] !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [08:25:38] !log redeploying mw-page-content-change-enrich in eqiad T336041 [08:25:39] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw [08:25:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:49] !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [08:26:03] !log brouberol@deploy1002 helmfile [eqiad] 
DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:26:36] !log redeploying mw-page-content-change-enrich in codfw T336041 [08:26:36] !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm2001.wikimedia.org with reason: host reimage [08:26:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:45] !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [08:26:55] !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [08:27:40] (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: create ml team alerts [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [08:28:51] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:29:06] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm2001.wikimedia.org with reason: host reimage [08:30:42] (03CR) 10Muehlenhoff: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup) [08:30:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw [08:32:22] (03PS2) 10JMeybohm: kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) [08:34:28] !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad [08:35:11] (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 
(https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [08:36:16] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1146.eqiad.wmnet with OS bullseye [08:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad [08:41:27] !log remove MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from graphite - T346371 [08:41:29] (03PS1) 10Slyngshede: P:idm fix log file path. [puppet] - 10https://gerrit.wikimedia.org/r/958888 [08:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:31] T346371: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite - https://phabricator.wikimedia.org/T346371 [08:42:11] 10SRE, 10Growth-Team, 10Observability-Metrics, 10Graphite: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite - https://phabricator.wikimedia.org/T346371 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you for reaching out @Urbanecm_WMF and letting us know about... [08:42:50] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43342/console" [puppet] - 10https://gerrit.wikimedia.org/r/958888 (owner: 10Slyngshede) [08:43:53] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet [08:44:34] !log bounce benthos@webrequest_live to clear out old metrics [08:44:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:44:51] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm fix log file path. 
[puppet] - 10https://gerrit.wikimedia.org/r/958888 (owner: 10Slyngshede) [08:47:46] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2003.codfw.wmnet [08:48:12] 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10aborrero) [08:48:52] 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10aborrero) a:03cmooney hey @cmooney could you please advice on the cloudswitch models we would need in codfw to expand our capacity in that DCs? [08:49:06] (03CR) 10Vgutierrez: [C: 04-1] "tests aren't happy:" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [08:51:59] (03CR) 10JMeybohm: [C: 03+1] Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [08:52:02] (03CR) 10JMeybohm: [C: 03+1] conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [08:53:48] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10aborrero) [08:54:14] (03CR) 10JMeybohm: [C: 03+1] wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto) [08:54:37] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10aborrero) [08:54:58] (03CR) 10Muehlenhoff: [C: 03+2] netbox: Avoid Ferm-specific syntax [puppet] - 
10https://gerrit.wikimedia.org/r/950167 (owner: 10Muehlenhoff) [08:55:32] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:56:25] 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725 (10aborrero) [08:56:52] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:58:40] (03CR) 10JMeybohm: [C: 03+1] Add the configuration for the new wikikube hosts in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [08:59:51] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet [09:00:02] (03CR) 10JMeybohm: [C: 03+1] conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [09:00:05] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43348/console" [puppet] - 10https://gerrit.wikimedia.org/r/957803 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [09:01:25] (03CR) 10Jelto: [V: 03+1 C: 03+2] "lgtm, default Gemfile has 644 as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/957803 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall) [09:02:31] (03CR) 10JMeybohm: [C: 04-1] kubernetes: add kubernetes10[27-56] to wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto) [09:03:20] (03CR) 10Arnaudb: "unnecessary change" [puppet] - 10https://gerrit.wikimedia.org/r/957820 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [09:03:25] (03Abandoned) 10Arnaudb: icinga: fix Arnaudb on icinga userlist [puppet] - 10https://gerrit.wikimedia.org/r/957820 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb) [09:03:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet [09:04:34] (03CR) 10Elukey: [C: 03+1] Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:06:04] (03CR) 10Elukey: [C: 03+1] mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [09:08:14] !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet [09:08:23] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [09:08:48] (03Abandoned) 10Arturo Borrero Gonzalez: cloudgw: add NFS ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez) [09:11:40] 10SRE, 10Data-Persistence, 10observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (10ABran-WMF) 05Open→03Resolved [09:12:08] !log elukey@cumin1001 END (PASS) - 
Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet [09:13:56] (03CR) 10Jcrespo: [C: 03+1] "Re: backups- looks good to me, no other references, should be safe to deploy, and files will continue in its current format for 3 months o" [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [09:14:40] 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10LSobanski) [09:19:38] 10SRE, 10Bitu, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10dcaro) This might be interesting also for the cloud services projects (toolforge/cloudvps/...) as we have to manage also many abandoned/unresponsive developer accounts and... [09:24:08] (03CR) 10Jelto: [C: 03+2] phabricator: Stop logging Bugzilla redirector misses [puppet] - 10https://gerrit.wikimedia.org/r/952047 (https://phabricator.wikimedia.org/T344884) (owner: 10Aklapper) [09:28:04] (03CR) 10Btullis: [C: 03+2] Increase the kafka-jumbo maximum message size to 10 MB [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis) [09:29:42] (03PS1) 10Slyngshede: P:IDM Use default MySQL backend on package installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/958892 [09:29:45] (03CR) 10Jelto: [C: 03+2] miscweb/microsites: move monitoring of static-codereview to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/958474 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [09:30:28] (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: apply the patch to override cloud.yaml on the cli [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro) [09:30:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43351/console" [puppet] - 10https://gerrit.wikimedia.org/r/958892 (owner: 10Slyngshede) [09:32:39] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43353/console" [puppet] - 10https://gerrit.wikimedia.org/r/958892 (owner: 10Slyngshede) [09:32:47] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Use default MySQL backend on package installation. 
[puppet] - 10https://gerrit.wikimedia.org/r/958892 (owner: 10Slyngshede) [09:33:04] (03PS1) 10Muehlenhoff: Switch netboxdb to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/958894 [09:33:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958894 (owner: 10Muehlenhoff) [09:35:28] (03PS1) 10David Caro: openstack: fix source path for cli patch [puppet] - 10https://gerrit.wikimedia.org/r/958895 [09:35:50] (03CR) 10Marostegui: [C: 03+2] Revert "db1134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/958417 (owner: 10Marostegui) [09:35:53] (03CR) 10David Caro: [C: 03+1] openstack: fix source path for cli patch [puppet] - 10https://gerrit.wikimedia.org/r/958895 (owner: 10David Caro) [09:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52523 and previous config saved to /var/cache/conftool/dbconfig/20230919-093622-root.json [09:36:46] (03Abandoned) 10Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [09:36:50] (03PS1) 10Slyngshede: P:IDM Add mysql Python driver [puppet] - 10https://gerrit.wikimedia.org/r/958896 [09:37:24] (03CR) 10David Caro: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43354/console" [puppet] - 10https://gerrit.wikimedia.org/r/958895 (owner: 10David Caro) [09:38:12] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43355/console" [puppet] - 10https://gerrit.wikimedia.org/r/958896 (owner: 10Slyngshede) [09:38:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/958896 (owner: 10Slyngshede) [09:38:24] (03CR) 10David Caro: [V: 03+1 C: 03+2] 
openstack: fix source path for cli patch [puppet] - 10https://gerrit.wikimedia.org/r/958895 (owner: 10David Caro) [09:38:59] (03PS1) 10Elukey: WIP: Improve ORES-Legacy's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 [09:39:12] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Add mysql Python driver [puppet] - 10https://gerrit.wikimedia.org/r/958896 (owner: 10Slyngshede) [09:40:02] !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. [09:40:08] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:41:28] RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:42:50] (03PS1) 10JMeybohm: Drop kubernetes cergen certs [labs/private] - 10https://gerrit.wikimedia.org/r/958898 (https://phabricator.wikimedia.org/T329826) [09:42:52] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1147.eqiad.wmnet with OS bullseye [09:44:56] (03PS2) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) [09:45:51] (03PS3) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) [09:48:35] !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm2001.wikimedia.org with OS bookworm [09:48:44] 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by 
slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bookworm completed: - idm2001 (*... [09:50:00] PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:51:24] RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [09:51:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 3%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52524 and previous config saved to /var/cache/conftool/dbconfig/20230919-095127-root.json [09:56:13] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Drop kubernetes cergen certs [labs/private] - 10https://gerrit.wikimedia.org/r/958898 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [09:56:20] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) (owner: 10Ayounsi) [09:56:29] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage [09:56:32] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:56:40] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 (owner: 10Ayounsi) [09:58:26] !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1148.eqiad.wmnet with OS bullseye [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1000) [10:01:33] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage [10:05:54] PROBLEM - puppet last run on puppetdb1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604990 seconds, message: Stop Puppet/Puppetdb/Postgres to ensure nothing hits the legacy servers, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:06:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52525 and previous config saved to /var/cache/conftool/dbconfig/20230919-100632-root.json [10:06:53] (03CR) 10JMeybohm: [C: 03+2] Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:06:55] (03CR) 10JMeybohm: [C: 03+2] mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:06:58] (03CR) 10JMeybohm: [C: 03+2] Copy mesh.certificate_1.0.0 to 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/958482 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:07:41] (03Merged) 10jenkins-bot: Copy mesh.certificate_1.0.0 to 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/958482 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:07:55] (03Merged) 10jenkins-bot: mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:07:57] (03Merged) 10jenkins-bot: Update changeprop to use certmanager certs [deployment-charts] - 
10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:11:59] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage [10:12:26] (03PS7) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [10:12:50] PROBLEM - puppet last run on puppetdb2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604896 seconds, message: Stop Puppet/Puppetdb/Postgres/Nginx/microservice to ensure nothing hits the legacy servers, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:15:01] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage [10:20:09] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43361/console" [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [10:21:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52526 and previous config saved to /var/cache/conftool/dbconfig/20230919-102137-root.json [10:25:00] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1147.eqiad.wmnet with OS bullseye [10:28:02] (03CR) 10Milimetric: [C: 03+2] Fix typo in Jade content type name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [10:29:44] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [10:30:55] (03CR) 10Majavah: Fix typo 
in Jade content type name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [10:32:39] (03CR) 10Ladsgroup: Fix typo in Jade content type name (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: 10Ladsgroup) [10:34:16] (03CR) 10JMeybohm: [C: 04-1] "I would make this a patchlevel version tbh. It's fully backwards compatible, does not change anything except internal logic and patchlevel" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [10:34:20] !log codfw swift front-end swift package updates T346730 [10:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:24] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [10:36:30] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [10:36:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52527 and previous config saved to /var/cache/conftool/dbconfig/20230919-103642-root.json [10:38:15] (03CR) 10JMeybohm: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [10:38:29] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1148.eqiad.wmnet with OS bullseye [10:40:07] (03PS5) 10JMeybohm: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [10:40:13] (03CR) 10JMeybohm: [C: 04-1] profile::service_proxy::envoy: 
rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [10:40:18] (03CR) 10Muehlenhoff: [C: 03+2] Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [10:42:05] PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift [10:42:17] PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:42:37] PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Swift [10:43:59] RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Swift [10:44:53] RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift [10:45:05] RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:47] (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) [10:46:55] (03CR) 10JMeybohm: [V: 03+1 C: 04-1] "PCC SUCCESS (CORE_DIFF 26): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43364/console" [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [10:51:47] !log marostegui@cumin1001 
dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52528 and previous config saved to /var/cache/conftool/dbconfig/20230919-105147-root.json [10:54:02] (03PS4) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) [10:57:16] (03PS1) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) [10:58:37] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [10:58:56] (03PS6) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) [10:59:59] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [11:00:14] (03CR) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [11:01:25] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:02:38] (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) [11:02:56] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) 
[11:03:00] (03PS5) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) [11:03:43] (03CR) 10Kamila Součková: [C: 03+1] wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui) [11:04:02] (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) [11:05:13] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [11:06:43] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [11:06:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52529 and previous config saved to /var/cache/conftool/dbconfig/20230919-110651-root.json [11:09:18] !log eqiad swift front-end swift package updates T346730 [11:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:22] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [11:10:38] (03PS2) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) [11:11:49] (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) [11:12:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:12:17] (03CR) 10Arturo Borrero 
Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [11:12:31] (03PS1) 10Slyngshede: C:idm::redis Allow replication via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/958907 [11:14:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez) [11:16:33] PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift [11:16:45] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Swift [11:17:27] PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:18:27] (03PS3) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) [11:20:13] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:21:41] RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:21:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52530 and previous config saved to /var/cache/conftool/dbconfig/20230919-112156-root.json [11:25:41] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 
second response time https://wikitech.wikimedia.org/wiki/Swift [11:27:37] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Swift [11:31:25] (03PS1) 10Muehlenhoff: conntrackd: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) [11:31:49] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:36:08] (03PS1) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042) [11:36:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] conntrackd: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:37:26] (03PS2) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042) [11:38:30] (03CR) 10Muehlenhoff: [C: 03+2] conntrackd: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:42:22] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [11:45:38] 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) I misclicked on netbox and deleted the whole device entry for cloudservices1005, meaning it is no longer registere... 
[11:46:19] (03PS4) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) [11:46:34] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report: remove MODEL_EXCLUDES filter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 (owner: 10Ayounsi) [11:46:39] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report: add equivalent model strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) (owner: 10Ayounsi) [11:46:43] (03CR) 10Ayounsi: [C: 03+2] LibreNMS report: use black formating [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 (owner: 10Ayounsi) [11:47:02] (03CR) 10Clément Goubert: [C: 03+1] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: 10Kamila Součková) [11:47:08] (03Merged) 10jenkins-bot: LibreNMS report: remove MODEL_EXCLUDES filter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 (owner: 10Ayounsi) [11:47:14] (03Merged) 10jenkins-bot: LibreNMS report: add equivalent model strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) (owner: 10Ayounsi) [11:47:17] (03Merged) 10jenkins-bot: LibreNMS report: use black formating [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 (owner: 10Ayounsi) [11:48:08] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary [11:49:11] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [11:50:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary [11:50:54] (03PS8) 10Fabfur: varnish: add more domains for 
mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [11:51:00] !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [11:51:06] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [11:55:40] (03PS2) 10Slyngshede: IDM Switchover [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721) [11:56:17] (03PS4) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [11:58:09] (03PS1) 10David Caro: m:openstack::clientpackages*: add patch to the list of packages [puppet] - 10https://gerrit.wikimedia.org/r/958917 [12:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1200) [12:00:17] (03PS9) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [12:00:24] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43374/console" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro) [12:01:42] (03CR) 10David Caro: [V: 03+1] "PCC looks good, a bit annoying that `present` moved to `installed` xd, but it's ok" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro) [12:02:50] 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10ayounsi) 05Resolved→03Open a:05Papaul→03Jhancock.wm This triggered netbox report alert ganeti2014 (WMF6747) mismatched serials: XXXXX (netbox) != YYYYY (puppetdb) https://netbox.wikimedia.org/extras/r... 
[12:03:32] !log jebe@deploy1002 Started deploy [analytics/refinery@91bb4a0]: Regular analytics weekly train [analytics/refinery@91bb4a0] [12:05:40] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [12:08:10] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [12:09:39] (03PS1) 10Muehlenhoff: conntrackd: Add explicit check [puppet] - 10https://gerrit.wikimedia.org/r/958918 (https://phabricator.wikimedia.org/T336497) [12:09:43] jouncebot: nowandnext [12:09:43] For the next 0 hour(s) and 50 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1200) [12:09:44] In 0 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1300) [12:09:59] (03PS2) 10Urbanecm: beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) [12:10:19] (03CR) 10Urbanecm: [C: 03+2] beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: 10Urbanecm) [12:10:25] !log jebe@deploy1002 Finished deploy [analytics/refinery@91bb4a0]: Regular analytics weekly train [analytics/refinery@91bb4a0] (duration: 06m 53s) [12:11:01] (03Merged) 10jenkins-bot: beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: 10Urbanecm) [12:12:21] 10SRE, 10Infrastructure-Foundations, 10netops: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (10cmooney) 05Open→03Resolved a:03cmooney Closing. If we want to do it on EVPN/VXLAN devices we can revisit in future. 
[12:12:23] !log ms-be204{5,6} swift package updates T346730 [12:12:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:28] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [12:13:17] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43376/console" [puppet] - 10https://gerrit.wikimedia.org/r/958907 (owner: 10Slyngshede) [12:14:34] !log ms-be2047 swift package updates T346730 [12:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:17] (03CR) 10Slyngshede: "Not currently a huge problem as everything in Redis is ephemeral, but it should still work." [puppet] - 10https://gerrit.wikimedia.org/r/958907 (owner: 10Slyngshede) [12:16:17] (03PS1) 10Kamila Součková: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) [12:16:20] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958918 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff) [12:17:04] !log jebe@deploy1002 Started deploy [analytics/refinery@91bb4a0] (thin): Regular analytics weekly train THIN [analytics/refinery@91bb4a0] [12:17:09] !log jebe@deploy1002 Finished deploy [analytics/refinery@91bb4a0] (thin): Regular analytics weekly train THIN [analytics/refinery@91bb4a0] (duration: 00m 05s) [12:17:19] (03CR) 10Muehlenhoff: C:idm::redis Allow replication via IPv6 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958907 (owner: 10Slyngshede) [12:17:23] (03CR) 10Kamila Součková: [C: 04-2] "to be merged during switchover" [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [12:17:29] !log jebe@deploy1002 Started deploy [analytics/refinery@91bb4a0] (hadoop-test): Regular analytics weekly train TEST 
[analytics/refinery@91bb4a0] [12:18:07] (03PS2) 10Urbanecm: Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) [12:18:43] !log ms-be2048 swift package updates T346730 [12:18:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:47] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [12:18:51] (03CR) 10Urbanecm: [C: 03+2] Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [12:19:32] !log jebe@deploy1002 Finished deploy [analytics/refinery@91bb4a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@91bb4a0] (duration: 02m 03s) [12:19:33] (03Merged) 10jenkins-bot: Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm) [12:20:22] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [12:20:33] (03PS1) 10Muehlenhoff: idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921 [12:20:57] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff) [12:21:51] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila) [12:22:20] !log ms-be20[49-59] swift package updates T346730 [12:22:23] RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:22:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:28] (03PS1) 10BBlack: varnish mem: unify global default [puppet] - 10https://gerrit.wikimedia.org/r/958922 [12:23:36] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [12:25:34] (03PS1) 10BBlack: varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) [12:26:18] (03CR) 10Klausman: [C: 03+1] Improve ML team's SLO calculations (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [12:27:17] (03CR) 10CI reject: [V: 04-1] varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack) [12:27:39] (03PS2) 10Muehlenhoff: idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921 [12:28:58] (03PS2) 10BBlack: varnish mem: unify global default [puppet] - 10https://gerrit.wikimedia.org/r/958922 [12:29:00] (03PS2) 10BBlack: varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) [12:29:02] (03CR) 10David Caro: [V: 03+1 C: 03+2] m:openstack::clientpackages*: add patch to the list of packages [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro) [12:30:31] (03PS3) 10BBlack: varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) [12:31:21] (03CR) 10CI reject: [V: 04-1] varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack) [12:32:58] (03CR) 10Muehlenhoff: "check experimental" [puppet] 
- 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff) [12:33:54] (03CR) 10Elukey: Improve ML team's SLO calculations (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [12:35:47] (03CR) 10Ilias Sarantopoulos: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [12:37:07] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:37:22] (03PS3) 10Muehlenhoff: idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921 [12:38:29] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:38:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff) [12:40:53] (03CR) 10Elukey: [C: 03+2] alertmanager: create ml team alerts [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos) [12:44:18] !log ms-be20[60-73] swift package updates T346730 [12:44:19] (03CR) 10BBlack: "PCC confirms nop: https://puppet-compiler.wmflabs.org/output/958922/43377/" [puppet] - 10https://gerrit.wikimedia.org/r/958922 (owner: 10BBlack) [12:44:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:21] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [12:44:33] (03CR) 10BBlack: "PCC confirms nop here too: https://puppet-compiler.wmflabs.org/output/958923/43379/" [puppet] - 10https://gerrit.wikimedia.org/r/958923 
(https://phabricator.wikimedia.org/T253093) (owner: 10BBlack) [12:45:15] (03CR) 10BBlack: [C: 03+2] varnish mem: unify global default [puppet] - 10https://gerrit.wikimedia.org/r/958922 (owner: 10BBlack) [12:45:21] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [12:45:23] (03CR) 10BBlack: [C: 03+2] varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack) [12:53:57] (03CR) 10Klausman: [C: 03+1] Improve ML team's SLO calculations (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [12:55:47] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:57:11] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:57:27] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [12:58:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1300). [13:00:04] Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:16] * Aca waves [13:00:42] (03CR) 10Slyngshede: [C: 03+2] IDM Switchover [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [13:01:05] (03PS1) 10Filippo Giunchedi: o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929 [13:01:15] (03CR) 10Clément Goubert: [C: 03+1] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [13:01:43] * TheresNoTime can deploy [13:01:49] noicee [13:02:03] (03PS3) 10Samtar: Add namespace aliases to shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958421 (https://phabricator.wikimedia.org/T346588) (owner: 10Acamicamacaraca) [13:02:29] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:02:38] thanks TheresNoTime [13:02:47] ideally we'd get a +1 on that first, if you have a sec urbanecm? [13:03:20] Aca discussed the idea with me first, and i don't see a reason why not (otoh, i don't see a reason why yes either :D). if you want me to review the specific patch, can do. 
[13:03:55] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:04:05] * TheresNoTime will deploy [13:04:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958421 (https://phabricator.wikimedia.org/T346588) (owner: 10Acamicamacaraca) [13:04:20] sounds good :) [13:05:01] (03Merged) 10jenkins-bot: Add namespace aliases to shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958421 (https://phabricator.wikimedia.org/T346588) (owner: 10Acamicamacaraca) [13:05:02] yeah, appropriate documentation will be created on-wiki in the project namespace to enhance the usage of the shortcuts/aliases [13:05:09] (03PS5) 10Slyngshede: P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [13:05:19] !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons. 
[13:05:34] !log samtar@deploy1002 Started scap: Backport for [[gerrit:958421|Add namespace aliases to shwiki (T346588)]] [13:05:37] T346588: Add namespace aliases to shwiki - https://phabricator.wikimedia.org/T346588 [13:06:38] (03PS6) 10Slyngshede: P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [13:07:03] (03CR) 10Alexandros Kosiaris: [C: 03+1] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [13:07:34] (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [13:07:38] (03PS10) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) [13:07:40] (03PS10) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [13:07:42] (03PS10) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [13:07:44] (03PS10) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) [13:07:46] (03PS10) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [13:08:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [13:08:33] !log jebe@deploy1002 Started deploy [analytics/refinery@2d9d6d0]: Regular analytics weekly train [analytics/refinery@2d9d6d0] 
[13:09:39] Aca: (almost ready for testing), but remind me, this requires a run of `namespaceDupes.php` after, correct? [13:09:41] (03PS7) 10Slyngshede: P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) [13:10:10] that is correct for the namespaceDupes q. [13:10:14] (ta) [13:10:15] I think yes. [13:11:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [13:12:20] (03PS10) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [13:12:44] (`K8s images build/push` step is taking longer than normal, fwiw) [13:13:11] Assuming it's running, nothing to worry about. [13:14:25] !log jebe@deploy1002 Finished deploy [analytics/refinery@2d9d6d0]: Regular analytics weekly train [analytics/refinery@2d9d6d0] (duration: 05m 52s) [13:14:32] (03CR) 10Slyngshede: [C: 03+2] P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede) [13:14:41] !log jebe@deploy1002 Started deploy [analytics/refinery@2d9d6d0] (thin): Regular analytics weekly train THIN [analytics/refinery@2d9d6d0] [13:14:45] !log jebe@deploy1002 Finished deploy [analytics/refinery@2d9d6d0] (thin): Regular analytics weekly train THIN [analytics/refinery@2d9d6d0] (duration: 00m 04s) [13:15:01] !log jebe@deploy1002 Started deploy [analytics/refinery@2d9d6d0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2d9d6d0] [13:15:20] !log ms-be10[44-60] swift package updates T346730 [13:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:26] T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 [13:16:22] 10SRE-swift-storage: Install new swift packages to 
ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [13:17:08] !log jebe@deploy1002 Finished deploy [analytics/refinery@2d9d6d0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2d9d6d0] (duration: 02m 06s) [13:17:22] urbanecm: looking at the `scap-image-build-and-push-log`, it's the push that's taking a long time.. though iirc that's happened before, so not too concerned [13:17:37] (03CR) 10Elukey: Improve ML team's SLO calculations (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [13:17:49] (as I type that it finishes, of course) [13:18:05] (03CR) 10Muehlenhoff: [C: 03+2] Switch netboxdb to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/958894 (owner: 10Muehlenhoff) [13:19:47] PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:54] (03CR) 10Jelto: [C: 03+2] miscweb/microsites: remove static-codereview resources [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto) [13:21:04] (03PS2) 10Jelto: miscweb/microsites: remove static-codereview resources [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) [13:21:10] ACKNOWLEDGEMENT - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service Slyngshede Switch-over, waiting for reimage. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:22:28] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff) [13:24:22] (03PS1) 10Btullis: Block any open angle brackets in Archiva mirrored URLs [puppet] - 10https://gerrit.wikimedia.org/r/958930 (https://phabricator.wikimedia.org/T318962) [13:24:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10Gehel) Approved for all the elastic and wdqs / wcqs access [13:24:52] (03CR) 10Muehlenhoff: [C: 03+2] idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff) [13:24:57] (03PS1) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 [13:25:12] (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:25:56] (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:26:27] (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:26:52] (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:27:12] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43382/console" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro) [13:27:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet [13:28:10] 
!log samtar@deploy1002 samtar and aleksandar: Backport for [[gerrit:958421|Add namespace aliases to shwiki (T346588)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [13:28:14] T346588: Add namespace aliases to shwiki - https://phabricator.wikimedia.org/T346588 [13:28:18] Aca: can you test? ^ [13:28:25] checking it now [13:29:04] (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:29:16] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 26): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43381/console" [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [13:29:29] (03CR) 10Btullis: Configure kafka-jumbo1015.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:29:33] 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Opened {T346759} for followups, this is done [13:30:05] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43384/console" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro) [13:32:32] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43385/console" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [13:33:08] (03CR) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:33:19] (03CR) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:33:27] (03PS2) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 [13:33:40] TheresNoTime: Looks good to me. Both Cyrillic and Latin aliases seems to work in the search box. [13:33:47] ack [13:33:50] !log samtar@deploy1002 samtar and aleksandar: Continuing with sync [13:34:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet [13:34:25] <_joe_> jouncebot: now [13:34:25] For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1300) [13:35:16] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey) [13:37:01] 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) @ayounsi I should be able to update that on the server. but I will need to reboot it to apply changes. 
[13:37:22] (03Abandoned) 10Ssingh: Remove most knams references/comments [dns] - 10https://gerrit.wikimedia.org/r/953681 (owner: 10BCornwall) [13:37:28] (03PS3) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 [13:39:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:40:10] (03PS11) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [13:40:25] (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:41:23] (03PS11) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) [13:41:29] (03CR) 10Brouberol: [V: 03+2] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:42:17] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43386/console" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [13:42:28] (03PS11) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) [13:42:30] This window may overrun, deployment is being very slow at the moment [13:42:52] (03PS1) 10Filippo Giunchedi: alertmanager: fix email_confgs for ml team [puppet] - 10https://gerrit.wikimedia.org/r/958934 [13:43:21] (03PS2) 10Filippo Giunchedi: alertmanager: fix 
email_confgs for ml team [puppet] - 10https://gerrit.wikimedia.org/r/958934 [13:44:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:44:10] (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:44:48] * TheresNoTime has failures [13:44:58] (03PS11) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) [13:45:23] (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:45:25] PROBLEM - Check systemd state on ms-be1057 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:45:38] (03CR) 10Elukey: [C: 03+2] "Sorry :(" [puppet] - 10https://gerrit.wikimedia.org/r/958934 (owner: 10Filippo Giunchedi) [13:45:39] TheresNoTime: with which hosts? 
[13:45:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:45:50] taavi: https://phabricator.wikimedia.org/P52532 [13:46:25] !log jebe@deploy1002 Started deploy [airflow-dags/analytics@6b9855a]: (no justification provided) [13:46:37] !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-test-client1001.eqiad.wmnet [13:46:49] (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:47:05] (03PS4) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 [13:47:08] !log jebe@deploy1002 Finished deploy [airflow-dags/analytics@6b9855a]: (no justification provided) (duration: 00m 43s) [13:47:29] (03PS3) 10Giuseppe Lavagetto: kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463 [13:47:31] (03PS3) 10Giuseppe Lavagetto: wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709) [13:47:33] (03PS3) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709) [13:47:35] (03PS2) 10Giuseppe Lavagetto: kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) [13:47:37] (03PS2) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) [13:47:49] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:47:57] Aca: FYI, I believe the 
failure caused a rollback [13:47:59] claime: / serviceops: ^ TNT's error above seems to be with k8s capacity in codfw [13:48:07] or _joe_ maybe? [13:48:10] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:48:15] RECOVERY - Check systemd state on ms-be1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:48:31] taavi: checking [13:48:39] <_joe_> claime: are you looking, ack [13:49:04] hmm [13:49:10] Aca & taavi: I have a meeting on the hour, so will not be able to re-deploy if so. Are you (taavi) able to, if needed? [13:49:18] (03PS12) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) [13:49:21] this is what I see in kubectl describe pod: Warning FailedScheduling 60s default-scheduler 0/24 nodes are available: 18 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate. 
[13:49:32] mw-api-int in codfw [13:49:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:49:43] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43388/console" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [13:50:16] (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [13:50:17] <_joe_> taavi: ack, thanks [13:50:35] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating DNS record of kuberbetes2026 - jhancock@cumin2002" [13:50:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:50:55] taavi: thanks, it's the usual resource issue... [13:50:58] TheresNoTime Not a big deal. I can reschedule, if needed. 
[13:51:14] <_joe_> it will be resolved for good on thursday [13:51:49] !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox [13:51:52] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [13:52:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating DNS record of kuberbetes2026 - jhancock@cumin2002" [13:52:01] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [13:52:05] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:52:33] !log clean old puppet certs kafka_logging-eqiad_broker [13:52:34] 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) This is just a serial, does this really need a reboot os the OS? (We can arrange for that, but the server would need to be drained first) [13:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:34] yeah, in the meantime I'll do the same as 0ffa20bb0b1f712388a7a2945f0b291c8e4b7449 [13:52:35] uff [13:52:40] * elukey amends sal [13:52:45] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:53:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] mcrouter: Specify missing CXXFLAGS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [13:53:29] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:53:30] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-test-client1001.eqiad.wmnet [13:53:31] stevemunene@cumin1001: Failed to log message to wiki. 
Somebody should check the error logs. [13:53:37] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:54:00] (03CR) 10Stevemunene: [C: 03+2] Remove mention of an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/957862 (https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene) [13:54:37] (03PS1) 10Clément Goubert: Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 [13:54:45] (03CR) 10CI reject: [V: 04-1] Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert) [13:55:08] (03PS1) 10Slyngshede: C:idm::redis bind to both IPv4 and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/958936 [13:55:15] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Volans) Sure, but wmflib is a general purpose library and shouldn't make that assumption. So I'd rather do that via a parameter so that th... [13:55:21] Okay, scap is just doing the php-fpm restarts and then that deployment rollback is done [13:56:09] 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon) [13:56:19] (03CR) 10Btullis: [C: 03+2] Update refinery-job jar version for analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616) (owner: 10Joal) [13:56:25] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43390/console" [puppet] - 10https://gerrit.wikimedia.org/r/958936 (owner: 10Slyngshede) [13:56:57] TheresNoTime: Is it rolling back the whole backport? 
We can schedule it again after the traffic/services switchover [13:57:10] PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:57:19] <_joe_> claime: AIUI it should not [13:57:25] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:958421|Add namespace aliases to shwiki (T346588)]] (duration: 51m 50s) [13:57:28] T346588: Add namespace aliases to shwiki - https://phabricator.wikimedia.org/T346588 [13:57:30] Yeah I thought it'd just rollback k8s [13:57:35] <_joe_> TheresNoTime: did it rollback everything? [13:57:36] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:57:40] ah, no, just k8s [13:57:47] Aca: should be live now then [13:58:09] <_joe_> claime: we can do a k8s only deployment manually I guess? [13:58:18] _joe_: yeah we can [13:58:28] (03PS1) 10Brouberol: Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041) [13:58:38] TheresNoTime Oh, okie. Thank you for handling all this! [13:58:53] !log `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki shwiki --fix` T346588 [13:58:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:25] * TheresNoTime away! 
[13:59:44] PROBLEM - Check systemd state on kafka-jumbo1015 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:59:46] <_joe_> claime: I can handle this if you need to follow the switchover [14:00:04] kamila_: Your horoscope predicts another unfortunate Datacenter switchover: Services + Traffic deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1400). [14:00:12] _joe_: appreciated, yeah, thanks [14:00:21] 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) yes, it's a bios setting. so it would require a reboot to apply. I should have caught that when I was fixing it the first time around so that's my bad. [14:00:25] jouncebot being very ominous [14:00:29] XD [14:00:42] <_joe_> TheresNoTime: the window is over, I'm gonna sync k8s by hand [14:00:52] _joe_: ack :) [14:00:55] !log kamila@deploy1002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 [14:00:55] ok so just a reminder, we're coordinating on -sre, I'll be monitoring -operations for issuess [14:01:03] T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 [14:01:12] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [14:01:17] TheresNoTime Do let me know if there are pages to fix [14:01:19] !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330 [14:01:31] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook 
cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Service... [14:03:06] (03PS2) 10Brouberol: Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041) [14:03:39] (03PS2) 10Giuseppe Lavagetto: Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert) [14:05:06] PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [14:05:52] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:05:55] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346708 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [14:07:25] PROBLEM - Kafka Broker Server #page on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:07:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:08:08] kafka-jumbo1015 is apparently a new host, nothing to do with the switchover [14:08:15] <_joe_> yeah [14:08:43] I'll ack the alert [14:08:53] thanks [14:09:34] <_joe_> is jenkins broken? 
[14:10:13] (03PS11) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) [14:10:19] <_joe_> yeah this job is clearly broken https://integration.wikimedia.org/ci/job/helm-lint/12948/console [14:10:38] _joe_: I see multiple jobs proceeding, so it doesn't seem something like CI infra wide [14:11:06] <_joe_> I'm gonna be bold and add a V+2 to the change [14:11:08] it is too long linting things? :/ [14:11:20] PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:11:21] <_joe_> hashar: that job should usually finish in 1 minute [14:11:37] 7 already? interesting [14:11:42] jenkins+ 3515836 0.0 0.1 1495712 42068 ? Sl 14:03 0:00 | \_ docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/helm-lint:/workspace --security-opt seccomp=unconfined --init --rm --la [14:11:42] jenkins+ 3515838 0.0 0.0 0 0 ? 
Z 14:03 0:00 | \_ [bash] [14:11:56] from integration-agent-docker-1042.integration.eqiad1.wikimedia.cloud [14:12:01] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert) [14:12:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:12:48] RECOVERY - Check systemd state on kafka-jumbo1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:13:06] (03Merged) 10jenkins-bot: Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert) [14:13:15] RECOVERY - Kafka Broker Server #page on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration [14:13:20] <_joe_> and indeed, the job worked immediately [14:13:32] and the other got unlocked somehow [14:13:46] RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1015 is OK: SSL OK - Certificate kafka-jumbo1015.eqiad.wmnet valid until 2024-09-18 13:48:00 +0000 (expires in 364 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate [14:14:19] looks like it has spend 8 minutes trying to spin up the container [14:15:20] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/958936 (owner: 10Slyngshede) [14:16:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - 
https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:16:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10VRiley-WMF) cloudtastic1007 A 2. U 26. port 17 CableID 5245 cloudtastic1008 B 2. U 25. port 36 CableID 5006 cloudtastic1009 C 2. U 27. por... [14:17:03] (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:17:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:35] (03PS1) 10Jclark-ctr: add pki1002 to T342892 [puppet] - 10https://gerrit.wikimedia.org/r/958943 (https://phabricator.wikimedia.org/T342892) [14:20:23] !log kamila@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 (duration: 19m 27s) [14:20:23] !log oblivian@deploy1002 Started scap: (no justification provided) [14:20:27] T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 [14:20:45] We're releasing the scap lock for an emergency mw-on-k8s fix, please DO NOT RUN SCAP RIGHT NOW [14:21:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:22:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes 
(tcp_rsyslog_receiver_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:22:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:22:36] (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:22:36] ^ expected [14:22:42] (PHPFPMTooBusy) [14:23:17] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:25:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:25:35] !log oblivian@deploy1002 Finished scap: (no justification provided) (duration: 05m 44s) [14:26:26] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2021.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBal [14:26:32] (03CR) 10Fabfur: "Confirm that with the latest CR tests now are all fine (removed api.w.o from regex and tests as is managed by misc)" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [14:26:50] (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [14:27:03] (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:27:52] RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:28:19] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thumbor [14:28:34] (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:47] !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover: Services - T346330 [14:28:51] T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 [14:28:57] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Service... 
[14:29:51] (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [14:30:32] !log kamila@deploy1002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 [14:31:05] (03PS1) 10Anzx: add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) [14:31:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:31:45] (03CR) 10CI reject: [V: 04-1] add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [14:32:03] !log Switch deployment server - T346330 [14:32:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:32:50] (03PS4) 10Kamila Součková: wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) [14:33:02] (03CR) 10Kamila Součková: [C: 03+2] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [14:33:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All 
- https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:33:11] (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [14:33:25] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift [14:33:26] (03PS3) 10Kamila Součková: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) [14:33:28] PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [14:33:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:33:54] !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro [14:34:45] (03CR) 10Kamila Součková: [C: 03+2] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [14:35:22] (03PS2) 10Anzx: add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) [14:36:01] (03CR) 10CI reject: [V: 04-1] add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx) [14:36:16] RECOVERY - PyBal backends health check on lvs2014 
is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [14:36:21] !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-rw,name=eqiad [14:36:27] !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-rw,name=codfw [14:36:52] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:37:03] (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:37:18] (03PS3) 10Anzx: add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) [14:37:58] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:37:58] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:38:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:38:16] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:38:45] (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:39:15] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [14:39:24] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:39:24] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:39:42] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:40:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:40:50] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:41:18] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: 
/en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:41:33] (03CR) 10Vgutierrez: [C: 03+1] varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur) [14:42:02] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [14:42:30] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:42:46] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:43:36] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:08] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:44:22] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:45:14] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1026-56} - jclark@cumin1001" [14:46:16] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:46:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1026-56} - jclark@cumin1001" [14:46:17] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:46:44] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:46:55] (03CR) 10Andrea Denisse: [C: 03+1] "This looks pretty good, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [14:46:56] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:47:29] (03CR) 10Muehlenhoff: mcrouter: Specify missing CXXFLAGS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [14:47:46] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:10] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:10] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:10] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:22] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:28] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:32] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt 
article) is CRI [14:48:32] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:48:33] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1007 [14:49:06] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:49:06] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:49:14] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:49:46] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:49:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1007 [14:49:49] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008 [14:50:00] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009 [14:50:07] !nowandnext [14:50:21] jouncebot: nowandnext [14:50:22] For the next 0 hour(s) and 9 minute(s): Datacenter switchover: Services + Traffic 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1400) [14:50:22] In 1 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1600) [14:50:30] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:50:40] !log installing python-werkzeug security updates [14:50:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:13] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008 [14:51:18] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:51:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009 [14:51:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) [14:51:36] thanks taavi [14:51:36] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010 [14:51:40] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:51:47] (I will remember one day XD) [14:52:06] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the
unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:36] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:52:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010 [14:53:20] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:53:26] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:53:50] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:53:50] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:54:12] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:54:52] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:54:52] PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:55:02] 
(03PS1) 10Alexandros Kosiaris: Harmonize thumbor's eqiad/codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958953 [14:55:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [14:55:14] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:55:14] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:18] RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:31] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1008.mgmt.eqiad.wmnet with reboot policy FORCED [14:56:33] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1009.mgmt.eqiad.wmnet with reboot policy FORCED [14:56:34] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED [14:56:38] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [14:56:40] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:52] PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:56:52] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:56] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:56:56] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:56:57] (03CR) 10Volans: [C: 03+1] "LGTM (waiting for the switchover to finish)" [puppet] - 10https://gerrit.wikimedia.org/r/958943 (https://phabricator.wikimedia.org/T342892) (owner: 10Jclark-ctr) [14:57:34] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [14:57:34]
est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:04] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:16] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:20] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:26] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:58:45] (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [14:58:59] (03CR) 10Alexandros Kosiaris: [C: 03+2] Harmonize thumbor's eqiad/codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958953 (owner: 10Alexandros Kosiaris) [14:59:32] PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: 
/en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:44] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [14:59:48] (03Merged) 10jenkins-bot: Harmonize thumbor's eqiad/codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958953 (owner: 10Alexandros Kosiaris) [14:59:54] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:00:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:00:56] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:00:56] RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:02] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:20] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content 
HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:02:23] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:02:27] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [15:02:47] !log increase thumbor's pods in codfw to 48 to harmonize with eqiad [15:02:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) So this system only supported UEFI mode, which we've not supported installing within WMF. If I can recall correctly, a few years ago we had a... [15:03:56] RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:03:56] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:04:06] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:04:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) IRC update: Chatted with Moritz in IRC and we're no where near supporting UEFI 
mode anytime in near to mid term. We should likely return these. [15:04:26] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:05:08] PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:05:18] !log kamila@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 (duration: 34m 46s) [15:05:23] T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 [15:05:32] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:00] !log cgoubert@deploy2002 Started scap: (no justification provided) [15:06:06] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:09] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED [15:06:36] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:48] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:48] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:06:54] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:07:28] 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [15:07:47] (03CR) 10Arturo Borrero Gonzalez: "LGTM. Thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [15:08:00] RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:08:55] (03PS2) 10Kamila Součková: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) [15:09:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:09:50] <_joe_> uhhh now what [15:10:04] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:46] <_joe_> claime: it looks like mw-web can't sustain all 5% of traffic in a single dc [15:10:55] Apparently yeah [15:11:00] RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:11:25] I'm sorry I'm balancing with scap rn [15:11:30] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test
page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [15:11:30] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:11:48] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:08] _joe_: It's only the canaries though [15:12:19] Are they getting a bigger portion of traffic than they should? [15:12:28] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRI [15:12:28] est Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:12:35] <_joe_> claime: we're at 250 rps [15:12:56] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:13:01] _joe_: canaries are getting 40rps [15:13:06] _joe_: it's 2 replicas, out of 14 [15:13:14] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:13:15] <_joe_> yeah that's a bit too much I'd say :) [15:13:54] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are 
healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:14:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:14:18] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:14:20] (03CR) 10TK-999: mcrouter: Specify missing CXXFLAGS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999) [15:16:02] (03CR) 10AOkoth: ats: add ticket-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:16:30] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:17:56] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:18:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1009.mgmt.eqiad.wmnet with reboot policy FORCED [15:18:12] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1008.mgmt.eqiad.wmnet with reboot policy FORCED [15:19:26] (03PS2) 10Dr0ptp4kt: dr0ptp4kt WDQS, Search, Analytics access [puppet] - 10https://gerrit.wikimedia.org/r/958568 
(https://phabricator.wikimedia.org/T346694) [15:20:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:21:48] (03PS1) 10Clément Goubert: mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) [15:22:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) (owner: 10Clément Goubert) [15:22:41] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) (owner: 10Clément Goubert) [15:24:47] (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) (owner: 10Clément Goubert) [15:25:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [15:25:19] eoghan, jnuche: a half-hour downtime for all the phabricator.wikimedia.org stuff should be sufficient [15:25:38] also, just to confirm, current deploy server is deploy2002.eqiad.wmnet, yeah? [15:25:38] Ok! I'll downtime that now if you're happy to go? 
[15:25:50] !log reduce mw-on-k8s traffic to 3% waiting on new nodes - T346330 [15:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:55] T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 [15:25:58] brennen: yes [15:26:22] brennen: deploy2002.codfw.wmnet, not eqiad [15:26:28] !log running puppet on 'A:cp-text and P{P:trafficserver::backend}' - T346330 [15:26:29] er, yeah [15:26:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:38] yeah sorry only looked at the number lol [15:26:44] claime: will i be stepping on your toes if we do a brief phab/phorge update now? [15:27:00] brennen: I'd rather you wait a tad [15:27:09] <_joe_> claime: lmk when scap finished [15:27:12] claime: ack. we can push this one. not an urgent update. [15:27:54] !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED [15:28:16] brennen did you run the submodule update on deploy2002? I'm still seeing the old commits [15:28:53] _joe_: It's deploying canaries, eqiad went fine but codfw is taking some time, afraid we're gonna run into the same capacity issue [15:29:05] <_joe_> sigh [15:29:59] <_joe_> claime: so a deployment won't work right now [15:30:03] <_joe_> we need to solve this [15:30:07] probably not [15:30:20] <_joe_> let's reduce the number of replicas? [15:30:26] <_joe_> would that help a bit? 
[15:30:45] (03PS5) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931 [15:30:46] Since we reduced traffic, we can reduce replicas, it would help [15:30:47] (03PS3) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042) [15:30:54] (03CR) 10David Caro: openstack::util::patch: add define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [15:31:11] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10MGerlach) [15:31:11] <_joe_> I'm not sure I understand fully what's the issue and why just on canaries [15:31:33] _joe_: because only 2 replicas and rolling restarts [15:31:39] <_joe_> jayme: can you please help with finding the solution to that problem? I have meetings [15:31:58] <_joe_> claime: so going to say 4 replicas there and reducing by 2 the main pool would help? [15:32:16] no [15:32:50] we could change the rollingupdate strategy for that deployment I guess [15:33:00] the issue is 25% of the new replicaset needs to be spun up and deemed healthy before 25% of the previous replicaset is killed [15:33:00] (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/canary on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:33:11] and from there doing it in 25% steps [15:33:21] so increasing the pods of canary would make it worse [15:33:35] if it is a no node found to satisfy the requests thing [15:33:48] if it is a quota thing... why do we have quotas for mediawiki ? 
[15:33:53] it's like THE ONE APP we got [15:34:19] It's no node found for scheduling [15:34:23] It's not quotas I don't think [15:34:36] ok, so yeah increasing pod # wouldn't help [15:34:39] FailedScheduling is available resources [15:34:41] ah, so it's overall capacity [15:34:58] the easy answer for this week is the thumbor trick [15:35:13] or how the cluster is currently packed (density) [15:35:45] but I don't see why this impacts just canaries [15:35:51] it should impact all mw deployments [15:36:00] and the ones with more pods should be impacted more [15:36:00] just by chance, no? [15:36:12] is it just that canaries are listed first as a release? [15:36:18] RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:36:21] or just chance as jayme says? [15:36:30] could we disable the canary release until thursday? [15:36:54] akosiaris: jayme: https://phabricator.wikimedia.org/P52534 [15:37:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:37:32] *growls* [15:37:52] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10MGerlach) Hi, @AKhatun_WMF is a returning staff member working with us (Research) as a contractor on a new [[ https://meta.wikimedia.org/wiki/Research:Improving_multili... 
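A minimal sketch of the rolling-update arithmetic discussed at 15:33, assuming the 25% surge / 25% unavailable figures quoted in the channel and Kubernetes' documented rounding (a percentage maxSurge rounds up, a percentage maxUnavailable rounds down). The function name is hypothetical, and whether the mw-on-k8s charts override these values is not stated in the log. It may help show why a 2-replica canary release needs a spare pod slot before any old pod can be removed:

```python
def rolling_update_budget(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return (surge, unavailable) pod counts for a Deployment rolling update.

    Assumption: percentages are resolved the way Kubernetes documents it,
    with maxSurge rounded up and maxUnavailable rounded down.
    """
    # Integer ceiling division avoids floating-point rounding surprises.
    surge = -(-replicas * max_surge_pct // 100)
    # Integer floor division for the unavailable budget.
    unavailable = replicas * max_unavailable_pct // 100
    return surge, unavailable


def needs_spare_capacity(replicas):
    """True when no old pod may be removed before a new pod is healthy."""
    _, unavailable = rolling_update_budget(replicas)
    return unavailable == 0


if __name__ == "__main__":
    for n in (2, 4, 10):
        surge, unavailable = rolling_update_budget(n)
        print(f"replicas={n}: surge={surge}, unavailable={unavailable}, "
              f"needs spare capacity={needs_spare_capacity(n)}")
```

Under these assumptions a 2-replica release gets a surge of 1 and an unavailable budget of 0, so the rollout stalls if the scheduler cannot place the extra pod, while larger releases can free room by removing old pods first.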
[15:38:06] I can't do anything about it right now, scap is rolling back, and again encountering the same deployment issue because there's just no resources [15:38:56] We need to scale back the main releases, deploy them *first*, then deploy the canaries [15:39:04] <_joe_> yes [15:39:06] ok, so we got a couple of options [15:39:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10thcipriani) >>! In T345877#9174389, @Vgutierrez wrote: > Thanks!, still blocked on @thcipriani for deployment group membershi... [15:39:34] <_joe_> btw we have all the time until the next backport window [15:39:49] * scale back main [15:39:49] * the trick the scheduler trick [15:39:49] * undeploy canaries [15:39:58] Undeploy is not an option [15:40:03] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:40:03] scap relies on canary releases working [15:40:10] ok, scratching it out. [15:40:21] <_joe_> I would suggest not to scale back mw-api-int though [15:40:24] (03PS1) 10Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958962 [15:40:30] <_joe_> it's taking all the traffic in codfw right now [15:40:33] scale back main mw-api-ext and mw-web [15:40:39] <_joe_> ack [15:40:44] <_joe_> I vote go [15:40:46] (03CR) 10Vgutierrez: "access got approved :)" [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez) [15:40:53] ok we've broken through helmfile failures in scap, it's doing bare metal now [15:40:59] Looks to be doing all right [15:41:37] can we scale down any kind of other workload?
[15:42:16] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:42:19] <_joe_> akosiaris: not sure, probably yes [15:42:21] we should honestly just de-deploy mw-jobrunner [15:42:31] It's just 1+1 replicas but they're functionally useless [15:42:34] * akosiaris looking [15:42:37] Not sure how to remove them from scap rn [15:42:45] <_joe_> claime: it's irrelevant and not sure how to do that right now though [15:43:00] (HelmReleaseBadStatus) resolved: (4) Helm release mw-api-ext/canary on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [15:43:02] (03PS2) 10Vgutierrez: admin: Grant shell access to acooper [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) [15:43:05] _joe_: What is irrelevant? [15:43:16] why irrelevant? it frees space for 2 pods [15:43:17] <_joe_> mw-jobrunner [15:43:28] <_joe_> it's much more work than anything else to remove that [15:43:31] <_joe_> it has LVS [15:43:47] <_joe_> and we can reduce the replicas more [15:43:51] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 [15:43:58] _joe_: ^ [15:44:01] ah...lvs is a good point [15:44:20] <_joe_> we have elevated errors from mw-on-k8s [15:44:23] <_joe_> what's going on? 
[15:44:50] <_joe_> ah I think it's related to the scap failures [15:45:03] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:45:04] PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:22] RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:23] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [15:45:25] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [15:46:02] (03CR) 10Ssingh: [C: 03+1] admin: Grant shell access to acooper [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez) [15:46:28] RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:46:36] another easy hack could be to (temporarily) remove the taint from the kask nodes [15:46:45] !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 40m 44s) [15:46:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff) [15:46:52] _joe_: scap done [15:46:54] <_joe_> those are ganeti nodes, we don't want mw there [15:47:14] <_joe_> seriously, let's reduce the external traffic to even 1% if needed [15:47:16] (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-api-ext - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:47:18] PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:47:20] <_joe_> and let's reduce the number of replicas [15:47:25] I've reduced it to 3 [15:47:44] (03CR) 10JMeybohm: [C: 03+1] mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 (owner: 10Clément Goubert) [15:47:51] ack [15:48:09] And I'm waiting on alex to scrounge up some resources before scaling back replicas [15:48:24] This high error rate though, what [15:48:29] <_joe_> https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus%2Fops&orgId=1&viewPanel=18 doesn't look good [15:48:41] <_joe_> it happened with the deployment I think [15:48:46] 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T346699 (10Gehel) [15:48:46] yeah [15:48:49] <_joe_> on api [15:48:55] <_joe_> let's look at logstash [15:48:57] I'm gonna do the scale back anyways, akosiaris [15:49:05] <_joe_> I think it's a version mismatch of some kind [15:49:09] 'cause it could be the version mismatch [15:49:11] heh [15:49:22] (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 (owner: 10Clément Goubert) [15:49:49] Give me a few minutes to square up the mw-on-k8s releases [15:50:04] "DBQueryError: Error 1146: Table 'wikidatawiki.revision_comment_temp' doesn't exist" [15:50:08] for various wikis [15:50:10] RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:50:11] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 (owner: 10Clément Goubert) [15:50:54] (03PS1) 10Alexandros Kosiaris: Scale down replicas of various services [deployment-charts] - 10https://gerrit.wikimedia.org/r/958965 [15:51:06] <_joe_> Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'wikidatawiki.revision_comment_temp' doesn't exist [15:51:15] that I said :) [15:51:29] since 15:33 [15:51:37] <_joe_> jynus: we're aware [15:51:41] <_joe_> it's related to the scap issues [15:51:45] scaling back main [15:51:49] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [15:51:50] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [15:51:51] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [15:51:54] https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958965 [15:52:00] is that something created by scap? 
[15:52:05] I'll deploy the big ones immediately [15:52:05] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [15:52:06] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [15:52:07] <_joe_> claime: it's more important on mw-api-* [15:52:12] <_joe_> to redeploy [15:52:18] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:52:18] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [15:52:19] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [15:52:22] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [15:52:23] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [15:52:27] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [15:52:28] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [15:52:30] (03CR) 10Alexandros Kosiaris: [C: 03+2] Scale down replicas of various services [deployment-charts] - 10https://gerrit.wikimedia.org/r/958965 (owner: 10Alexandros Kosiaris) [15:52:40] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [15:52:41] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [15:52:53] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [15:52:54] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply [15:52:56] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply [15:52:57] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply [15:53:01] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply [15:53:08] ok, main
scaled down [15:53:12] scapping a k8s redeploy [15:53:32] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:53:38] <_joe_> Amir1: around? [15:53:43] yes [15:53:46] <_joe_> any idea why that error would show up? [15:53:50] !log cgoubert@deploy2002 Started scap: (no justification provided) [15:53:54] <_joe_> the wikidata temp table [15:53:57] DBQueryError: Error 1146: Table 'wikidatawiki.revision_comment_temp' doesn't exist [15:54:01] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:54:04] it's for different wikis AIUI [15:54:11] (03PS1) 10AOkoth: ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/958987 (https://phabricator.wikimedia.org/T340027) [15:54:15] that is old code [15:54:15] not only wikidata [15:54:25] somehow old code is showing up? [15:54:25] !log scaling down mobileapps, wikifeeds, mathoid, similar-users [15:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] deployment host having old code? [15:54:45] <_joe_> Amir1: yeah I fear it's because of the deployment host [15:54:45] or old config [15:54:54] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:54:59] <_joe_> Amir1: can you check mediawiki-staging on deploy2002? [15:55:05] sure [15:55:05] It would have pushed that code to bare metal too wouldn't it? [15:55:14] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply [15:55:20] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [15:55:22] Do we have those errors from bare metal as well [15:55:24] ? 
[15:55:30] <_joe_> claime: I don't think so [15:55:39] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/similar-users: apply [15:55:49] Ok, scap k8s deployment going well [15:55:57] canaries are good [15:56:01] the config in mediawiki-staging is ok [15:56:06] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/similar-users: apply [15:56:06] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:56:09] <_joe_> jayme: are errors going down? [15:56:15] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/similar-users: apply [15:56:15] I don't see those errors for metal [15:56:22] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:56:47] _joe_: I'd say yes [15:56:57] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/similar-users: apply [15:57:02] claime: a few hosts below 90% now in codfw [15:57:03] !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 03m 12s) [15:57:04] ok so it was version mismatch between releases [15:57:05] <_joe_> ok so I fear I know what happened [15:57:15] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:57:19] <_joe_> the scap rollback tries to be very smart [15:57:21] <_joe_> and it should not [15:57:23] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:57:38] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [15:57:41] more context T215466 [15:57:42] T215466: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466 [15:57:46] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[15:57:56] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:57:57] (03CR) 10AOkoth: [C: 03+2] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/958987 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:58:03] (03PS2) 10Andrea Denisse: prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) [15:58:07] (03CR) 10AOkoth: [V: 03+2 C: 03+2] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/958987 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [15:58:10] errors gone [15:58:12] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [15:58:14] <_joe_> yep [15:58:17] <_joe_> errors gone [15:58:20] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [15:58:24] (03CR) 10CI reject: [V: 04-1] prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [15:58:37] <_joe_> ok, crisis averted [15:58:38] what was it? [15:58:38] ok we're fine [15:58:44] <_joe_> jynus: not now [15:58:46] ok [15:58:53] and T299954 [15:58:53] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [15:59:02] did scap roll back a helm rollback? 
[15:59:06] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [15:59:11] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [15:59:13] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt) [15:59:37] fwiw, the change happened months ago T299954#8930695 [15:59:47] how did we rollback to that version? [15:59:47] :-o [15:59:50] ok, multiple hosts now below 90% in wikikube@codfw [16:00:05] jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1600). Please do the needful. [16:00:06] Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [16:00:09] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt) @Gehel I added `airflow-search-admins` to the ticket description and amended the patch, after David said it might be something not need... [16:00:17] * urbanecm waves [16:00:18] (03CR) 10Kamila Součková: [C: 03+2] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková) [16:00:29] urbanecm: we have some fun issues right now [16:00:46] ack, can wait. 
[16:01:32] it might be some bug in the code that make it crop up again but we even removed that code a couple weeks ago https://gerrit.wikimedia.org/r/c/mediawiki/core/+/929717/ [16:01:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:01:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10Gehel) >>! In T346694#9179527, @dr0ptp4kt wrote: > @Gehel I added `airflow-search-admins` to the ticket description and amended the patch, after D... [16:02:14] (03CR) 10AOkoth: [C: 03+2] vrts: vrts1002 change global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/958565 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth) [16:02:16] (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:03:26] (03PS3) 10Andrea Denisse: prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) [16:03:30] We are doing some testing with flink-zk1001 in case any alerts arrive - you can ignore them. 
[16:04:16] PROBLEM - Zookeeper Server on flink-zk1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [16:04:24] ^expected [16:04:33] !log DC Switchover: traffic - T346330 [16:04:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:39] T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 [16:06:30] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:06:34] (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:08:34] (03PS2) 10TK-999: mcrouter: Specify missing CXXFLAGS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 [16:09:10] _joe_: let me know if I can help on anything. I can't find any trace of that code in deploy2002 [16:09:38] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:10:24] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:10:49] Amir1: he's in a meeting currently and I think we're currently safe again [16:11:28] jayme: awesome. I go to other stuff, ping me if needed [16:11:33] I'm not sure what exactly happened tbh but it seems j.oe has a theory :) [16:11:45] sure thing, thanks!
[16:12:12] PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%): /tmp 0 MB (0% inode=97%): /var/tmp 0 MB (0% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [16:12:48] Looks like kerberos ate all of its kibble [16:13:38] btullis: ^ [16:14:30] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:14:44] (03PS1) 10Btullis: Add the analytics and search-pltform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) [16:15:00] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:21] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43397/console" [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis) [16:16:22] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:16:40] RECOVERY - Zookeeper Server on flink-zk1001 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper [16:20:19] (03CR) 10Ryan Kemper: [C: 03+1] Add the analytics and search-pltform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) 
(owner: 10Btullis) [16:20:23] claime: Thanks. Looking now. [16:20:27] (03PS2) 10Ryan Kemper: Add the analytics and search-platform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis) [16:22:17] (03CR) 10Majavah: [C: 04-1] "deployment process: https://phabricator.wikimedia.org/P49715" [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah) [16:23:18] 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) Disk usage hit 100% and I did this again: ` btullis@krb1001:~$ sudo truncate -s 10000 /var/log/kerberos/krb5kdc.log ` This was the size beforehand. ` btullis@... [16:23:22] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/958807/43396/" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse) [16:23:39] (03CR) 10Btullis: [C: 03+2] Add the analytics and search-platform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis) [16:23:58] PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-debian-version-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:24:44] urbanecm: I can deploy your puppet patch now if you want, but in exchange tell me if you know why startupregistrystats-testwiki could be failing since ~1200 UTC today :p [16:25:22] RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:25:40] claime: thanks, puppet patch deploy would be helpful. We deployed wmf.27 to testwiki this morning, so maybe that? 
[16:26:12] urbanecm: https://phabricator.wikimedia.org/P52535 [16:26:18] (03CR) 10Clément Goubert: [C: 03+2] growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm) [16:26:23] (wmf.27 deploy happened at ~4 UTC, so...maybe not) [16:27:02] okay, i blame .27 preliminarily. i can check later today. do we have a task? [16:27:30] not yet, just found out, it started alerting at 1614 [16:28:33] 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) [16:28:39] !log Deployed https://gerrit.wikimedia.org/r/953344 - T345204 [16:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:45] T345204: Alert the Growth team when number of available task recommendations drops significantly - https://phabricator.wikimedia.org/T345204 [16:28:52] Running puppet on mwmaint1002 and you'll be good [16:29:14] ty [16:29:29] done [16:31:33] 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) 05Open→03Resolved While there are some outstanding issues due to lack of capacity in codfw, overall we're done here :-) [16:31:38] 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila) [16:31:40] urbanecm: Actually it started failing even earlier [16:31:43] Sep 19 04:10:30 mwmaint1002 systemd[1]: mediawiki_job_startupregistrystats-testwiki.service: Main process exited, code=exited, status=1/FAILURE [16:31:47] I'm putting a task together [16:32:11] that's...even closer to the wmf.27 deploy [16:32:36] RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space 
https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops [16:32:41] i'd mark that task as train blocker until it is investigated (by making it as a https://phabricator.wikimedia.org/T345888 subtask) [16:34:48] https://phabricator.wikimedia.org/T346800 [16:35:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10thcipriani) >>! In T342535#9171415, @RLazarus wrote: > @thcipriani Sorry for the back-and-forth, but just because it isn't 100% explicit from reading t... [16:39:54] PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:41:31] claime: is it intentional half of the paste disappeared? it looks like only left part of the stacktrace is there. 
[16:41:38] urbanecm: ugh [16:41:43] no not intentional [16:41:46] tmux shenanigans [16:42:23] urbanecm: should be good now [16:42:26] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:42:26] ty [16:42:42] RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:43:48] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:45:46] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://w [16:45:46] wikimedia.org/wiki/Services/Monitoring/restbase [16:48:34] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:50:25] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10odimitrijevic) Approved [16:51:34] 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) 05Open→03Resolved a:03Eevans AFAIK, everything this issue aimed to solve has been (we are installing 
Cassandra on Bullseye). Closing. [16:52:28] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:53:11] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate sessionstore servers to Bullseye - https://phabricator.wikimedia.org/T331714 (10Eevans) 05Open→03Resolved a:03Eevans macro-deployed [16:55:20] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [16:55:49] claime: as for why it happens, see https://phabricator.wikimedia.org/T346800#9179846 :). [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1700) [17:03:21] (03CR) 10Herron: [C: 03+1] Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey) [17:03:47] (03CR) 10Herron: [C: 03+1] o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [17:04:05] (03PS1) 10Jdlrobson: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) [17:09:56] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [17:13:58] PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:15:24] RECOVERY - Check systemd 
state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:42] 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Mabualruz) I am happy to attend another training session with access so I can try to gain some hands on experience. [17:20:21] (03CR) 10Ejegg: Allow FundraiseUp scripts in Donatewiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg) [17:23:32] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10KFrancis) Hello all, I am confirming Aisha Khatun has a NDA on file. Thank you! [17:23:34] (03CR) 10CI reject: [V: 04-1] Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [17:39:56] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:42:46] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [17:51:21] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Aklapper) @Ahoelzl: I apologize, my previous comments were likely confusing. (You cannot reset a password on mediawiki.org as it is a global SUL account and thus resets wou... 
[17:51:35] (03CR) 10Andrew Bogott: [C: 03+1] "This sent me on a quest to figure out what the difference is between the named arg --input and the positional patch file but the docs abso" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro) [17:54:51] (03CR) 10Btullis: [C: 03+1] Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [18:00:05] brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1800). [18:00:14] o/ [18:00:18] train currently blocked. [18:06:04] (03PS1) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [18:06:32] (03CR) 10CI reject: [V: 04-1] vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth) [18:08:11] (03PS2) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [18:08:36] (03CR) 10CI reject: [V: 04-1] vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth) [18:09:33] (03PS3) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [18:10:39] (03CR) 10RobH: [C: 03+2] add pki1002 to T342892 [puppet] - 10https://gerrit.wikimedia.org/r/958943 (https://phabricator.wikimedia.org/T342892) (owner: 10Jclark-ctr) [18:23:09] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:23:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - 
https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [18:26:40] PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:27:15] (03PS11) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 [18:27:17] (03CR) 10Ebernhardson: Draft: cirrus streaming updater service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson) [18:28:06] RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:29:34] PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:58] RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:33:16] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:34:40] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:36:51] (03PS1) 10Jforrester: Revert "ResourceLoader: Set 
'virtualFilePath' for startup.js" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) [18:44:19] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/958968 [18:45:51] (03CR) 10Bartosz Dziewoński: [C: 03+1] Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) (owner: 10Jforrester) [18:47:37] (03PS1) 10Andrew Bogott: dbproxy1018: depool clouddb1019 in favor of clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/959036 (https://phabricator.wikimedia.org/T346826) [18:49:31] James_F: thanks for revert, i'll deploy that one. [18:49:48] brennen: YW. [18:52:07] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) (owner: 10Jforrester) [18:52:13] (03CR) 10Andrew Bogott: [C: 03+2] dbproxy1018: depool clouddb1019 in favor of clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/959036 (https://phabricator.wikimedia.org/T346826) (owner: 10Andrew Bogott) [18:52:17] (03PS4) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [18:52:32] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:53:58] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [18:54:07] Jdlrobson: talk to me about T342277 - is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/959035/ 
just needed before train can roll? [18:54:08] T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277 [19:00:11] (03PS5) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [19:02:46] (03PS6) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [19:04:29] (03PS1) 10Jforrester: Wikifunctions: Update evaluator image to 2023-09-19-183305 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959037 [19:04:45] (03PS1) 10Eevans: Be explicit about the yaml loader class [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959038 [19:06:30] (03Merged) 10jenkins-bot: Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) (owner: 10Jforrester) [19:07:18] !log brennen@deploy2002 Started scap: Backport for [[gerrit:959009|Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" (T346800)]] [19:07:25] T346800: startupregistrystats-testwiki periodic job fails - https://phabricator.wikimedia.org/T346800 [19:08:18] (03PS1) 10Gmodena: data-engineering: eventgate: standardize alerts [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002) [19:14:14] (03PS1) 10Jdlrobson: Disable client preferences cog in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) [19:15:05] (03PS1) 10Jdlrobson: Fixes cannot read properties of undefined [extensions/MobileFrontend] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959010 (https://phabricator.wikimedia.org/T342277) [19:15:07] (03CR) 10CI reject: [V: 04-1] Disable client preferences cog in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson) [19:16:19] (03CR) 10Jforrester: [C: 03+1] "To make clear 
here as well as on Phabricator, there are no objections from the development team. Would you like me to deploy this, or shou" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757) [19:16:48] (03CR) 10Jforrester: [C: 03+1] "As with I3d1115e97, there are no objections from the development team. Would you like me to deploy this, or should I leave it to you to sc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757) [19:20:09] !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1015 [19:21:34] !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1015 [19:23:00] PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:24:24] RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:24:26] !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED [19:29:39] !log brennen@deploy2002 jforrester and brennen: Backport for [[gerrit:959009|Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" (T346800)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [19:29:45] T346800: startupregistrystats-testwiki periodic job fails - https://phabricator.wikimedia.org/T346800 [19:31:50] !log brennen@deploy2002 jforrester and brennen: Continuing with sync 
[19:33:17] (03CR) 10Kimberly Sarabia: Disable client preferences cog in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson) [19:33:43] (03CR) 10Brouberol: [C: 03+2] Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol) [19:37:24] (03PS1) 10Majavah: Set READ_NEW for Wikitech on OATHAuth multiple devices migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959042 (https://phabricator.wikimedia.org/T242031) [19:37:26] (03PS1) 10Majavah: Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959043 (https://phabricator.wikimedia.org/T242031) [19:37:27] jouncebot: nowandnext [19:37:27] For the next 0 hour(s) and 22 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1800) [19:37:27] In 0 hour(s) and 22 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T2000) [19:38:34] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:39:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:41:23] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - 
https://phabricator.wikimedia.org/T342164 (10VRiley-WMF) pc1016 - C 6. U 31. port 30 CableID 3252 is having issues, will recheck cabling [19:41:48] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:43:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:44:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:46:22] 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans) [19:47:54] (03PS1) 10Jdlrobson: Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345664) [19:48:01] (03Abandoned) 10Jdlrobson: Disable client preferences cog in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson) [19:48:05] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:959009|Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" (T346800)]] (duration: 40m 46s) [19:48:10] 10SRE, 10Cassandra, 10Data-Persistence: Migrate cassandra-dev to Bullseye - https://phabricator.wikimedia.org/T331711 (10Eevans) 05Open→03Resolved a:03Eevans macro-deployed [19:48:15] T346800: startupregistrystats-testwiki periodic job fails - 
https://phabricator.wikimedia.org/T346800 [19:48:38] PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:50:02] RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [19:51:09] (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [19:52:35] (03PS2) 10Jdlrobson: Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345363) [19:56:00] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [19:56:09] (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T2000). 
[20:00:05] Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:23] (unavailable this evening, sorry) [20:00:53] @brennen around per your phab comment? 2 of these 3 are train blockers [20:01:45] Jdlrobson: yep [20:01:52] shall we just go in order? [20:02:04] i can deploy too [20:02:10] brennen: just waiting on CI and a review from a team mate on 959013 so that should be later in the deploy window [20:02:13] or maybe brennen's on it? [20:02:16] (i guess also: can any of these go out together) [20:02:24] urbanecm: i can handle this one, doing the train today anyhow [20:02:32] ack, ty. [20:02:58] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/959010/ can o first [20:03:22] (03CR) 10Jdlrobson: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [20:03:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959010 (https://phabricator.wikimedia.org/T342277) (owner: 10Jdlrobson) [20:08:15] https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/959013 should be ready [20:10:24] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:10:27] ^ brennen [20:10:38] ack, waiting on previous patch. 
[20:11:03] 👍 [20:17:28] (03Merged) 10jenkins-bot: Fixes cannot read properties of undefined [extensions/MobileFrontend] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959010 (https://phabricator.wikimedia.org/T342277) (owner: 10Jdlrobson) [20:18:02] !log brennen@deploy2002 Started scap: Backport for [[gerrit:959010|Fixes cannot read properties of undefined (T342277)]] [20:18:08] T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277 [20:23:44] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 81, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:24:16] (03CR) 10Brennen Bearnes: [C: 03+2] Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson) [20:24:36] (just realized i could +2 that one to get tests moving.) 
[20:26:05] (03PS1) 10Fabfur: WIP: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049 [20:26:23] (03PS1) 10Fabfur: allow to specify buffer size for backend, frontend or both [software/purged] - 10https://gerrit.wikimedia.org/r/959050 [20:32:20] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [20:34:35] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudelastic1007-10 - jclark@cumin1001" [20:35:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt cloudelastic1007-10 - jclark@cumin1001" [20:35:22] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:36:01] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010 [20:36:04] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010 [20:36:38] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:37:02] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED [20:38:02] (03Merged) 10jenkins-bot: Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson) [20:38:34] !log brennen@deploy2002 jdlrobson and brennen: Backport for [[gerrit:959010|Fixes cannot read properties of undefined (T342277)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via 
k8s-experimental XWD option) [20:38:40] T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277 [20:39:01] Jdlrobson: anything to check on this one? [20:39:47] yeh i can verify on test wiki [20:40:40] brennen: ^ [20:41:31] Jdlrobson: ack, i await your signal. [20:41:43] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) [20:41:54] brennen: is it on debug servers? I'm not seeing the changes [20:42:28] brennen: ah now i am :) [20:42:37] the MobileFrontend is good to go [20:42:39] !log brennen@deploy2002 jdlrobson and brennen: Continuing with sync [20:42:42] cool, goin' [20:42:47] i'm not seeing the Vector one yet [20:42:50] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:43:02] presumably that's not synced yet? [20:43:15] yeah, not yet. 
[20:44:16] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:44:36] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [20:50:17] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) [20:50:27] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) (duration: 00m 09s) [20:50:42] jouncebot nowandnext [20:50:42] For the next 0 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T2000) [20:50:42] In 9 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0600) [20:51:40] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) [20:51:46] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) (duration: 00m 05s) [20:54:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) [20:55:41] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:959010|Fixes cannot read properties of undefined (T342277)]] (duration: 37m 39s) [20:55:48] T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277 [20:55:52] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki1002'] [20:57:03] !log brennen@deploy2002 Started scap: Backport for [[gerrit:959013|Disable client preferences by default (T345363)]] [20:57:07] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware 
(exit_code=0) upgrade firmware for hosts ['pki1002'] [20:57:09] T345363: Create font size settings interface functionality for vector - https://phabricator.wikimedia.org/T345363 [20:57:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) a:03Jclark-ctr [20:58:24] PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [20:58:55] this one should be a bit faster since it's already merged. [20:59:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007'] [20:59:50] RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:01:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED [21:01:46] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009'] [21:03:00] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Ahoelzl) Thanks. With help of tech support I claimed my mediawiki.org AHoelzl-WMF account. It wasn't straightforward though ... I was able to link it to Phabricator. Are yo... [21:05:10] Jdlrobson: correct in thinking https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/959007/ shouldn't block? 
[21:05:30] brennen: yeh i think urbanecm said he could do this tomorrow [21:05:35] there's a CI issue on it [21:05:39] kk [21:05:51] ah urbanecm just got back to me about the CI issue [21:05:55] once the vector one finishes, i'll roll train forward. [21:06:03] but yeh I think this will have to wait until tomorrow [21:06:08] sorry urbanecm [21:07:15] No worries. Can do it in the morning. [21:07:35] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1007'] [21:07:42] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008'] [21:07:44] But this error can be bypassed tbh. [21:11:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1009'] [21:11:31] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010'] [21:14:05] (03PS1) 10Urbanecm: build: Update eslint-config-wikimedia to 0.25.1 [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959014 (https://phabricator.wikimedia.org/T346629) [21:14:21] (03PS2) 10Urbanecm: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [21:14:28] (03PS3) 10Urbanecm: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [21:14:34] (03PS4) 10Urbanecm: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [21:16:00] !log jclark@cumin1001 END (PASS) - Cookbook 
sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008'] [21:16:09] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) [21:17:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) @bking @RKemper Please update when partman recipe in puppet repo is finished [21:17:26] !log brennen@deploy2002 jdlrobson and brennen: Backport for [[gerrit:959013|Disable client preferences by default (T345363)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option) [21:17:32] T345363: Create font size settings interface functionality for vector - https://phabricator.wikimedia.org/T345363 [21:17:54] Jdlrobson: ^ vector patch checkable? [21:20:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1010'] [21:21:48] brennen: yep checking now [21:22:22] brennen: yep all good [21:22:30] please sync and roll forward the train! 
[21:24:58] cool, ty [21:25:01] !log brennen@deploy2002 jdlrobson and brennen: Continuing with sync [21:26:04] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:26:53] (03CR) 10Jdlrobson: [C: 03+1] Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson) [21:28:58] (03CR) 10Herron: [V: 03+1 C: 03+2] titan: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron) [21:29:03] !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db12[26-33] - jclark@cumin1001" [21:29:50] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) [21:29:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db12[26-33] - jclark@cumin1001" [21:29:51] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:30:19] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1226 [21:30:22] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1227 [21:31:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1227 [21:31:26] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1229 [21:31:41] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1226 [21:31:46] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1230 [21:32:20] !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [21:32:24] !log cmooney@cumin1001 START - Cookbook 
sre.dns.netbox [21:32:47] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1229 [21:32:57] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1230 [21:33:44] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:34:52] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1231 [21:34:54] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1232 [21:34:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1232 [21:35:09] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1233 [21:36:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1231 [21:36:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1233 [21:36:44] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:37:04] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1232 [21:37:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1232 [21:37:49] !log brennen@deploy2002 Finished scap: Backport for [[gerrit:959013|Disable 
client preferences by default (T345363)]] (duration: 40m 45s) [21:37:53] thanks brennen. good luck with the train! [21:37:54] T345363: Create font size settings interface functionality for vector - https://phabricator.wikimedia.org/T345363 [21:38:06] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [21:39:17] (03PS7) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [21:39:44] (03PS8) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [21:41:13] !log train 1.41.0-wmf.27 (T345888): blockers resolved; rolling to group0 [21:41:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:25] T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888 [21:41:28] (03CR) 10Cwhite: [C: 03+1] o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi) [21:41:42] (03PS1) 10TrainBranchBot: group0 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959058 (https://phabricator.wikimedia.org/T345888) [21:41:44] (03CR) 10Cwhite: [C: 03+1] remove dispatch dns record [dns] - 10https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [21:41:46] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959058 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [21:42:33] (03CR) 10Cwhite: [C: 03+1] dispatch: remove puppetization [puppet] - 10https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [21:43:06] (03CR) 10Cwhite: [C: 03+1] dispatch::web: add ensure param and ensure => absent [puppet] - 10https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: 10Herron) [21:43:09] 
(03Merged) 10jenkins-bot: group0 wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959058 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot) [21:43:44] (03PS9) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [21:43:56] (03PS10) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 [21:45:05] (03CR) 10Cwhite: [C: 03+2] rsyslog: ingest 'excimer' logs from webperf to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: 10Krinkle) [21:45:39] (03PS1) 10Ebernhardson: Draft: Pull some flink config down into the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) [21:45:40] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [21:46:58] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:47:14] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1232 [21:48:21] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1232 [21:49:45] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1226.mgmt.eqiad.wmnet with reboot policy FORCED [21:49:46] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [21:49:48] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1230.mgmt.eqiad.wmnet with reboot policy FORCED [21:49:50] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1228.mgmt.eqiad.wmnet with reboot policy FORCED [21:49:51] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1227.mgmt.eqiad.wmnet with reboot policy FORCED [21:50:35] !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.27 refs T345888 [21:50:40] T345888: 1.41.0-wmf.27 deployment 
blockers - https://phabricator.wikimedia.org/T345888 [21:51:34] (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:51:51] (03CR) 10Ebernhardson: "This will also need a puppet patch" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: 10Ebernhardson) [21:56:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [21:57:19] (03PS2) 10Bking: rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) [21:58:51] (03CR) 10Bking: rdf-streaming-updater: start adding per-env ZK path root (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: 10Bking) [22:23:09] (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:23:32] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [22:37:04] PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: 
/en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:38:28] RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:40:12] RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:44:30] PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:45:26] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [22:46:12] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:46:18] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [22:47:36] RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [22:51:24] !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) [22:51:29] !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) (duration: 00m 05s) [22:54:05] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [22:56:03] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision 
(exit_code=0) for host db1226.mgmt.eqiad.wmnet with reboot policy FORCED [22:56:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1227.mgmt.eqiad.wmnet with reboot policy FORCED [22:57:18] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1230.mgmt.eqiad.wmnet with reboot policy FORCED [22:57:23] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1228.mgmt.eqiad.wmnet with reboot policy FORCED [22:58:22] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1231.mgmt.eqiad.wmnet with reboot policy FORCED [22:58:23] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1232.mgmt.eqiad.wmnet with reboot policy FORCED [22:58:25] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1233.mgmt.eqiad.wmnet with reboot policy FORCED [23:07:40] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:09:06] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:12:49] (03PS1) 10Cwhite: Revert "Add the analytics and search-platform teams to flink zk contacts" [puppet] - 10https://gerrit.wikimedia.org/r/959015 [23:13:36] (03CR) 10Cwhite: [C: 03+2] Revert "Add the analytics and search-platform teams to flink zk contacts" [puppet] - 10https://gerrit.wikimedia.org/r/959015 (owner: 10Cwhite) [23:16:16] PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:18:39] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1231.mgmt.eqiad.wmnet with reboot policy FORCED [23:18:43] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1233.mgmt.eqiad.wmnet with reboot policy FORCED [23:19:06] RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:23:16] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) [23:24:10] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [23:26:20] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1232.mgmt.eqiad.wmnet with reboot policy FORCED [23:26:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) [23:27:22] 10SRE, 10AQS2.0, 10Cassandra, 10serviceops, 10Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (10Eevans) Apologies for the amount of time that has passed, I only just noticed this ticket. I took a quick scan of the repo and hav... 
[23:29:28] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [23:30:49] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [23:31:15] !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [23:35:31] PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:36:39] RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:40:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr) [23:51:02] !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED [23:51:43] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1226'] [23:51:47] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1227'] [23:52:15] PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [23:52:49] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1228'] [23:52:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1229'] [23:52:57] !log jclark@cumin1001 START - Cookbook 
sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1230'] [23:55:03] RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase