[00:01:05] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:10:31] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) firing: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:24:25] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:24:57] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:25:19] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:30:31] <jinxer-wm>	 (Wikidata Reliability Metrics - Median loading time alert) resolved: <no value>   - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+Median+loading+time+alert
[00:32:17] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:35:59] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.288 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:36:51] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50567 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[00:38:14] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/957818
[00:38:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/957818 (owner: 10TrainBranchBot)
[00:51:49] <icinga-wm>	 RECOVERY - dump of s5 in codfw on backupmon1001 is OK: Last dump for s5 at codfw (db2101) taken on 2023-09-19 00:00:15 (61 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[00:52:53] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/957818 (owner: 10TrainBranchBot)
[00:56:21] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[00:57:45] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:08:49] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346708 (10phaultfinder)
[01:09:47] <icinga-wm>	 RECOVERY - dump of s5 in eqiad on backupmon1001 is OK: Last dump for s5 at eqiad (db1216) taken on 2023-09-19 00:00:03 (61 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[01:15:19] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:40:03] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:46:31] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:49:59] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[01:51:21] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:00:04] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0200)
[02:04:37] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:32] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:06:52] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/1.41.0-wmf.27 [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/957819 (https://phabricator.wikimedia.org/T345888)
[02:06:58] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.41.0-wmf.27 [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/957819 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot)
[02:09:37] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:14:37] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:20:52] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.27 [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/957819 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot)
[02:21:32] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:36:31] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:37:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:41:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:42:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:50:15] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:52:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2013 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:54:09] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2013 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[02:56:31] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0300)
[03:01:25] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958595 (https://phabricator.wikimedia.org/T345888)
[03:01:27] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958595 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot)
[03:02:09] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958595 (https://phabricator.wikimedia.org/T345888) (owner: 10TrainBranchBot)
[03:02:43] <logmsgbot>	 !log mwpresync@deploy1002 Started scap: testwikis wikis to 1.41.0-wmf.27  refs T345888
[03:02:47] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[03:04:37] <jinxer-wm>	 (JobUnavailable) firing: (7) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:11:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:14:59] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:20:23] <icinga-wm>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL: Improperly owned (0:0) files in /srv/mediawiki-staging https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[03:25:44] <hashar>	 PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is CRITICAL
[03:27:12] <hashar>	 that broke a mediawiki train
[03:27:25] <hashar>	 see eg https://gerrit.wikimedia.org/r/c/operations/puppet/+/927674
[03:30:45] <icinga-wm>	 RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy2002 is OK: Files ownership is ok. https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner
[03:37:40] <wikibugs>	 (03Abandoned) 10Anzx: add extranamespacenames for kannada-kn language wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958050 (https://phabricator.wikimedia.org/T346583) (owner: 10Anzx)
[04:01:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:03:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:03:49] <logmsgbot>	 !log mwpresync@deploy1002 Finished scap: testwikis wikis to 1.41.0-wmf.27  refs T345888 (duration: 61m 05s)
[04:03:59] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[04:06:01] <logmsgbot>	 !log mwpresync@deploy1002 Pruned MediaWiki: 1.41.0-wmf.25 (duration: 02m 10s)
[04:08:43] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:10:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:14:29] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:28:31] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: deploy eswiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958598 (https://phabricator.wikimedia.org/T346445)
[04:31:18] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: deploy eswiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958598 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[04:32:30] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: deploy eswiki in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958598 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[04:35:29] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[04:39:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[04:40:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:15:41] <Bsadowski1>	 "2023-09-19 05:07:47: Fatal exception of type "Wikimedia\Rdbms\DBQueryTimeoutError""
[05:15:42] <Bsadowski1>	 hmm
[05:24:20] <wikibugs>	 (03Restored) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582) (owner: 10Anzx)
[05:32:59] <wikibugs>	 (03PS4) 10Anzx: Enable wgMinervaEnableSiteNotice for knwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958049 (https://phabricator.wikimedia.org/T346582)
[05:43:07] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:43:15] <icinga-wm>	 PROBLEM - thanos.wikimedia.org requires authentication on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:43:25] <icinga-wm>	 PROBLEM - thanos.wikimedia.org tls expiry on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:43:31] <icinga-wm>	 PROBLEM - SSH on titan1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:43:49] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1019 is CRITICAL: PYBAL CRITICAL - CRITICAL - thanos-web_443: Servers titan1001.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[05:45:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1134', diff saved to https://phabricator.wikimedia.org/P52522 and previous config saved to /var/cache/conftool/dbconfig/20230919-054539-root.json
[05:46:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1134.eqiad.wmnet onto db1128.eqiad.wmnet
[05:46:31] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:48:04] <wikibugs>	 (03PS1) 10Marostegui: db1134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/958601
[05:48:33] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] db1134: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/958601 (owner: 10Marostegui)
[05:48:44] <logmsgbot>	 !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudservices2004-dev.codfw.wmnet with OS bookworm
[05:54:27] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:54:55] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:55:35] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[05:55:37] <icinga-wm>	 RECOVERY - thanos.wikimedia.org requires authentication on titan1001 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 544 bytes in 0.047 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:55:47] <icinga-wm>	 RECOVERY - thanos.wikimedia.org tls expiry on titan1001 is OK: OK - Certificate thanos-query.discovery.wmnet will expire on Mon 21 Jul 2025 03:04:56 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[05:55:51] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[05:55:53] <icinga-wm>	 RECOVERY - SSH on titan1001 is OK: SSH OK - OpenSSH_9.2p1 Debian-2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[05:56:31] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0600).
[06:09:30] <XioNoX>	 !log push new pfw policy - T346705
[06:09:32] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:12:17] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto)
[06:16:39] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:18:05] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:27:41] <wikibugs>	 (03PS1) 10Andrea Denisse: prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656)
[06:29:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse)
[06:32:19] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:33:19] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[06:35:16] <denisse>	 !log updating PCC facts
[06:35:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:37:17] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: update kserve 0.11 in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958808 (https://phabricator.wikimedia.org/T346445)
[06:39:44] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[06:44:44] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[06:51:54] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add the configuration for the new wikikube hosts in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714)
[06:52:17] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463
[06:52:19] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709)
[06:52:21] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709)
[06:52:23] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714)
[06:52:25] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714)
[06:54:51] <wikibugs>	 (03PS2) 10KartikMistry: Disable Special:Contribute on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772)
[06:59:07] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: update kserve 0.11 in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958808 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[06:59:32] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10MoritzMuehlenhoff) Also needs approval by @Gehel being the approver for elasticsearch-roots et al.
[07:00:07] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (once approvals on the Phab tasks are in)" [puppet] - 10https://gerrit.wikimedia.org/r/958568 (https://phabricator.wikimedia.org/T346694) (owner: 10Dr0ptp4kt)
[07:00:08] <jouncebot>	 Amir1, Urbanecm, and taavi: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T0700)
[07:00:08] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:24] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update kserve 0.11 in ml staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/958808 (https://phabricator.wikimedia.org/T346445) (owner: 10Ilias Sarantopoulos)
[07:02:14] * kart_ is here and will self-deploy.
[07:02:31] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) (owner: 10KartikMistry)
[07:03:16] <wikibugs>	 (03Merged) 10jenkins-bot: Disable Special:Contribute on bnwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/956078 (https://phabricator.wikimedia.org/T345772) (owner: 10KartikMistry)
[07:04:10] <logmsgbot>	 !log kartik@deploy1002 Started scap: Backport for [[gerrit:956078|Disable Special:Contribute on bnwiki (T345772)]]
[07:04:14] <stashbot>	 T345772: Disable Special:Contribute on bnwiki - https://phabricator.wikimedia.org/T345772
[07:06:09] <wikibugs>	 (03CR) 10Arnaudb: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/957820 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[07:06:11] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[07:06:22] <wikibugs>	 (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[07:07:23] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "PCC https://puppet-compiler.wmflabs.org/output/927674/2301/" [puppet] - 10https://gerrit.wikimedia.org/r/927674 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[07:08:35] <wikibugs>	 (03CR) 10Jelto: "Adding Jaime because we remove the backup of static-codereview files. The files live in Git now: https://gitlab.wikimedia.org/repos/sre/mi" [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto)
[07:09:29] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10MoritzMuehlenhoff) >>! In T335879#9173531, @Volans wrote: > This leave us just with two options: > * catch the exception in the cookbooks...
[07:11:58] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' .
[07:15:11] <kart_>	 scap seems to have been stuck for the last 5 minutes at: `07:06:02 K8s images build/push output redirected to /home/kartik/scap-image-build-and-push-log`. Or is that normal?
[07:16:10] <wikibugs>	 (03CR) 10Hashar: "https://puppet-compiler.wmflabs.org/output/927675/2317/" [puppet] - 10https://gerrit.wikimedia.org/r/927675 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[07:16:12] <wikibugs>	 (03PS1) 10Ayounsi: LibreNMS report: remove MODEL_EXCLUDES filter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813
[07:16:14] <wikibugs>	 (03PS1) 10Ayounsi: LibreNMS report: add equivalent model strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519)
[07:16:16] <wikibugs>	 (03PS1) 10Ayounsi: LibreNMS report: use black formating [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815
[07:16:38] <wikibugs>	 (03CR) 10Hashar: "Puppet compiler https://puppet-compiler.wmflabs.org/output/927676/2318/" [puppet] - 10https://gerrit.wikimedia.org/r/927676 (https://phabricator.wikimedia.org/T338205) (owner: 10Hashar)
[07:21:02] <kart_>	 In any case, scap seems super slow today?
[07:22:02] <hashar>	 always on Tuesday
[07:22:12] <hashar>	 kart_: it is busy synchronizing the new wmf version that got cut overnight
[07:22:23] <kart_>	 ah.
[07:22:47] <kart_>	 I'm deploying a config change; it has taken 20 minutes and still has not reached the debug servers :/
[07:24:40] <kart_>	 MW deployment window seems late today? https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1800
[07:26:52] <logmsgbot>	 !log kartik@deploy1002 kartik: Backport for [[gerrit:956078|Disable Special:Contribute on bnwiki (T345772)]] synced to the testservers mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[07:26:58] <stashbot>	 T345772: Disable Special:Contribute on bnwiki - https://phabricator.wikimedia.org/T345772
[07:27:39] <logmsgbot>	 !log kartik@deploy1002 kartik: Continuing with sync
[07:32:11] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/958417
[07:33:13] <wikibugs>	 (03PS4) 10Slyngshede: WIP: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721)
[07:33:34] <jinxer-wm>	 (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:36:12] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43340/console" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[07:38:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (5) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[07:39:10] <wikibugs>	 (03PS5) 10Slyngshede: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721)
[07:39:18] <wikibugs>	 (03CR) 10Klausman: [C: 03+1] alertmanager: create ml team alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[07:40:35] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43341/console" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[07:40:47] <wikibugs>	 (03CR) 10Slyngshede: P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[07:42:59] <logmsgbot>	 !log kartik@deploy1002 Finished scap: Backport for [[gerrit:956078|Disable Special:Contribute on bnwiki (T345772)]] (duration: 38m 49s)
[07:43:03] <stashbot>	 T345772: Disable Special:Contribute on bnwiki - https://phabricator.wikimedia.org/T345772
[07:43:23] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:44:16] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "If tested that Q() behaves as expected looks good to me." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 (owner: 10Ayounsi)
[07:44:47] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:51:07] <moritzm>	 !log installing libwep security updates on buster
[07:51:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:51:12] <moritzm>	 !log installing libwebp security updates on buster
[07:51:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:53:41] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:55:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[07:56:58] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1146.eqiad.wmnet with OS bullseye
[07:59:51] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[08:00:25] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[08:02:05] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply
[08:02:25] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:02:36] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply
[08:04:46] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1134.eqiad.wmnet onto db1128.eqiad.wmnet
[08:05:02] <moritzm>	 !log restarting FPM on mw canaries to pick up libwebp updates
[08:05:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:11] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply
[08:05:25] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.429 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[08:05:28] <logmsgbot>	 !log brouberol@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply
[08:10:13] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:idm switch idm2001 to Debian package [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[08:10:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/957669 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[08:10:36] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage
[08:11:41] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.reimage for host idm2001.wikimedia.org with OS bookworm
[08:11:51] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bookworm
[08:12:52] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826)
[08:13:04] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1146.eqiad.wmnet with reason: host reimage
[08:16:45] <wikibugs>	 10SRE: Icinga contact for dr0ptp4kt - https://phabricator.wikimedia.org/T346688 (10Peachey88)
[08:17:32] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[08:18:09] <hashar>	 kart_: is that still syncing? I think scap/rsync/whatever has some issues indeed
[08:18:20] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[08:18:22] <hashar>	 at a quick glance it seems the rsyncs are taking way longer than usual, but I have to dig into the logs
[08:18:42] <brouberol>	 !log redeploying eventgate-analytics in codfw T336041
[08:18:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:18:45] <stashbot>	 T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:18:53] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[08:19:44] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[08:20:34] <brouberol>	 !log redeploying eventgate-analytics-external in eqiad T336041
[08:20:36] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply
[08:20:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:16] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply
[08:21:41] <brouberol>	 !log redeploying eventgate-analytics-external in codfw T336041
[08:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:45] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply
[08:22:25] <kart_>	 hashar: no. It is done.
[08:22:35] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply
[08:22:57] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:23:10] <brouberol>	 !log redeploying eventstreams-internal in codfw T336041
[08:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:24] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply
[08:23:33] <hashar>	 scap cdb rebuild went from a steady 120 seconds median time to 180 seconds last week and 230 seconds this week
[08:23:52] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply
[08:24:12] <brouberol>	 !log redeploying eventstreams-internal in eqiad T336041
[08:24:15] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:24:15] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply
[08:24:15] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:24:19] <stashbot>	 T336041: Bring kafka-jumbo10[09-15] into service - https://phabricator.wikimedia.org/T336041
[08:24:39] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply
[08:25:38] <brouberol>	 !log redeploying mw-page-content-change-enrich in eqiad T336041
[08:25:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-codfw
[08:25:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:25:49] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:26:03] <logmsgbot>	 !log brouberol@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:26:36] <brouberol>	 !log redeploying mw-page-content-change-enrich in codfw T336041
[08:26:36] <logmsgbot>	 !log slyngshede@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on idm2001.wikimedia.org with reason: host reimage
[08:26:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:45] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[08:26:55] <logmsgbot>	 !log brouberol@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[08:27:40] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: create ml team alerts [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[08:28:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Remove the use of cergen certs from apiserver [puppet] - 10https://gerrit.wikimedia.org/r/958405 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[08:29:06] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idm2001.wikimedia.org with reason: host reimage
[08:30:42] <wikibugs>	 (03CR) 10Muehlenhoff: mariadb: Add grants for testreduce1002 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957251 (https://phabricator.wikimedia.org/T345220) (owner: 10Ladsgroup)
[08:30:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-codfw
[08:32:22] <wikibugs>	 (03PS2) 10JMeybohm: kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826)
[08:34:28] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.maps.roll-restart-reboot rolling restart_daemons on A:maps-replica-eqiad
[08:35:11] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] kubernetes::master: Cleanup absent cergen resource [puppet] - 10https://gerrit.wikimedia.org/r/958426 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[08:36:16] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1146.eqiad.wmnet with OS bullseye
[08:39:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.maps.roll-restart-reboot (exit_code=0) rolling restart_daemons on A:maps-replica-eqiad
[08:41:27] <godog>	 !log remove MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from graphite - T346371
[08:41:29] <wikibugs>	 (03PS1) 10Slyngshede: P:idm fix log file path. [puppet] - 10https://gerrit.wikimedia.org/r/958888
[08:41:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:41:31] <stashbot>	 T346371: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite - https://phabricator.wikimedia.org/T346371
[08:42:11] <wikibugs>	 10SRE, 10Growth-Team, 10Observability-Metrics, 10Graphite: Delete MediaWiki.*.growthexperiments.taskcount.link_recommendation.* from Graphite - https://phabricator.wikimedia.org/T346371 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Thank you for reaching out @Urbanecm_WMF and letting us know about...
[08:42:50] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43342/console" [puppet] - 10https://gerrit.wikimedia.org/r/958888 (owner: 10Slyngshede)
[08:43:53] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2003.codfw.wmnet
[08:44:34] <godog>	 !log bounce benthos@webrequest_live to clear out old metrics
[08:44:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:44:51] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:idm fix log file path. [puppet] - 10https://gerrit.wikimedia.org/r/958888 (owner: 10Slyngshede)
[08:47:46] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2003.codfw.wmnet
[08:48:12] <wikibugs>	 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10aborrero)
[08:48:52] <wikibugs>	 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloudswitch: codfw: figure out procurement - https://phabricator.wikimedia.org/T346724 (10aborrero) a:03cmooney hey @cmooney could you please advise on the cloudswitch models we would need in codfw to expand our capacity in that DC?
[08:49:06] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] "tests aren't happy:" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[08:51:59] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add configuration for the new kubernetes node in codfw [homer/public] - 10https://gerrit.wikimedia.org/r/958489 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto)
[08:52:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto)
[08:53:48] <wikibugs>	 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10aborrero)
[08:54:14] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709) (owner: 10Giuseppe Lavagetto)
[08:54:37] <wikibugs>	 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: prepare codfw for expansion (racks, switches, ceph) - https://phabricator.wikimedia.org/T346661 (10aborrero)
[08:54:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] netbox: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/950167 (owner: 10Muehlenhoff)
[08:55:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:56:25] <wikibugs>	 10SRE, 10ops-codfw, 10User-aborrero, 10cloud-services-team (Hardware): cloud: codfw: decide on new ceph cluster details - https://phabricator.wikimedia.org/T346725 (10aborrero)
[08:56:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[08:58:40] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Add the configuration for the new wikikube hosts in eqiad [homer/public] - 10https://gerrit.wikimedia.org/r/958809 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto)
[08:59:51] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2002.codfw.wmnet
[09:00:02] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto)
[09:00:05] <wikibugs>	 (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43348/console" [puppet] - 10https://gerrit.wikimedia.org/r/957803 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall)
[09:01:25] <wikibugs>	 (03CR) 10Jelto: [V: 03+1 C: 03+2] "lgtm, default Gemfile has 644 as well." [puppet] - 10https://gerrit.wikimedia.org/r/957803 (https://phabricator.wikimedia.org/T337570) (owner: 10Dduvall)
[09:02:31] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] kubernetes: add kubernetes10[27-56] to wikikube (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714) (owner: 10Giuseppe Lavagetto)
[09:03:20] <wikibugs>	 (03CR) 10Arnaudb: "unnecessary change" [puppet] - 10https://gerrit.wikimedia.org/r/957820 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[09:03:25] <wikibugs>	 (03Abandoned) 10Arnaudb: icinga: fix Arnaudb on icinga userlist [puppet] - 10https://gerrit.wikimedia.org/r/957820 (https://phabricator.wikimedia.org/T346610) (owner: 10Arnaudb)
[09:03:45] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2002.codfw.wmnet
[09:04:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:06:04] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[09:08:14] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM ml-staging-etcd2001.codfw.wmnet
[09:08:23] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[09:08:48] <wikibugs>	 (03Abandoned) 10Arturo Borrero Gonzalez: cloudgw: add NFS ratelimit [puppet] - 10https://gerrit.wikimedia.org/r/691154 (owner: 10Arturo Borrero Gonzalez)
[09:11:40] <wikibugs>	 10SRE, 10Data-Persistence, 10observability: Onboard arnaudb on Icinga - https://phabricator.wikimedia.org/T346610 (10ABran-WMF) 05Open→03Resolved
[09:12:08] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM ml-staging-etcd2001.codfw.wmnet
[09:13:56] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] "Re: backups- looks good to me, no other references, should be safe to deploy, and files will continue in its current format for 3 months o" [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto)
[09:14:40] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Stewards-and-global-tools, 10collaboration-services, 10vm-requests: 1 VMs requested for stewards - https://phabricator.wikimedia.org/T344164 (10LSobanski)
[09:19:38] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations: Automatic detection of inactive LDAP account - https://phabricator.wikimedia.org/T335478 (10dcaro) This might be interesting also for the cloud services projects (toolforge/cloudvps/...) as we have to manage also many abandoned/unresponsive developer accounts and...
[09:24:08] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] phabricator: Stop logging Bugzilla redirector misses [puppet] - 10https://gerrit.wikimedia.org/r/952047 (https://phabricator.wikimedia.org/T344884) (owner: 10Aklapper)
[09:28:04] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Increase the kafka-jumbo maximum message size to 10 MB [puppet] - 10https://gerrit.wikimedia.org/r/952160 (https://phabricator.wikimedia.org/T307959) (owner: 10Btullis)
[09:29:42] <wikibugs>	 (03PS1) 10Slyngshede: P:IDM Use default MySQL backend on package installation. [puppet] - 10https://gerrit.wikimedia.org/r/958892
[09:29:45] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb/microsites: move monitoring of static-codereview to monitoring profile [puppet] - 10https://gerrit.wikimedia.org/r/958474 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto)
[09:30:28] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: apply the patch to override cloud.yaml on the cli [puppet] - 10https://gerrit.wikimedia.org/r/957942 (owner: 10David Caro)
[09:30:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43351/console" [puppet] - 10https://gerrit.wikimedia.org/r/958892 (owner: 10Slyngshede)
[09:32:39] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43353/console" [puppet] - 10https://gerrit.wikimedia.org/r/958892 (owner: 10Slyngshede)
[09:32:47] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Use default MySQL backend on package installation. [puppet] - 10https://gerrit.wikimedia.org/r/958892 (owner: 10Slyngshede)
[09:33:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch netboxdb to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/958894
[09:33:32] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958894 (owner: 10Muehlenhoff)
[09:35:28] <wikibugs>	 (03PS1) 10David Caro: openstack: fix source path for cli patch [puppet] - 10https://gerrit.wikimedia.org/r/958895
[09:35:50] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1134: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/958417 (owner: 10Marostegui)
[09:35:53] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] openstack: fix source path for cli patch [puppet] - 10https://gerrit.wikimedia.org/r/958895 (owner: 10David Caro)
[09:36:23] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 1%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52523 and previous config saved to /var/cache/conftool/dbconfig/20230919-093622-root.json
[09:36:46] <wikibugs>	 (03Abandoned) 10Elukey: WIP: improve Lift Wing's SLO/SLI calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/955958 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[09:36:50] <wikibugs>	 (03PS1) 10Slyngshede: P:IDM Add mysql Python driver [puppet] - 10https://gerrit.wikimedia.org/r/958896
[09:37:24] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+1] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43354/console" [puppet] - 10https://gerrit.wikimedia.org/r/958895 (owner: 10David Caro)
[09:38:12] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43355/console" [puppet] - 10https://gerrit.wikimedia.org/r/958896 (owner: 10Slyngshede)
[09:38:14] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/958896 (owner: 10Slyngshede)
[09:38:24] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] openstack: fix source path for cli patch [puppet] - 10https://gerrit.wikimedia.org/r/958895 (owner: 10David Caro)
[09:38:59] <wikibugs>	 (03PS1) 10Elukey: WIP: Improve ORES-Legacy's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897
[09:39:12] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:IDM Add mysql Python driver [puppet] - 10https://gerrit.wikimedia.org/r/958896 (owner: 10Slyngshede)
[09:40:02] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.kafka.roll-restart-brokers for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[09:40:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:41:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:42:50] <wikibugs>	 (03PS1) 10JMeybohm: Drop kubernetes cergen certs [labs/private] - 10https://gerrit.wikimedia.org/r/958898 (https://phabricator.wikimedia.org/T329826)
[09:42:52] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1147.eqiad.wmnet with OS bullseye
[09:44:56] <wikibugs>	 (03PS2) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620)
[09:45:51] <wikibugs>	 (03PS3) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620)
[09:48:35] <logmsgbot>	 !log slyngshede@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host idm2001.wikimedia.org with OS bookworm
[09:48:44] <wikibugs>	 10SRE, 10Bitu, 10Infrastructure-Foundations, 10Patch-For-Review: Build Debian packages for Bookworm - https://phabricator.wikimedia.org/T340721 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by slyngshede@cumin1001 for host idm2001.wikimedia.org with OS bookworm completed: - idm2001 (*...
[09:50:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2026 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:51:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[09:51:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 3%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52524 and previous config saved to /var/cache/conftool/dbconfig/20230919-095127-root.json
[09:56:13] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Drop kubernetes cergen certs [labs/private] - 10https://gerrit.wikimedia.org/r/958898 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[09:56:20] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) (owner: 10Ayounsi)
[09:56:29] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage
[09:56:32] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:56:40] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 (owner: 10Ayounsi)
[09:58:26] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1148.eqiad.wmnet with OS bullseye
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1000)
[10:01:33] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1147.eqiad.wmnet with reason: host reimage
[10:05:54] <icinga-wm>	 PROBLEM - puppet last run on puppetdb1002 is CRITICAL: CRITICAL: Puppet has been disabled for 604990 seconds, message: Stop Puppet/Puppetdb/Postgres to ensure nothing hits the legacy servers, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:06:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 5%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52525 and previous config saved to /var/cache/conftool/dbconfig/20230919-100632-root.json
[10:06:53] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:06:55] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:06:58] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Copy mesh.certificate_1.0.0 to 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/958482 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:07:41] <wikibugs>	 (03Merged) 10jenkins-bot: Copy mesh.certificate_1.0.0 to 1.0.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/958482 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:07:55] <wikibugs>	 (03Merged) 10jenkins-bot: mesh.certificate: Don't create certificates if mesh is not enabled [deployment-charts] - 10https://gerrit.wikimedia.org/r/958483 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: Update changeprop to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/958484 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm)
[10:11:59] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage
[10:12:26] <wikibugs>	 (03PS7) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175)
[10:12:50] <icinga-wm>	 PROBLEM - puppet last run on puppetdb2002 is CRITICAL: CRITICAL: Puppet has been disabled for 604896 seconds, message: Stop Puppet/Puppetdb/Postgres/Nginx/microservice to ensure nothing hits the legacy servers, last run 7 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[10:15:01] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1148.eqiad.wmnet with reason: host reimage
[10:20:09] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43361/console" [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[10:21:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 10%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52526 and previous config saved to /var/cache/conftool/dbconfig/20230919-102137-root.json
[10:25:00] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1147.eqiad.wmnet with OS bullseye
[10:28:02] <wikibugs>	 (CR) Milimetric: [C: +2] Fix typo in Jade content type name (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[10:29:44] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[10:30:55] <wikibugs>	 (CR) Majavah: Fix typo in Jade content type name (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[10:32:39] <wikibugs>	 (CR) Ladsgroup: Fix typo in Jade content type name (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/957260 (https://phabricator.wikimedia.org/T345874) (owner: Ladsgroup)
[10:34:16] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] "I would make this a patchlevel version tbh. It's fully backwards compatible, does not change anything except internal logic and patchlevel" [deployment-charts] - 10https://gerrit.wikimedia.org/r/956441 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[10:34:20] <Emperor>	 !log codfw swift front-end swift package updates T346730
[10:34:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:24] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[10:36:30] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[10:36:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 25%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52527 and previous config saved to /var/cache/conftool/dbconfig/20230919-103642-root.json
[10:38:15] <wikibugs>	 (CR) JMeybohm: profile::service_proxy::envoy: rename uses_ingress to sets_sni (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: Elukey)
[10:38:29] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host an-worker1148.eqiad.wmnet with OS bullseye
[10:40:07] <wikibugs>	 (03PS5) 10JMeybohm: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[10:40:13] <wikibugs>	 (03CR) 10JMeybohm: [C: 04-1] profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[10:40:18] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add initial support to move cloudgw to profile::firewall using the nft provider [puppet] - 10https://gerrit.wikimedia.org/r/958480 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[10:42:05] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.138 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:42:17] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:42:37] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe2010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:43:59] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.140 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:44:53] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe2010 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.199 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:45:05] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:45:47] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042)
[10:46:55] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 04-1] "PCC SUCCESS (CORE_DIFF 26): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43364/console" [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[10:51:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 50%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52528 and previous config saved to /var/cache/conftool/dbconfig/20230919-105147-root.json
[10:54:02] <wikibugs>	 (03PS4) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620)
[10:57:16] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497)
[10:58:37] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[10:58:56] <wikibugs>	 (03PS6) 10Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638)
[10:59:59] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[11:00:14] <wikibugs>	 (CR) Elukey: profile::service_proxy::envoy: rename uses_ingress to sets_sni (1 comment) [puppet] - https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: Elukey)
[11:01:25] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:02:38] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042)
[11:02:56] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[11:03:00] <wikibugs>	 (03PS5) 10Elukey: Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620)
[11:03:43] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+1] wmnet: Update pc cnames to codfw [dns] - 10https://gerrit.wikimedia.org/r/958464 (https://phabricator.wikimedia.org/T346474) (owner: 10Marostegui)
[11:04:02] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042)
[11:05:13] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[11:06:43] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[11:06:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 75%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52529 and previous config saved to /var/cache/conftool/dbconfig/20230919-110651-root.json
[11:09:18] <Emperor>	 !log eqiad swift front-end swift package updates T346730
[11:09:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:09:22] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[11:10:38] <wikibugs>	 (03PS2) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497)
[11:11:49] <wikibugs>	 (03PS4) 10Arturo Borrero Gonzalez: openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042)
[11:12:08] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:12:17] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[11:12:31] <wikibugs>	 (03PS1) 10Slyngshede: C:idm::redis Allow replication via IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/958907
[11:14:40] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: pdns: refactor monitor checks [puppet] - 10https://gerrit.wikimedia.org/r/958904 (https://phabricator.wikimedia.org/T346042) (owner: 10Arturo Borrero Gonzalez)
[11:16:33] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.023 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:16:45] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 0.004 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:17:27] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe1010 is CRITICAL: CRITICAL - degraded: The following units failed: swift-proxy.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:18:27] <wikibugs>	 (03PS3) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497)
[11:20:13] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:21:41] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:21:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1134 (re)pooling @ 100%: Repooling after recloning db1128', diff saved to https://phabricator.wikimedia.org/P52530 and previous config saved to /var/cache/conftool/dbconfig/20230919-112156-root.json
[11:25:41] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.020 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:27:37] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.005 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:31:25] <wikibugs>	 (03PS1) 10Muehlenhoff: conntrackd: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497)
[11:31:49] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:36:08] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042)
[11:36:28] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] conntrackd: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:37:26] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042)
[11:38:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] conntrackd: Switch to ensure_packages() [puppet] - 10https://gerrit.wikimedia.org/r/958913 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:42:22] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[11:45:38] <wikibugs>	 10SRE, 10ops-eqiad, 10Patch-For-Review, 10User-aborrero, 10cloud-services-team (FY2023/2024-Q1): cloudservices1005: move to new setup - https://phabricator.wikimedia.org/T346042 (10aborrero) I misclicked on netbox and deleted the whole device entry for cloudservices1005, meaning it is no longer registere...
[11:46:19] <wikibugs>	 (03PS4) 10Muehlenhoff: Switch cloudgw/codfw1dev to profile::firewall [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497)
[11:46:34] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] LibreNMS report: remove MODEL_EXCLUDES filter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 (owner: 10Ayounsi)
[11:46:39] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] LibreNMS report: add equivalent model strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) (owner: 10Ayounsi)
[11:46:43] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+2] LibreNMS report: use black formating [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 (owner: 10Ayounsi)
[11:47:02] <wikibugs>	 (CR) Clément Goubert: [C: +1] wmnet: Update maintenance.eqiad.wmnet to point to mwmaint2002 (1 comment) [dns] - https://gerrit.wikimedia.org/r/958472 (https://phabricator.wikimedia.org/T346474) (owner: Kamila Součková)
[11:47:08] <wikibugs>	 (03Merged) 10jenkins-bot: LibreNMS report: remove MODEL_EXCLUDES filter [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958813 (owner: 10Ayounsi)
[11:47:14] <wikibugs>	 (03Merged) 10jenkins-bot: LibreNMS report: add equivalent model strings [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958814 (https://phabricator.wikimedia.org/T331519) (owner: 10Ayounsi)
[11:47:17] <wikibugs>	 (03Merged) 10jenkins-bot: LibreNMS report: use black formating [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/958815 (owner: 10Ayounsi)
[11:48:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[11:49:11] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958905 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[11:50:54] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[11:50:54] <wikibugs>	 (03PS8) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175)
[11:51:00] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[11:51:06] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[11:55:40] <wikibugs>	 (03PS2) 10Slyngshede: IDM Switchover [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721)
[11:56:17] <wikibugs>	 (03PS4) 10Slyngshede: IDM: Deploy deb to idm1001. [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[11:58:09] <wikibugs>	 (03PS1) 10David Caro: m:openstack::clientpackages*: add patch to the list of packages [puppet] - 10https://gerrit.wikimedia.org/r/958917
[12:00:06] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1200)
[12:00:17] <wikibugs>	 (03PS9) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175)
[12:00:24] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43374/console" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro)
[12:01:42] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC looks good, a bit annoying that `present` moved to `installed` xd, but it's ok" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro)
[12:02:50] <wikibugs>	 SRE, ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (ayounsi) Resolved→Open a:Papaul→Jhancock.wm This triggered netbox report alert ganeti2014 (WMF6747)  mismatched serials: XXXXX (netbox) != YYYYY (puppetdb) https://netbox.wikimedia.org/extras/r...
[12:03:32] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@91bb4a0]: Regular analytics weekly train [analytics/refinery@91bb4a0]
[12:05:40] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[12:08:10] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[12:09:39] <wikibugs>	 (03PS1) 10Muehlenhoff: conntrackd: Add explicit check [puppet] - 10https://gerrit.wikimedia.org/r/958918 (https://phabricator.wikimedia.org/T336497)
[12:09:43] <urbanecm>	 jouncebot: nowandnext
[12:09:43] <jouncebot>	 For the next 0 hour(s) and 50 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1200)
[12:09:44] <jouncebot>	 In 0 hour(s) and 50 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1300)
[12:09:59] <wikibugs>	 (03PS2) 10Urbanecm: beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556)
[12:10:19] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: 10Urbanecm)
[12:10:25] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@91bb4a0]: Regular analytics weekly train [analytics/refinery@91bb4a0] (duration: 06m 53s)
[12:11:01] <wikibugs>	 (03Merged) 10jenkins-bot: beta: Do not reference image-suggestion-api.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954620 (https://phabricator.wikimedia.org/T345556) (owner: 10Urbanecm)
[12:12:21] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Configure ECMP hashing function on QFX5120 platform - https://phabricator.wikimedia.org/T339852 (cmooney) Open→Resolved a:cmooney Closing.  If we want to do it on EVPN/VXLAN devices we can revisit in future.
[12:12:23] <Emperor>	 !log ms-be204{5,6} swift package updates T346730
[12:12:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:12:28] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[12:13:17] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43376/console" [puppet] - 10https://gerrit.wikimedia.org/r/958907 (owner: 10Slyngshede)
[12:14:34] <Emperor>	 !log ms-be2047 swift package updates T346730
[12:14:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:15:17] <wikibugs>	 (03CR) 10Slyngshede: "Not currently a huge problem as everything in Redis is ephemeral, but it should still work." [puppet] - 10https://gerrit.wikimedia.org/r/958907 (owner: 10Slyngshede)
[12:16:17] <wikibugs>	 (03PS1) 10Kamila Součková: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330)
[12:16:20] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958918 (https://phabricator.wikimedia.org/T336497) (owner: 10Muehlenhoff)
[12:17:04] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@91bb4a0] (thin): Regular analytics weekly train THIN [analytics/refinery@91bb4a0]
[12:17:09] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@91bb4a0] (thin): Regular analytics weekly train THIN [analytics/refinery@91bb4a0] (duration: 00m 05s)
[12:17:19] <wikibugs>	 (CR) Muehlenhoff: C:idm::redis Allow replication via IPv6 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/958907 (owner: Slyngshede)
[12:17:23] <wikibugs>	 (03CR) 10Kamila Součková: [C: 04-2] "to be merged during switchover" [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[12:17:29] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@91bb4a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@91bb4a0]
[12:18:07] <wikibugs>	 (03PS2) 10Urbanecm: Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353)
[12:18:43] <Emperor>	 !log ms-be2048 swift package updates T346730
[12:18:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:18:47] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[12:18:51] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm)
[12:19:32] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@91bb4a0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@91bb4a0] (duration: 02m 03s)
[12:19:33] <wikibugs>	 (03Merged) 10jenkins-bot: Growth: Welcome survey user research: Use a generic question [mediawiki-config] - 10https://gerrit.wikimedia.org/r/952351 (https://phabricator.wikimedia.org/T342353) (owner: 10Urbanecm)
[12:20:22] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[12:20:33] <wikibugs>	 (03PS1) 10Muehlenhoff: idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921
[12:20:57] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff)
[12:21:51] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: MediaWiki - https://phabricator.wikimedia.org/T346474 (10kamila)
[12:22:20] <Emperor>	 !log ms-be20[49-59] swift package updates T346730
[12:22:23] <icinga-wm>	 RECOVERY - Router interfaces on cr1-esams is OK: OK: host 185.15.59.128, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[12:22:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:28] <wikibugs>	 (03PS1) 10BBlack: varnish mem: unify global default [puppet] - 10https://gerrit.wikimedia.org/r/958922
[12:23:36] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[12:25:34] <wikibugs>	 (03PS1) 10BBlack: varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093)
[12:26:18] <wikibugs>	 (CR) Klausman: [C: +1] Improve ML team's SLO calculations (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[12:27:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack)
[12:27:39] <wikibugs>	 (03PS2) 10Muehlenhoff: idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921
[12:28:58] <wikibugs>	 (03PS2) 10BBlack: varnish mem: unify global default [puppet] - 10https://gerrit.wikimedia.org/r/958922
[12:29:00] <wikibugs>	 (03PS2) 10BBlack: varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093)
[12:29:02] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] m:openstack::clientpackages*: add patch to the list of packages [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro)
[12:30:31] <wikibugs>	 (03PS3) 10BBlack: varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093)
[12:31:21] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack)
[12:32:58] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff)
[12:33:54] <wikibugs>	 (CR) Elukey: Improve ML team's SLO calculations (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[12:35:47] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: "Thanks for the reviews!" [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[12:37:07] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:37:22] <wikibugs>	 (03PS3) 10Muehlenhoff: idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921
[12:38:29] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:38:48] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff)
[12:40:53] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] alertmanager: create ml team alerts [puppet] - 10https://gerrit.wikimedia.org/r/958072 (https://phabricator.wikimedia.org/T346151) (owner: 10Ilias Sarantopoulos)
[12:44:18] <Emperor>	 !log ms-be20[60-73] swift package updates T346730
[12:44:19] <wikibugs>	 (03CR) 10BBlack: "PCC confirms nop: https://puppet-compiler.wmflabs.org/output/958922/43377/" [puppet] - 10https://gerrit.wikimedia.org/r/958922 (owner: 10BBlack)
[12:44:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:44:21] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[12:44:33] <wikibugs>	 (03CR) 10BBlack: "PCC confirms nop here too: https://puppet-compiler.wmflabs.org/output/958923/43379/" [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack)
[12:45:15] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] varnish mem: unify global default [puppet] - 10https://gerrit.wikimedia.org/r/958922 (owner: 10BBlack)
[12:45:21] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[12:45:23] <wikibugs>	 (03CR) 10BBlack: [C: 03+2] varnish: unify vsl_size to new default [puppet] - 10https://gerrit.wikimedia.org/r/958923 (https://phabricator.wikimedia.org/T253093) (owner: 10BBlack)
[12:53:57] <wikibugs>	 (CR) Klausman: [C: +1] Improve ML team's SLO calculations (1 comment) [grafana-grizzly] - https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: Elukey)
[12:55:47] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:57:11] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[12:57:27] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[12:58:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor I � Unicode. All rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1300).
[13:00:04] <jouncebot>	 Aca: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:16] * Aca waves
[13:00:42] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] IDM Switchover [dns] - 10https://gerrit.wikimedia.org/r/957674 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[13:01:05] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929
[13:01:15] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[13:01:43] * TheresNoTime can deploy
[13:01:49] <Aca>	 noicee
[13:02:03] <wikibugs>	 (03PS3) 10Samtar: Add namespace aliases to shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958421 (https://phabricator.wikimedia.org/T346588) (owner: 10Acamicamacaraca)
[13:02:29] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:02:38] <urbanecm>	 thanks TheresNoTime
[13:02:47] <TheresNoTime>	 ideally we'd get a +1 on that first, if you have a sec urbanecm?
[13:03:20] <urbanecm>	 Aca discussed the idea with me first, and i don't see a reason why not (otoh, i don't see a reason why yes either :D). if you want me to review the specific patch, can do.
[13:03:55] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:04:05] * TheresNoTime will deploy
[13:04:10] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958421 (https://phabricator.wikimedia.org/T346588) (owner: 10Acamicamacaraca)
[13:04:20] <urbanecm>	 sounds good :)
[13:05:01] <wikibugs>	 (03Merged) 10jenkins-bot: Add namespace aliases to shwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958421 (https://phabricator.wikimedia.org/T346588) (owner: 10Acamicamacaraca)
[13:05:02] <Aca>	 yeah, appropriate documentation will be created on-wiki in the project namespace to enhance the usage of the shortcuts/aliases
[13:05:09] <wikibugs>	 (03PS5) 10Slyngshede: P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[13:05:19] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.roll-restart-brokers (exit_code=0) for Kafka A:kafka-jumbo-eqiad cluster: Roll restart of jvm daemons.
[13:05:34] <logmsgbot>	 !log samtar@deploy1002 Started scap: Backport for [[gerrit:958421|Add namespace aliases to shwiki (T346588)]]
[13:05:37] <stashbot>	 T346588: Add namespace aliases to shwiki - https://phabricator.wikimedia.org/T346588
[13:06:38] <wikibugs>	 (03PS6) 10Slyngshede: P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[13:07:03] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[13:07:34] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[13:07:38] <wikibugs>	 (03PS10) 10Brouberol: Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041)
[13:07:40] <wikibugs>	 (03PS10) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041)
[13:07:42] <wikibugs>	 (03PS10) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041)
[13:07:44] <wikibugs>	 (03PS10) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041)
[13:07:46] <wikibugs>	 (03PS10) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[13:08:09] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[13:08:33] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@2d9d6d0]: Regular analytics weekly train [analytics/refinery@2d9d6d0]
[13:09:39] <TheresNoTime>	 Aca: (almost ready for testing), but remind me, this requires a run of `namespaceDupes.php` after, correct?
[13:09:41] <wikibugs>	 (03PS7) 10Slyngshede: P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721)
[13:10:10] <urbanecm>	 that is correct for the namespaceDupes q.
[13:10:14] <TheresNoTime>	 (ta)
[13:10:15] <Aca>	 I think yes.
[13:11:51] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[13:12:20] <wikibugs>	 (03PS10) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175)
[13:12:44] <TheresNoTime>	 (`K8s images build/push` step is taking longer than normal, fwiw)
[13:13:11] <urbanecm>	 Assuming it's running, nothing to worry about. 
[13:14:25] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@2d9d6d0]: Regular analytics weekly train [analytics/refinery@2d9d6d0] (duration: 05m 52s)
[13:14:32] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] P:IDM: Failover Redis [puppet] - 10https://gerrit.wikimedia.org/r/957676 (https://phabricator.wikimedia.org/T340721) (owner: 10Slyngshede)
[13:14:41] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@2d9d6d0] (thin): Regular analytics weekly train THIN [analytics/refinery@2d9d6d0]
[13:14:45] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@2d9d6d0] (thin): Regular analytics weekly train THIN [analytics/refinery@2d9d6d0] (duration: 00m 04s)
[13:15:01] <logmsgbot>	 !log jebe@deploy1002 Started deploy [analytics/refinery@2d9d6d0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2d9d6d0]
[13:15:20] <Emperor>	 !log ms-be10[44-60] swift package updates T346730
[13:15:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:26] <stashbot>	 T346730: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730
[13:16:22] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[13:17:08] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [analytics/refinery@2d9d6d0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2d9d6d0] (duration: 02m 06s)
[13:17:22] <TheresNoTime>	 urbanecm: looking at the `scap-image-build-and-push-log`, it's the push that's taking a long time.. though iirc that's happened before, so not too concerned
[13:17:37] <wikibugs>	 (03CR) 10Elukey: Improve ML team's SLO calculations (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[13:17:49] <TheresNoTime>	 (as I type that it finishes, of course)
[13:18:05] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch netboxdb to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/958894 (owner: 10Muehlenhoff)
[13:19:47] <icinga-wm>	 PROBLEM - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:20:54] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb/microsites: remove static-codereview resources [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309) (owner: 10Jelto)
[13:21:04] <wikibugs>	 (03PS2) 10Jelto: miscweb/microsites: remove static-codereview resources [puppet] - 10https://gerrit.wikimedia.org/r/958475 (https://phabricator.wikimedia.org/T346309)
[13:21:10] <icinga-wm>	 ACKNOWLEDGEMENT - Check systemd state on idm1001 is CRITICAL: CRITICAL - degraded: The following units failed: rq-bitu.service Slyngshede Switch-over, waiting for reimage. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:22:28] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff)
[13:24:22] <wikibugs>	 (03PS1) 10Btullis: Block any open angle brackets in Archiva mirrored URLs [puppet] - 10https://gerrit.wikimedia.org/r/958930 (https://phabricator.wikimedia.org/T318962)
[13:24:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10Gehel) Approved for all the elastic and wdqs / wcqs access
[13:24:52] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] idm::redis: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958921 (owner: 10Muehlenhoff)
[13:24:57] <wikibugs>	 (03PS1) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931
[13:25:12] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:25:56] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:26:27] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:26:52] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:27:12] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43382/console" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro)
[13:27:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host flerovium.eqiad.wmnet
[13:28:10] <logmsgbot>	 !log samtar@deploy1002 samtar and aleksandar: Backport for [[gerrit:958421|Add namespace aliases to shwiki (T346588)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[13:28:14] <stashbot>	 T346588: Add namespace aliases to shwiki - https://phabricator.wikimedia.org/T346588
[13:28:18] <TheresNoTime>	 Aca: can you test? ^
[13:28:25] <Aca>	 checking it now
[13:29:04] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1011.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957919 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:29:16] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (CORE_DIFF 26): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43381/console" [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[13:29:29] <wikibugs>	 (03CR) 10Btullis: Configure kafka-jumbo1015.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:29:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Metrics, 10netops: cr*-eqsin long poll times from librenms - https://phabricator.wikimedia.org/T346606 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi Opened {T346759} for followups, this is done
[13:30:05] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43384/console" [puppet] - 10https://gerrit.wikimedia.org/r/958917 (owner: 10David Caro)
[13:32:32] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43385/console" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[13:33:08] <wikibugs>	 (03CR) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:33:19] <wikibugs>	 (03CR) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:33:27] <wikibugs>	 (03PS2) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931
[13:33:40] <Aca>	 TheresNoTime: Looks good to me. Both Cyrillic and Latin aliases seem to work in the search box.
[13:33:47] <TheresNoTime>	 ack
[13:33:50] <logmsgbot>	 !log samtar@deploy1002 samtar and aleksandar: Continuing with sync
[13:34:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host flerovium.eqiad.wmnet
[13:34:25] <_joe_>	 jouncebot: now
[13:34:25] <jouncebot>	 For the next 0 hour(s) and 25 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1300)
[13:35:16] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] profile::service_proxy::envoy: rename uses_ingress to sets_sni (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/956379 (https://phabricator.wikimedia.org/T346638) (owner: 10Elukey)
[13:37:01] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) @ayounsi I should be able to update that on the server. but I will need to reboot it to apply changes.
[13:37:22] <wikibugs>	 (03Abandoned) 10Ssingh: Remove most knams references/comments [dns] - 10https://gerrit.wikimedia.org/r/953681 (owner: 10BCornwall)
[13:37:28] <wikibugs>	 (03PS3) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931
[13:39:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:40:10] <wikibugs>	 (03PS11) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[13:40:25] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:41:23] <wikibugs>	 (03PS11) 10Brouberol: Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041)
[13:41:29] <wikibugs>	 (03CR) 10Brouberol: [V: 03+2] Configure kafka-jumbo1012.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957920 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:42:17] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43386/console" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[13:42:28] <wikibugs>	 (03PS11) 10Brouberol: Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041)
[13:42:30] <TheresNoTime>	 This window may overrun, deployment is being very slow at the moment
[13:42:52] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: fix email_confgs for ml team [puppet] - 10https://gerrit.wikimedia.org/r/958934
[13:43:21] <wikibugs>	 (03PS2) 10Filippo Giunchedi: alertmanager: fix email_confgs for ml team [puppet] - 10https://gerrit.wikimedia.org/r/958934
[13:44:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:44:10] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1013.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957921 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:44:48] * TheresNoTime has failures
[13:44:58] <wikibugs>	 (03PS11) 10Brouberol: Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041)
[13:45:23] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:45:25] <icinga-wm>	 PROBLEM - Check systemd state on ms-be1057 is CRITICAL: CRITICAL - degraded: The following units failed: swift-account.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:38] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] "Sorry :(" [puppet] - 10https://gerrit.wikimedia.org/r/958934 (owner: 10Filippo Giunchedi)
[13:45:39] <taavi>	 TheresNoTime: with which hosts?
[13:45:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:45:50] <TheresNoTime>	 taavi: https://phabricator.wikimedia.org/P52532
[13:46:25] <logmsgbot>	 !log jebe@deploy1002 Started deploy [airflow-dags/analytics@6b9855a]: (no justification provided)
[13:46:37] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.decommission for hosts an-test-client1001.eqiad.wmnet
[13:46:49] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1014.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957922 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:47:05] <wikibugs>	 (03PS4) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931
[13:47:08] <logmsgbot>	 !log jebe@deploy1002 Finished deploy [airflow-dags/analytics@6b9855a]: (no justification provided) (duration: 00m 43s)
[13:47:29] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: kubernetes: default partman recipe for nodes [puppet] - 10https://gerrit.wikimedia.org/r/958463
[13:47:31] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: wikikube: put the new codfw nodes in production [puppet] - 10https://gerrit.wikimedia.org/r/958487 (https://phabricator.wikimedia.org/T345709)
[13:47:33] <wikibugs>	 (03PS3) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958488 (https://phabricator.wikimedia.org/T345709)
[13:47:35] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: kubernetes: add kubernetes10[27-56] to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/958810 (https://phabricator.wikimedia.org/T346714)
[13:47:37] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: conftool: add new k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/958811 (https://phabricator.wikimedia.org/T346714)
[13:47:49] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[13:47:57] <TheresNoTime>	 Aca: FYI, I believe the failure caused a rollback
[13:47:59] <taavi>	 claime: / serviceops: ^ TNT's error above seems to be with k8s capacity in codfw
[13:48:07] <taavi>	 or _joe_ maybe?
[13:48:10] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:48:15] <icinga-wm>	 RECOVERY - Check systemd state on ms-be1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:31] <claime>	 taavi: checking
[13:48:39] <_joe_>	 claime: are you looking, ack
[13:49:04] <Aca>	 hmm
[13:49:10] <TheresNoTime>	 Aca & taavi: I have a meeting on the hour, so will not be able to re-deploy if so. Are you (taavi) able to, if needed?
[13:49:18] <wikibugs>	 (03PS12) 10Brouberol: Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041)
[13:49:21] <taavi>	 this is what I see in kubectl describe pod:   Warning  FailedScheduling  60s                default-scheduler  0/24 nodes are available: 18 Insufficient cpu, 2 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 4 node(s) had taint {dedicated: kask}, that the pod didn't tolerate.
[13:49:32] <taavi>	 mw-api-int in codfw
[13:49:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:49:43] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (CORE_DIFF 6): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43388/console" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[13:50:16] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Configure kafka-jumbo1015.eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957923 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[13:50:17] <_joe_>	 taavi: ack, thanks
[13:50:35] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating DNS record of kuberbetes2026 - jhancock@cumin2002"
[13:50:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[13:50:55] <claime>	 taavi: thanks, it's the usual resource issue...
[13:50:58] <Aca>	 TheresNoTime Not a big deal. I can reschedule, if needed.
[13:51:14] <_joe_>	 it will be resolved for good on thursday
[13:51:49] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.dns.netbox
[13:51:52] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[13:52:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Updating DNS record of kuberbetes2026 - jhancock@cumin2002"
[13:52:01] <logmsgbot>	 !log jhancock@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[13:52:05] <logmsgbot>	 !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[13:52:33] <elukey>	 !log clean old puppet certs kafka_logging-eqiad_broker
[13:52:34] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10MoritzMuehlenhoff) This is just a serial, does this really need a reboot of the OS? (We can arrange for that, but the server would need to be drained first)
[13:52:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:52:34] <claime>	 yeah, in the meantime I'll do the same as 0ffa20bb0b1f712388a7a2945f0b291c8e4b7449
[13:52:35] <elukey>	 uff
[13:52:40] * elukey amends sal
[13:52:45] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:53:10] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mcrouter: Specify missing CXXFLAGS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999)
[13:53:29] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:53:30] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts an-test-client1001.eqiad.wmnet
[13:53:31] <stashbot>	 stevemunene@cumin1001: Failed to log message to wiki. Somebody should check the error logs.
[13:53:37] <logmsgbot>	 !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:54:00] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+2] Remove mention of an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/957862 (https://phabricator.wikimedia.org/T329363) (owner: 10Stevemunene)
[13:54:37] <wikibugs>	 (03PS1) 10Clément Goubert: Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424
[13:54:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert)
[13:55:08] <wikibugs>	 (03PS1) 10Slyngshede: C:idm::redis bind to both IPv4 and IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/958936
[13:55:15] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack.phabricator: Don't fail when logging to a restricted task - https://phabricator.wikimedia.org/T335879 (10Volans) Sure, but wmflib is a general purpose library and shouldn't make that assumption. So I'd rather do that via a parameter so that th...
[13:55:21] <TheresNoTime>	 Okay, scap is just doing the php-fpm restarts and then that deployment rollback is done
[13:56:09] <wikibugs>	 10SRE-swift-storage: Install new swift packages to ms swift clusters (KR) - https://phabricator.wikimedia.org/T346730 (10MatthewVernon)
[13:56:19] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Update refinery-job jar version for analytics jobs [puppet] - 10https://gerrit.wikimedia.org/r/955893 (https://phabricator.wikimedia.org/T344616) (owner: 10Joal)
[13:56:25] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43390/console" [puppet] - 10https://gerrit.wikimedia.org/r/958936 (owner: 10Slyngshede)
[13:56:57] <claime>	 TheresNoTime: Is it rolling back the whole backport? We can schedule it again after the traffic/services switchover
[13:57:10] <icinga-wm>	 PROBLEM - Check systemd state on kubestagemaster2002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:57:19] <_joe_>	 claime: AIUI it should not
[13:57:25] <logmsgbot>	 !log samtar@deploy1002 Finished scap: Backport for [[gerrit:958421|Add namespace aliases to shwiki (T346588)]] (duration: 51m 50s)
[13:57:28] <stashbot>	 T346588: Add namespace aliases to shwiki - https://phabricator.wikimedia.org/T346588
[13:57:30] <claime>	 Yeah I thought it'd just rollback k8s
[13:57:35] <_joe_>	 TheresNoTime: did it rollback everything?
[13:57:36] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[13:57:40] <TheresNoTime>	 ah, no, just k8s
[13:57:47] <TheresNoTime>	 Aca: should be live now then
[13:58:09] <_joe_>	 claime: we can do a k8s only deployment manually I guess?
[13:58:18] <claime>	 _joe_: yeah we can
[13:58:28] <wikibugs>	 (03PS1) 10Brouberol: Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041)
[13:58:38] <Aca>	 TheresNoTime Oh, okie. Thank you for handling all this!
[13:58:53] <TheresNoTime>	 !log `[samtar@mwmaint1002 ~]$ mwscript namespaceDupes.php --wiki shwiki --fix` T346588
[13:58:56] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:59:25] * TheresNoTime away!
[13:59:44] <icinga-wm>	 PROBLEM - Check systemd state on kafka-jumbo1015 is CRITICAL: CRITICAL - degraded: The following units failed: kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:59:46] <_joe_>	 claime: I can handle this if you need to follow the switchover
[14:00:04] <jouncebot>	 kamila_: Your horoscope predicts another unfortunate Datacenter switchover: Services + Traffic deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1400).
[14:00:12] <claime>	 _joe_: appreciated, yeah, thanks
[14:00:21] <wikibugs>	 10SRE, 10ops-codfw: ganeti2014: broken RAM / mainboard - https://phabricator.wikimedia.org/T341546 (10Jhancock.wm) yes, it's a bios setting. so it would require a reboot to apply. I should have caught that when I was fixing it the first time around so that's my bad.
[14:00:25] <claime>	 jouncebot being very ominous
[14:00:29] <kamila_>	 XD
[14:00:42] <_joe_>	 TheresNoTime: the window is over, I'm gonna sync k8s by hand
[14:00:52] <TheresNoTime>	 _joe_: ack :)
[14:00:55] <logmsgbot>	 !log kamila@deploy1002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330
[14:00:55] <claime>	 ok so just a reminder, we're coordinating on -sre, I'll be monitoring -operations for issues
[14:01:03] <stashbot>	 T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330
[14:01:12] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[14:01:17] <Aca>	 TheresNoTime Do let me know if there are pages to fix
[14:01:19] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Services - T346330
[14:01:31] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Service...
[14:03:06] <wikibugs>	 (03PS2) 10Brouberol: Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041)
[14:03:39] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert)
[14:05:06] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster2002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[14:05:52] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:05:55] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T346708 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact
[14:07:25] <icinga-wm>	 PROBLEM - Kafka Broker Server #page on kafka-jumbo1015 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[14:07:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:08:08] <akosiaris>	 kafka-jumbo1015 is apparently a new host, nothing to do with the switchover
[14:08:15] <_joe_>	 yeah
[14:08:43] <jelto>	 I'll ack the alert
[14:08:53] <akosiaris>	 thanks
[14:09:34] <_joe_>	 is jenkins broken?
[14:10:13] <wikibugs>	 (03PS11) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175)
[14:10:19] <_joe_>	 yeah this job is clearly broken https://integration.wikimedia.org/ci/job/helm-lint/12948/console
[14:10:38] <akosiaris>	 _joe_: I see multiple jobs proceeding, so it doesn't seem to be something CI-infra-wide
[14:11:06] <_joe_>	 I'm gonna be bold and add a V+2 to the change
[14:11:08] <hashar>	 is it taking too long linting things? :/
[14:11:20] <icinga-wm>	 PROBLEM - Kafka broker TLS certificate validity on kafka-jumbo1015 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:11:21] <_joe_>	 hashar: that job should usually finish in 1 minute
[14:11:37] <akosiaris>	 7 already? interesting
[14:11:42] <hashar>	 jenkins+ 3515836  0.0  0.1 1495712 42068 ?       Sl   14:03   0:00  |           \_ docker run --entrypoint=/usr/bin/find --user=nobody --volume /srv/jenkins/workspace/helm-lint:/workspace --security-opt seccomp=unconfined --init --rm --la
[14:11:42] <hashar>	 jenkins+ 3515838  0.0  0.0      0     0 ?        Z    14:03   0:00  |               \_ [bash] <defunct>
[14:11:56] <hashar>	 from integration-agent-docker-1042.integration.eqiad1.wikimedia.cloud
[14:12:01] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert)
[14:12:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:12:48] <icinga-wm>	 RECOVERY - Check systemd state on kafka-jumbo1015 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:13:06] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Revert "mediawiki: Reduce requests for canaries"" [deployment-charts] - 10https://gerrit.wikimedia.org/r/958424 (owner: 10Clément Goubert)
[14:13:15] <icinga-wm>	 RECOVERY - Kafka Broker Server #page on kafka-jumbo1015 is OK: PROCS OK: 1 process with command name java, args Kafka /etc/kafka/server.properties https://wikitech.wikimedia.org/wiki/Kafka/Administration
[14:13:20] <_joe_>	 and indeed, the job worked immediately
[14:13:32] <hashar>	 and the other got unlocked somehow
[14:13:46] <icinga-wm>	 RECOVERY - Kafka broker TLS certificate validity on kafka-jumbo1015 is OK: SSL OK - Certificate kafka-jumbo1015.eqiad.wmnet valid until 2024-09-18 13:48:00 +0000 (expires in 364 days) https://wikitech.wikimedia.org/wiki/Kafka/Administration%23Renew_TLS_certificate
[14:14:19] <hashar>	 looks like it has spent 8 minutes trying to spin up the container
[14:15:20] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/958936 (owner: 10Slyngshede)
[14:16:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:16:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10VRiley-WMF) cloudtastic1007 A 2. U 26. port 17 CableID 5245 cloudtastic1008 B 2. U 25. port 36 CableID 5006 cloudtastic1009 C 2. U 27. por...
[14:17:03] <jinxer-wm>	 (ProbeDown) firing: (3) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:17:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:19:35] <wikibugs>	 (03PS1) 10Jclark-ctr: add pki1002 to T342892 [puppet] - 10https://gerrit.wikimedia.org/r/958943 (https://phabricator.wikimedia.org/T342892)
[14:20:23] <logmsgbot>	 !log kamila@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 (duration: 19m 27s)
[14:20:23] <logmsgbot>	 !log oblivian@deploy1002 Started scap: (no justification provided)
[14:20:27] <stashbot>	 T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330
[14:20:45] <claime>	 We're releasing the scap lock for an emergency mw-on-k8s fix, please DO NOT RUN SCAP RIGHT NOW
[14:21:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:22:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:22:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:22:36] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:22:36] <claime>	 ^ expected
[14:22:42] <claime>	 (PHPFPMTooBusy)
[14:23:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:25:07] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:25:35] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: (no justification provided) (duration: 05m 44s)
[14:26:26] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:26:32] <wikibugs>	 (03CR) 10Fabfur: "Confirm that with the latest CR tests now are all fine (removed api.w.o from regex  and tests as is managed by misc)" [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[14:26:50] <wikibugs>	 (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[14:27:03] <jinxer-wm>	 (ProbeDown) firing: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:27:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:28:19] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=thumbor
[14:28:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:28:47] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all services in eqiad: Datacenter Switchover: Services - T346330
[14:28:51] <stashbot>	 T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330
[14:28:57] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10ops-monitoring-bot) kamila@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all services in eqiad: Datacenter Switchover: Service...
[14:29:51] <wikibugs>	 (03CR) 10Fabfur: varnish: add more domains for mobile redirect (*.wikimedia.org) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[14:30:32] <logmsgbot>	 !log kamila@deploy1002 Locking from deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330
[14:31:05] <wikibugs>	 (03PS1) 10Anzx: add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043)
[14:31:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag  - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[14:31:45] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx)
[14:32:03] <kamila_>	 !log Switch deployment server - T346330
[14:32:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:32:50] <wikibugs>	 (03PS4) 10Kamila Součková: wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330)
[14:33:02] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[14:33:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (main) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=main&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:33:11] <wikibugs>	 (03CR) 10Kamila Součková: [V: 03+2 C: 03+2] wmnet: switch deployment CNAMEs to codfw [dns] - 10https://gerrit.wikimedia.org/r/957734 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[14:33:25] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift
[14:33:26] <wikibugs>	 (03PS3) 10Kamila Součková: Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330)
[14:33:28] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - thumbor_8800: Servers kubernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:33:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[14:33:54] <logmsgbot>	 !log cgoubert@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro
[14:34:45] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] Switch deployment server to deploy2002.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/957736 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[14:35:22] <wikibugs>	 (03PS2) 10Anzx: add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043)
[14:36:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043) (owner: 10Anzx)
[14:36:16] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:36:21] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-rw,name=eqiad
[14:36:27] <logmsgbot>	 !log oblivian@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-rw,name=codfw
[14:36:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:37:03] <jinxer-wm>	 (ProbeDown) resolved: (4) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip6)   - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:37:18] <wikibugs>	 (03PS3) 10Anzx: add throttle rule for UIUC Wikipedia edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/958946 (https://phabricator.wikimedia.org/T346043)
[14:37:58] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:38:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:38:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:38:45] <jinxer-wm>	 (Traffic bill over quota) firing: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[14:39:15] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[14:39:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:39:42] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:40:07] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:40:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:41:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:41:33] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] varnish: add more domains for mobile redirect (*.wikimedia.org) [puppet] - 10https://gerrit.wikimedia.org/r/957292 (https://phabricator.wikimedia.org/T344175) (owner: 10Fabfur)
[14:42:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[14:42:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:42:46] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: send_tile_invalidations.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:43:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:44:08] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:44:22] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:45:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1026-56} - jclark@cumin1001"
[14:46:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:46:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbproxy1026-56} - jclark@cumin1001"
[14:46:17] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:46:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:46:55] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "This looks pretty good, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi)
[14:46:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:47:29] <wikibugs>	 (03CR) 10Muehlenhoff: mcrouter: Specify missing CXXFLAGS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999)
[14:47:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:10] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:22] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:48:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1007
[14:49:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:49:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:49:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:49:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1007
[14:49:49] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1008
[14:50:00] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1009
[14:50:07] <kamila_>	 !nowandnext
[14:50:21] <taavi>	 jouncebot: nowandnext
[14:50:22] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): Datacenter switchover: Services + Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1400)
[14:50:22] <jouncebot>	 In 1 hour(s) and 9 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1600)
[14:50:30] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:50:40] <moritzm>	 !log installing python-werkzeug security updates
[14:50:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:13] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1008
[14:51:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:51:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1009
[14:51:32] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr)
[14:51:36] <kamila_>	 thanks taavi
[14:51:36] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010
[14:51:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:51:47] <kamila_>	 (I will remember one day XD)
[14:52:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:52:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:52:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010
[14:53:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:53:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:53:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:53:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:54:12] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:54:52] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:54:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:55:02] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Harmonize thumbor's eqiad/codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958953
[14:55:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[14:55:14] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:18] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1008.mgmt.eqiad.wmnet with reboot policy FORCED
[14:56:33] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1009.mgmt.eqiad.wmnet with reboot policy FORCED
[14:56:34] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED
[14:56:38] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED
[14:56:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:52] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1027 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:56:57] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM (waiting for the switchover to finish)" [puppet] - 10https://gerrit.wikimedia.org/r/958943 (https://phabricator.wikimedia.org/T342892) (owner: 10Jclark-ctr)
[14:57:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:58:45] <jinxer-wm>	 (Traffic bill over quota) resolved: Alert for device cr2-drmrs.wikimedia.org - Traffic bill over quota got acknowledged   - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota
[14:58:59] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Harmonize thumbor's eqiad/codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958953 (owner: 10Alexandros Kosiaris)
[14:59:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:44] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[14:59:48] <wikibugs>	 (03Merged) 10jenkins-bot: Harmonize thumbor's eqiad/codfw replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958953 (owner: 10Alexandros Kosiaris)
[14:59:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:00:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:00:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:00:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:02:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:02:20] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:02:23] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:02:27] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[15:02:47] <akosiaris>	 !log increase thumbor's pods in codfw to 48 to harmonize with eqiad
[15:02:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:03:37] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) So this system only supported UEFI mode, which we've not supported installing within WMF.  If I can recall correctly, a few years ago we had a...
[15:03:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1027 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:03:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:04:06] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:04:18] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10RobH) IRC update: Chatted with Moritz in IRC and we're nowhere near supporting UEFI mode anytime in the near to mid term.  We should likely return these.
[15:04:26] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:05:08] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:05:18] <logmsgbot>	 !log kamila@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: Datacenter Switchover: Services & Traffic - T346330 (duration: 34m 46s)
[15:05:23] <stashbot>	 T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330
[15:05:32] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:06:00] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: (no justification provided)
[15:06:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:06:09] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED
[15:06:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:06:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:06:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:06:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:07:28] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover, 10Patch-For-Review: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[15:07:47] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: "LGTM. Thanks for working on this!" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[15:08:00] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:08:55] <wikibugs>	 (03PS2) 10Kamila Součková: traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330)
[15:09:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:09:50] <_joe_>	 uhhh now what
[15:10:04] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:10:46] <_joe_>	 claime: it looks like mw-web can't sustain all 5% of traffic in a single dc
[15:10:55] <claime>	 Apparently yeah
[15:11:00] <icinga-wm>	 RECOVERY - Check systemd state on kubestagemaster2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:11:25] <claime>	 I'm sorry I'm balancing with scap rn
[15:11:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:11:48] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:08] <claime>	 _joe_: It's only the canaries though
[15:12:19] <claime>	 Are they getting a bigger portion of traffic than they should?
[15:12:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:12:35] <_joe_>	 claime: we're at 250 rps
[15:12:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:13:01] <claime>	 _joe_: canaries are getting 40rps
[15:13:06] <claime>	 _joe_: it's 2 replicas, out of 14
[15:13:14] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:13:15] <_joe_>	 yeah that's a bit too much I'd say :)
[15:13:54] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:14:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:14:18] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:14:20] <wikibugs>	 (03CR) 10TK-999: mcrouter: Specify missing CXXFLAGS (031 comment) [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584 (owner: 10TK-999)
[15:16:02] <wikibugs>	 (03CR) 10AOkoth: ats: add ticket-test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/957748 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[15:16:30] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:17:56] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:18:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1009.mgmt.eqiad.wmnet with reboot policy FORCED
[15:18:12] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1008.mgmt.eqiad.wmnet with reboot policy FORCED
[15:19:26] <wikibugs>	 (03PS2) 10Dr0ptp4kt: dr0ptp4kt WDQS, Search, Analytics access [puppet] - 10https://gerrit.wikimedia.org/r/958568 (https://phabricator.wikimedia.org/T346694)
[15:20:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:21:48] <wikibugs>	 (03PS1) 10Clément Goubert: mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330)
[15:22:35] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) (owner: 10Clément Goubert)
[15:22:41] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) (owner: 10Clément Goubert)
[15:24:47] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-on-k8s: Lower traffic to 3% [puppet] - 10https://gerrit.wikimedia.org/r/958955 (https://phabricator.wikimedia.org/T346330) (owner: 10Clément Goubert)
[15:25:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[15:25:19] <brennen>	 eoghan, jnuche: a half-hour downtime for all the phabricator.wikimedia.org stuff should be sufficient
[15:25:38] <brennen>	 also, just to confirm, current deploy server is deploy2002.eqiad.wmnet, yeah?
[15:25:38] <eoghan>	 Ok! I'll downtime that now if you're happy to go?
[15:25:50] <claime>	 !log reduce mw-on-k8s traffic to 3% waiting on new nodes - T346330
[15:25:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:55] <stashbot>	 T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330
[15:25:58] <claime>	 brennen: yes
[15:26:22] <eoghan>	 brennen: deploy2002.codfw.wmnet, not eqiad
[15:26:28] <claime>	 !log running puppet on 'A:cp-text and P{P:trafficserver::backend}' - T346330
[15:26:29] <brennen>	 er, yeah
[15:26:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:26:38] <claime>	 yeah sorry only looked at the number lol
[15:26:44] <brennen>	 claime: will i be stepping on your toes if we do a brief phab/phorge update now?
[15:27:00] <claime>	 brennen: I'd rather you wait a tad
[15:27:09] <_joe_>	 claime: lmk when scap finished
[15:27:12] <brennen>	 claime: ack.  we can push this one.  not an urgent update.
[15:27:54] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1007.mgmt.eqiad.wmnet with reboot policy FORCED
[15:28:16] <jnuche>	 brennen did you run the submodule update on deploy2002? I'm still seeing the old commits
[15:28:53] <claime>	 _joe_: It's deploying canaries, eqiad went fine but codfw is taking some time, afraid we're gonna run into the same capacity issue
[15:29:05] <_joe_>	 sigh
[15:29:59] <_joe_>	 claime: so a deployment won't work right now
[15:30:03] <_joe_>	 we need to solve this
[15:30:07] <claime>	 probably not
[15:30:20] <_joe_>	 let's reduce the number of replicas?
[15:30:26] <_joe_>	 would that help a bit?
[15:30:45] <wikibugs>	 (03PS5) 10David Caro: openstack::util::patch: add define [puppet] - 10https://gerrit.wikimedia.org/r/958931
[15:30:46] <claime>	 Since we reduced traffic, we can reduce replicas, it would help
[15:30:47] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloudservices1005: prepare for reimage and back into service [puppet] - 10https://gerrit.wikimedia.org/r/958915 (https://phabricator.wikimedia.org/T346042)
[15:30:54] <wikibugs>	 (03CR) 10David Caro: openstack::util::patch: add define (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[15:31:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10MGerlach)
[15:31:11] <_joe_>	 I'm not sure I fully understand what the issue is and why it's just on canaries
[15:31:33] <claime>	 _joe_: because only 2 replicas and rolling restarts
[15:31:39] <_joe_>	 jayme: can you please help with finding the solution to that problem? I have meetings
[15:31:58] <_joe_>	 claime: so going to, say, 4 replicas there and reducing the main pool by 2 would help?
[15:32:16] <akosiaris>	 no
[15:32:50] <jayme>	 we could change the rollingupdate strategy for that deployment I guess
[15:33:00] <akosiaris>	 the issue is 25% of the new replicaset needs to be spun up and deemed healthy before 25% of the previous replicaset is killed
[15:33:00] <jinxer-wm>	 (HelmReleaseBadStatus) firing: (4) Helm release mw-api-ext/canary on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:33:11] <akosiaris>	 and from there doing it in 25% steps
[15:33:21] <akosiaris>	 so increasing the pods of canary would make it worse 
[15:33:35] <akosiaris>	 if it is a no node found to satisfy the requests thing
[15:33:48] <akosiaris>	 if it is a quota thing... why do we have quotas for mediawiki ? 
[15:33:53] <akosiaris>	 it's like THE ONE APP we got
[15:34:19] <claime>	 It's no node found for scheduling
[15:34:23] <claime>	 It's not quotas I don't think
[15:34:36] <akosiaris>	 ok, so yeah increasing pod # wouldn't help
[15:34:39] <claime>	 FailedScheduling is available resources
[15:34:41] <jayme>	 ah, so it's overall capacity
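A minimal sketch of the rolling-update arithmetic akosiaris describes above, assuming the canary Deployment uses the Kubernetes default strategy (maxSurge 25%, rounded up; maxUnavailable 25%, rounded down). The log does not show the actual strategy settings, and the function name `rolling_update_bounds` is hypothetical.

```python
def rolling_update_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Pod-count bounds during a Deployment rolling update.

    Assumes Kubernetes' documented rounding: the surge percentage is
    rounded up, the unavailable percentage is rounded down.
    """
    # Integer ceiling division avoids floating-point surprises.
    surge = -(-replicas * max_surge_pct // 100)
    unavailable = replicas * max_unavailable_pct // 100
    return {
        "surge": surge,                            # extra pods that may be created
        "unavailable": unavailable,                # old pods that may be down at once
        "peak_pods": replicas + surge,             # most pods scheduled at one time
        "min_available": replicas - unavailable,   # pods that must stay ready
    }


if __name__ == "__main__":
    for n in (2, 4):
        print(n, rolling_update_bounds(n))
```

Under these assumptions a 2-replica release must schedule one extra pod before any old pod can be removed, so the rollout stalls when no node has room for it. A 4-replica release peaks at 5 pods, which asks the scheduler for more capacity rather than less.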
[15:34:58] <akosiaris>	 the easy answer for this week is the thumbor trick
[15:35:13] <jayme>	 or how the cluster is currently packed (density)
[15:35:45] <akosiaris>	 but I don't see why this impacts just canaries
[15:35:51] <akosiaris>	 it should impact all mw deployments
[15:36:00] <akosiaris>	 and the ones with more pods should be impacted more
[15:36:00] <jayme>	 just by chance, no?
[15:36:12] <akosiaris>	 is it just that canaries are listed first as a release?
[15:36:18] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubestagemaster2002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:36:21] <akosiaris>	 or just chance as jayme says?
[15:36:30] <jayme>	 could we disable the canary release until thursday?
[15:36:54] <claime>	 akosiaris: jayme: https://phabricator.wikimedia.org/P52534
[15:37:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:37:32] <claime>	 *growls*
[15:37:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10MGerlach) Hi, @AKhatun_WMF is a returning staff member working with us (Research) as a contractor on a new [[ https://meta.wikimedia.org/wiki/Research:Improving_multili...
[15:38:06] <claime>	 I can't do anything about it right now, scap is rolling back, and again encountering the same deployment issue because there are just no resources
[15:38:56] <claime>	 We need to scale back the main releases, deploy them *first*, then deploy the canaries
[15:39:04] <_joe_>	 yes
[15:39:06] <akosiaris>	 ok, so we got a couple of options
[15:39:09] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting shell access, deployment and analytics-privatedata-users rights for acooper - https://phabricator.wikimedia.org/T345877 (10thcipriani) >>! In T345877#9174389, @Vgutierrez wrote: > Thanks!, still blocked on @thcipriani for deployment group membershi...
[15:39:34] <_joe_>	 btw we have all the time until the next backport window
[15:39:49] <akosiaris>	 * scale back main
[15:39:49] <akosiaris>	 * the "trick the scheduler" trick
[15:39:49] <akosiaris>	 * undeploy canaries
[15:39:58] <claime>	 Undeploy is not an option
[15:40:03] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:40:03] <claime>	 scap relies on canary releases working
[15:40:10] <akosiaris>	 ok, scratching it out. 
[15:40:21] <_joe_>	 I would suggest not to scale back mw-api-int though
[15:40:24] <wikibugs>	 (03PS1) 10Muehlenhoff: scap:ferm: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/958962
[15:40:30] <_joe_>	 it's taking all the traffic in codfw right now
[15:40:33] <claime>	 scale back main mw-api-ext and mw-web
[15:40:39] <_joe_>	 ack
[15:40:44] <_joe_>	 I vote go
[15:40:46] <wikibugs>	 (03CR) 10Vgutierrez: "access got approved :)" [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez)
[15:40:53] <claime>	 ok we've broken through helmfile failures in scap, it's doing bare metal now
[15:40:59] <claime>	 Looks to be doing all right
[15:41:37] <akosiaris>	 can we scale down any kind of other workload ? 
[15:42:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:42:19] <_joe_>	 akosiaris: not sure, probably yes
[15:42:21] <claime>	 we should honestly just de-deploy mw-jobrunner
[15:42:31] <claime>	 It's just 1+1 replicas but they're functionally useless
[15:42:34] * akosiaris looking
[15:42:37] <claime>	 Not sure how to remove them from scap rn
[15:42:45] <_joe_>	 claime: it's irrelevant and not sure how to do that right now though
[15:43:00] <jinxer-wm>	 (HelmReleaseBadStatus) resolved: (4) Helm release mw-api-ext/canary on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency  - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[15:43:02] <wikibugs>	 (03PS2) 10Vgutierrez: admin: Grant shell access to acooper [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877)
[15:43:05] <claime>	 _joe_: What is irrelevant?
[15:43:16] <jayme>	 why irrelevant? it frees space for 2 pods
[15:43:17] <_joe_>	 mw-jobrunner
[15:43:28] <_joe_>	 it's much more work than anything else to remove that
[15:43:31] <_joe_>	 it has LVS
[15:43:47] <_joe_>	 and we can reduce the replicas more
[15:43:51] <wikibugs>	 (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963
[15:43:58] <claime>	 _joe_: ^
[15:44:01] <jayme>	 ah...lvs is a good point
[15:44:20] <_joe_>	 we have elevated errors from mw-on-k8s
[15:44:23] <_joe_>	 what's going on?
[15:44:50] <_joe_>	 ah I think it's related to the scap failures
[15:45:03] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:45:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:45:22] <icinga-wm>	 RECOVERY - Check systemd state on restbase1030 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:45:23] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[15:45:25] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[15:46:02] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] admin: Grant shell access to acooper [puppet] - 10https://gerrit.wikimedia.org/r/955940 (https://phabricator.wikimedia.org/T345877) (owner: 10Vgutierrez)
[15:46:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:46:36] <jayme>	 another easy hack could be to (temporarily) remove the taint from the kask nodes
[15:46:45] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 40m 44s)
[15:46:48] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/958962 (owner: 10Muehlenhoff)
[15:46:52] <claime>	 _joe_: scap done
[15:46:54] <_joe_>	 those are ganeti nodes, we don't want mw there
[15:47:14] <_joe_>	 seriously, let's reduce the external traffic to even 1% if needed
[15:47:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (4) Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[15:47:18] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1019 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:47:20] <_joe_>	 and let's reduce the number of replicas
[15:47:25] <claime>	 I've reduced it to 3
[15:47:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 (owner: 10Clément Goubert)
[15:47:51] <jayme>	 ack
[15:48:09] <claime>	 And I'm waiting on alex to scrounge up some resources before scaling back replicas
[15:48:24] <claime>	 This high error rate though, what
[15:48:29] <_joe_>	 https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?var-datasource=codfw%20prometheus%2Fops&orgId=1&viewPanel=18 doesn't look good
[15:48:41] <_joe_>	 it happened with the deployment I think
[15:48:46] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware: decommission wdqs100[3,4].eqiad.wmnet - https://phabricator.wikimedia.org/T346699 (10Gehel)
[15:48:46] <claime>	 yeah
[15:48:49] <_joe_>	 on api
[15:48:55] <_joe_>	 let's look at logstash
[15:48:57] <claime>	 I'm gonna do the scale back anyways, akosiaris 
[15:49:05] <_joe_>	 I think it's a version mismatch of some kind
[15:49:09] <claime>	 'cause it could be the version mismatch
[15:49:11] <claime>	 heh
[15:49:22] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 (owner: 10Clément Goubert)
[15:49:49] <claime>	 Give me a few minutes to square up the mw-on-k8s releases
[15:50:04] <jayme>	 "DBQueryError: Error 1146: Table 'wikidatawiki.revision_comment_temp' doesn't exist" 
[15:50:08] <jayme>	 for various wikis
[15:50:10] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1019 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:50:11] <wikibugs>	 (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Scale back main to 10 replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/958963 (owner: 10Clément Goubert)
[15:50:54] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: Scale down replicas of various services [deployment-charts] - 10https://gerrit.wikimedia.org/r/958965
[15:51:06] <_joe_>	   Wikimedia\Rdbms\DBQueryError: Error 1146: Table 'wikidatawiki.revision_comment_temp' doesn't exist
[15:51:15] <jayme>	 that I said :)
[15:51:29] <jynus>	 since 15:33
[15:51:37] <_joe_>	 jynus: we're aware
[15:51:41] <_joe_>	 it's related to the scap issues
[15:51:45] <claime>	 scaling back main
[15:51:49] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[15:51:50] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[15:51:51] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[15:51:54] <akosiaris>	 https://gerrit.wikimedia.org/r/c/operations/deployment-charts/+/958965
[15:52:00] <jayme>	 is that something created by scap?
[15:52:05] <akosiaris>	 I'll deploy the big ones immediately
[15:52:05] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[15:52:06] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[15:52:07] <_joe_>	 claime: it's more important on mw-api-*
[15:52:12] <_joe_>	  to redeploy
[15:52:18] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:52:18] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[15:52:19] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply
[15:52:22] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply
[15:52:23] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply
[15:52:27] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply
[15:52:28] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[15:52:30] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] Scale down replicas of various services [deployment-charts] - 10https://gerrit.wikimedia.org/r/958965 (owner: 10Alexandros Kosiaris)
[15:52:40] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[15:52:41] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[15:52:53] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[15:52:54] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-misc: apply
[15:52:56] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-misc: apply
[15:52:57] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-misc: apply
[15:53:01] <logmsgbot>	 !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-misc: apply
[15:53:08] <claime>	 ok, main scaled down
[15:53:12] <claime>	 scapping a k8s redeploy
[15:53:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:53:38] <_joe_>	 Amir1: around?
[15:53:43] <Amir1>	 yes
[15:53:46] <_joe_>	 any idea why that error would show up?
[15:53:50] <logmsgbot>	 !log cgoubert@deploy2002 Started scap: (no justification provided)
[15:53:54] <_joe_>	 the wikidata temp table
[15:53:57] <jayme>	 DBQueryError: Error 1146: Table 'wikidatawiki.revision_comment_temp' doesn't exist
[15:54:01] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[15:54:04] <jayme>	 it's for different wikis AIUI
[15:54:11] <wikibugs>	 (03PS1) 10AOkoth: ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/958987 (https://phabricator.wikimedia.org/T340027)
[15:54:15] <Amir1>	 that is old code
[15:54:15] <jayme>	 not only wikidata
[15:54:25] <Amir1>	 somehow old code is showing up?
[15:54:25] <akosiaris>	 !log scaling down mobileapps, wikifeeds, mathoid, similar-users
[15:54:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:54:39] <akosiaris>	 deployment host having old code?
[15:54:45] <_joe_>	 Amir1: yeah I fear it's because of the deployment host
[15:54:45] <Amir1>	 or old config
[15:54:54] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[15:54:59] <_joe_>	 Amir1: can you check mediawiki-staging on deploy2002?
[15:55:05] <Amir1>	 sure
[15:55:05] <claime>	 It would have pushed that code to bare metal too wouldn't it?
[15:55:14] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/mathoid: apply
[15:55:20] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply
[15:55:22] <claime>	 Do we have those errors from bare metal as well
[15:55:24] <claime>	 ?
[15:55:30] <_joe_>	 claime: I don't think so
[15:55:39] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/similar-users: apply
[15:55:49] <claime>	 Ok, scap k8s deployment going well
[15:55:57] <claime>	 canaries are good
[15:56:01] <Amir1>	 the config in mediawiki-staging is ok
[15:56:06] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/similar-users: apply
[15:56:06] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:56:09] <_joe_>	 jayme: are errors going down?
[15:56:15] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/similar-users: apply
[15:56:15] <jayme>	 I don't see those errors for metal
[15:56:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[15:56:47] <jayme>	 _joe_: I'd say yes
[15:56:57] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/similar-users: apply
[15:57:02] <akosiaris>	 claime: a few hosts below 90% now in codfw
[15:57:03] <logmsgbot>	 !log cgoubert@deploy2002 Finished scap: (no justification provided) (duration: 03m 12s)
[15:57:04] <claime>	 ok so it was version mismatch between releases
[15:57:05] <_joe_>	 ok so I fear I know what happened
[15:57:15] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[15:57:19] <_joe_>	 the scap rollback tries to be very smart
[15:57:21] <_joe_>	 and it should not
[15:57:23] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[15:57:38] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/mathoid: apply
[15:57:41] <Amir1>	 more context T215466
[15:57:42] <stashbot>	 T215466: Remove revision_comment_temp and revision_actor_temp - https://phabricator.wikimedia.org/T215466
[15:57:46] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply
[15:57:56] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:57:57] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/958987 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[15:58:03] <wikibugs>	 (03PS2) 10Andrea Denisse: prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656)
[15:58:07] <wikibugs>	 (03CR) 10AOkoth: [V: 03+2 C: 03+2] ticket-test: add dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/958987 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[15:58:10] <jayme>	 errors gone 
[15:58:12] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply
[15:58:14] <_joe_>	 yep
[15:58:17] <_joe_>	 errors gone
[15:58:20] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply
[15:58:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse)
[15:58:37] <_joe_>	 ok, crisis averted
[15:58:38] <jynus>	 what was it?
[15:58:38] <claime>	 ok we're fine
[15:58:44] <_joe_>	 jynus: not now
[15:58:46] <jynus>	 ok
[15:58:53] <Amir1>	 and T299954
[15:58:53] <stashbot>	 T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954
[15:59:02] <jayme>	 did scap roll back a helm rollback?
[15:59:06] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/wikifeeds: apply
[15:59:11] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply
[15:59:13] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt)
[15:59:37] <Amir1>	 fwiw, the change happened months ago T299954#8930695
[15:59:47] <Amir1>	 how did we rollback to that version?
[15:59:47] <jayme>	 :-o
[15:59:50] <akosiaris>	 ok, multiple hosts now below 90% in wikikube@codfw
[16:00:05] <jouncebot>	 jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1600). Please do the needful.
[16:00:06] <jouncebot>	 Urbanecm: A patch you scheduled for Puppet request window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[16:00:09] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10dr0ptp4kt) @Gehel I added `airflow-search-admins` to the ticket description and amended the patch, after David said it might be something not need...
[16:00:17] * urbanecm waves
[16:00:18] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] traffic: Depool eqiad from user traffic for switchover [dns] - 10https://gerrit.wikimedia.org/r/958920 (https://phabricator.wikimedia.org/T346330) (owner: 10Kamila Součková)
[16:00:29] <Amir1>	 urbanecm: we have some fun issues right now
[16:00:46] <urbanecm>	 ack, can wait.
[16:01:32] <Amir1>	 it might be some bug in the code that makes it crop up again but we even removed that code a couple weeks ago https://gerrit.wikimedia.org/r/c/mediawiki/core/+/929717/
[16:01:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:01:36] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10Gehel) >>! In T346694#9179527, @dr0ptp4kt wrote: > @Gehel I added `airflow-search-admins` to the ticket description and amended the patch, after D...
[16:02:14] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: vrts1002 change global_cert_name [puppet] - 10https://gerrit.wikimedia.org/r/958565 (https://phabricator.wikimedia.org/T340027) (owner: 10AOkoth)
[16:02:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (4) Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:03:26] <wikibugs>	 (03PS3) 10Andrea Denisse: prometheus: Prevent Prometheus from scrapping certain statsd-exporters [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656)
[16:03:30] <btullis>	 We are doing some testing with flink-zk1001 in case any alerts arrive - you can ignore them.
[16:04:16] <icinga-wm>	 PROBLEM - Zookeeper Server on flink-zk1001 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[16:04:24] <btullis>	 ^expected
[16:04:33] <kamila_>	 !log DC Switchover: traffic - T346330
[16:04:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:04:39] <stashbot>	 T346330: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330
[16:06:30] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:06:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST events) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[16:08:34] <wikibugs>	 (03PS2) 10TK-999: mcrouter: Specify missing CXXFLAGS [debs/mcrouter] - 10https://gerrit.wikimedia.org/r/860584
[16:09:10] <Amir1>	 _joe_: let me know if I can help on anything. I can't find any trace of that code in deploy2002
[16:09:38] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[16:10:24] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:10:49] <jayme>	 Amir1: he's in a meeting currently and I think we're currently safe again
[16:11:28] <Amir1>	 jayme: awesome. I go to other stuff, ping me if needed
[16:11:33] <jayme>	 I'm not sure what exactly happened tbh but it seems j.oe has a theory :)
[16:11:45] <jayme>	 sure thing, thanks!
[16:12:12] <icinga-wm>	 PROBLEM - Disk space on krb1001 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=97%): /tmp 0 MB (0% inode=97%): /var/tmp 0 MB (0% inode=97%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[16:12:48] <claime>	 Looks like kerberos ate all of its kibble
[16:13:38] <claime>	 btullis: ^
[16:14:30] <icinga-wm>	 PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:14:44] <wikibugs>	 (03PS1) 10Btullis: Add the analytics and search-pltform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792)
[16:15:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:16:21] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/43397/console" [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis)
[16:16:22] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:16:40] <icinga-wm>	 RECOVERY - Zookeeper Server on flink-zk1001 is OK: PROCS OK: 1 process with command name java, args org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/zookeeper/conf/zoo.cfg https://wikitech.wikimedia.org/wiki/Zookeeper
[16:20:19] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] Add the analytics and search-pltform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis)
[16:20:23] <btullis>	 claime: Thanks. Looking now.
[16:20:27] <wikibugs>	 (03PS2) 10Ryan Kemper: Add the analytics and search-platform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis)
[16:22:17] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "deployment process: https://phabricator.wikimedia.org/P49715" [puppet] - 10https://gerrit.wikimedia.org/r/928459 (https://phabricator.wikimedia.org/T316982) (owner: 10Majavah)
[16:23:18] <wikibugs>	 10SRE, 10Data-Engineering, 10Infrastructure-Foundations: krb1001: krb5kdc.log excessive size - https://phabricator.wikimedia.org/T337906 (10BTullis) Disk usage hit 100% and I did this again: ` btullis@krb1001:~$ sudo truncate -s 10000 /var/log/kerberos/krb5kdc.log `  This was the size beforehand.  ` btullis@...
[16:23:22] <wikibugs>	 (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/958807/43396/" [puppet] - 10https://gerrit.wikimedia.org/r/958807 (https://phabricator.wikimedia.org/T346656) (owner: 10Andrea Denisse)
[16:23:39] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add the analytics and search-platform teams to flink zk contacts [puppet] - 10https://gerrit.wikimedia.org/r/958991 (https://phabricator.wikimedia.org/T341792) (owner: 10Btullis)
[16:23:58] <icinga-wm>	 PROBLEM - Check systemd state on krb1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-debian-version-textfile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:24:44] <claime>	 urbanecm: I can deploy your puppet patch now if you want, but in exchange tell me if you know why startupregistrystats-testwiki could be failing since ~1200 UTC today :p
[16:25:22] <icinga-wm>	 RECOVERY - Check systemd state on krb1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:25:40] <urbanecm>	 claime: thanks, puppet patch deploy would be helpful. We deployed wmf.27 to testwiki this morning, so maybe that?
[16:26:12] <claime>	 urbanecm: https://phabricator.wikimedia.org/P52535
[16:26:18] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] growthexperiments: Run listTaskCounts for all task types [puppet] - 10https://gerrit.wikimedia.org/r/953344 (https://phabricator.wikimedia.org/T345204) (owner: 10Urbanecm)
[16:26:23] <urbanecm>	 (wmf.27 deploy happened at ~4 UTC, so...maybe not)
[16:27:02] <urbanecm>	 okay, i blame .27 preliminarily. i can check later today. do we have a task?
[16:27:30] <claime>	 not yet, just found out, it started alerting at 1614
[16:28:33] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila)
[16:28:39] <claime>	 !log Deployed https://gerrit.wikimedia.org/r/953344 - T345204
[16:28:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:28:45] <stashbot>	 T345204: Alert the Growth team when number of available task recommendations drops significantly - https://phabricator.wikimedia.org/T345204
[16:28:52] <claime>	 Running puppet on mwmaint1002 and you'll be good
[16:29:14] <urbanecm>	 ty
[16:29:29] <claime>	 done
[16:31:33] <wikibugs>	 10SRE, 10serviceops, 10Datacenter-Switchover: Sept 2023 Switchover Checklist: Services & Traffic - https://phabricator.wikimedia.org/T346330 (10kamila) 05Open→03Resolved While there are some outstanding issues due to lack of capacity in codfw, overall we're done here :-)
[16:31:38] <wikibugs>	 10SRE, 10Data-Persistence, 10Performance-Team, 10serviceops, 10Datacenter-Switchover: September 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T345263 (10kamila)
[16:31:40] <claime>	 urbanecm: Actually it started failing even earlier
[16:31:43] <claime>	 Sep 19 04:10:30 mwmaint1002 systemd[1]: mediawiki_job_startupregistrystats-testwiki.service: Main process exited, code=exited, status=1/FAILURE
[16:31:47] <claime>	 I'm putting a task together
[16:32:11] <urbanecm>	 that's...even closer to the wmf.27 deploy
[16:32:36] <icinga-wm>	 RECOVERY - Disk space on krb1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=krb1001&var-datasource=eqiad+prometheus/ops
[16:32:41] <urbanecm>	 i'd mark that task as train blocker until it is investigated (by making it as a https://phabricator.wikimedia.org/T345888 subtask)
[16:34:48] <claime>	 https://phabricator.wikimedia.org/T346800
[16:35:39] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10thcipriani) >>! In T342535#9171415, @RLazarus wrote: > @thcipriani Sorry for the back-and-forth, but just because it isn't 100% explicit from reading t...
[16:39:54] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:41:31] <urbanecm>	 claime: is it intentional half of the paste disappeared? it looks like only left part of the stacktrace is there.
[16:41:38] <claime>	 urbanecm: ugh
[16:41:43] <claime>	 no not intentional
[16:41:46] <claime>	 tmux shenanigans
[16:42:23] <claime>	 urbanecm: should be good now
[16:42:26] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:42:26] <urbanecm>	 ty
[16:42:42] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:43:48] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:45:46] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:48:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:50:25] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics and search resources for dr0ptp4kt - https://phabricator.wikimedia.org/T346694 (10odimitrijevic) Approved
[16:51:34] <wikibugs>	 10SRE, 10Cassandra: Allow Cassandra to be deployed on Bullseye nodes - https://phabricator.wikimedia.org/T310980 (10Eevans) 05Open→03Resolved a:03Eevans AFAIK, everything this issue aimed to solve has been (we are installing Cassandra on Bullseye).  Closing.
[16:52:28] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:53:11] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate sessionstore servers to Bullseye - https://phabricator.wikimedia.org/T331714 (10Eevans) 05Open→03Resolved a:03Eevans macro-deployed
[16:55:20] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[16:55:49] <urbanecm>	 claime: as for why it happens, see https://phabricator.wikimedia.org/T346800#9179846 :).
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1700)
[17:03:21] <wikibugs>	 (03CR) 10Herron: [C: 03+1] Improve ML team's SLO calculations [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/958897 (https://phabricator.wikimedia.org/T327620) (owner: 10Elukey)
[17:03:47] <wikibugs>	 (03CR) 10Herron: [C: 03+1] o11y: complement prometheus alerting rules [alerts] - 10https://gerrit.wikimedia.org/r/958929 (owner: 10Filippo Giunchedi)
[17:04:05] <wikibugs>	 (03PS1) 10Jdlrobson: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459)
[17:09:56] <icinga-wm>	 PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga
[17:13:58] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1085 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:15:24] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:16:42] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics_privatedata_users, deployment_members for Mabualruz - https://phabricator.wikimedia.org/T342535 (10Mabualruz) I am happy to attend another training session with access so I can try to gain some hands on experience.
[17:20:21] <wikibugs>	 (03CR) 10Ejegg: Allow FundraiseUp scripts in Donatewiki CSP (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957983 (https://phabricator.wikimedia.org/T345379) (owner: 10Ejegg)
[17:23:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Aisha Khatun - https://phabricator.wikimedia.org/T346796 (10KFrancis) Hello all, I am confirming Aisha Khatun has a NDA on file.  Thank you!
[17:23:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[17:39:56] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:42:46] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[17:51:21] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Aklapper) @Ahoelzl: I apologize, my previous comments were likely confusing. (You cannot reset a password on mediawiki.org as it is a global SUL account  and thus resets wou...
[17:51:35] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] "This sent me on a quest to figure out what the difference is between the named arg --input and the positional patch file but the docs abso" [puppet] - 10https://gerrit.wikimedia.org/r/958931 (owner: 10David Caro)
[17:54:51] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[18:00:05] <jouncebot>	 brennen and jnuche: I, the Bot under the Fountain, call upon thee, The Deployer, to do MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1800).
[18:00:14] <brennen>	 o/
[18:00:18] <brennen>	 train currently blocked.
[18:06:04] <wikibugs>	 (03PS1) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[18:06:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth)
[18:08:11] <wikibugs>	 (03PS2) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[18:08:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026 (owner: 10AOkoth)
[18:09:33] <wikibugs>	 (03PS3) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[18:10:39] <wikibugs>	 (03CR) 10RobH: [C: 03+2] add pki1002 to T342892 [puppet] - 10https://gerrit.wikimedia.org/r/958943 (https://phabricator.wikimedia.org/T342892) (owner: 10Jclark-ctr)
[18:23:09] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:23:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[18:26:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:27:15] <wikibugs>	 (03PS11) 10Ebernhardson: Draft: cirrus streaming updater service [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960
[18:27:17] <wikibugs>	 (03CR) 10Ebernhardson: Draft: cirrus streaming updater service (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/951960 (owner: 10Ebernhardson)
[18:28:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:29:34] <icinga-wm>	 PROBLEM - Check systemd state on gitlab-runner1002 is CRITICAL: CRITICAL - degraded: The following units failed: docker-gc.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:30:58] <icinga-wm>	 RECOVERY - Check systemd state on gitlab-runner1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:33:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:34:40] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:36:51] <wikibugs>	 (03PS1) 10Jforrester: Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800)
[18:44:19] <wikibugs>	 (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/958968
[18:45:51] <wikibugs>	 (03CR) 10Bartosz Dziewoński: [C: 03+1] Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) (owner: 10Jforrester)
[18:47:37] <wikibugs>	 (03PS1) 10Andrew Bogott: dbproxy1018: depool clouddb1019 in favor of clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/959036 (https://phabricator.wikimedia.org/T346826)
[18:49:31] <brennen>	 James_F: thanks for revert, i'll deploy that one.
[18:49:48] <James_F>	 brennen: YW.
[18:52:07] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) (owner: 10Jforrester)
[18:52:13] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] dbproxy1018: depool clouddb1019 in favor of clouddb1015 [puppet] - 10https://gerrit.wikimedia.org/r/959036 (https://phabricator.wikimedia.org/T346826) (owner: 10Andrew Bogott)
[18:52:17] <wikibugs>	 (03PS4) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[18:52:32] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:53:58] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[18:54:07] <brennen>	 Jdlrobson: talk to me about T342277 - is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/959035/ just needed before train can roll?
[18:54:08] <stashbot>	 T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277
[19:00:11] <wikibugs>	 (03PS5) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[19:02:46] <wikibugs>	 (03PS6) 10AOkoth: vrts: add ticket-cert.crt [puppet] - 10https://gerrit.wikimedia.org/r/959026
[19:04:29] <wikibugs>	 (03PS1) 10Jforrester: Wikifunctions: Update evaluator image to 2023-09-19-183305 [deployment-charts] - 10https://gerrit.wikimedia.org/r/959037
[19:04:45] <wikibugs>	 (03PS1) 10Eevans: Be explicit about the yaml loader class [debs/cassandra-tools-wmf] - 10https://gerrit.wikimedia.org/r/959038
[19:06:30] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" [core] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959009 (https://phabricator.wikimedia.org/T346800) (owner: 10Jforrester)
[19:07:18] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:959009|Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" (T346800)]]
[19:07:25] <stashbot>	 T346800: startupregistrystats-testwiki periodic job fails - https://phabricator.wikimedia.org/T346800
[19:08:18] <wikibugs>	 (03PS1) 10Gmodena: data-engineering: eventgate: standardize alerts [alerts] - 10https://gerrit.wikimedia.org/r/959039 (https://phabricator.wikimedia.org/T326002)
[19:14:14] <wikibugs>	 (03PS1) 10Jdlrobson: Disable client preferences cog in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363)
[19:15:05] <wikibugs>	 (03PS1) 10Jdlrobson: Fixes cannot read properties of undefined [extensions/MobileFrontend] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959010 (https://phabricator.wikimedia.org/T342277)
[19:15:07] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Disable client preferences cog in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson)
[19:16:19] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "To make clear here as well as on Phabricator, there are no objections from the development team. Would you like me to deploy this, or shou" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/947495 (https://phabricator.wikimedia.org/T343946) (owner: 10Mdaniels5757)
[19:16:48] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "As with I3d1115e97, there are no objections from the development team. Would you like me to deploy this, or should I leave it to you to sc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/948196 (https://phabricator.wikimedia.org/T344085) (owner: 10Mdaniels5757)
[19:20:09] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host pc1015
[19:21:34] <logmsgbot>	 !log vriley@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host pc1015
[19:23:00] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:24:24] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:24:26] <logmsgbot>	 !log vriley@cumin1001 START - Cookbook sre.hosts.provision for host pc1015.mgmt.eqiad.wmnet with reboot policy FORCED
[19:29:39] <logmsgbot>	 !log brennen@deploy2002 jforrester and brennen: Backport for [[gerrit:959009|Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" (T346800)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[19:29:45] <stashbot>	 T346800: startupregistrystats-testwiki periodic job fails - https://phabricator.wikimedia.org/T346800
[19:31:50] <logmsgbot>	 !log brennen@deploy2002 jforrester and brennen: Continuing with sync
[19:33:17] <wikibugs>	 (03CR) 10Kimberly Sarabia: Disable client preferences cog in production (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson)
[19:33:43] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Add kafka-jumbo10[11-15].eqiad.wmnet to the apps broker list [deployment-charts] - 10https://gerrit.wikimedia.org/r/958938 (https://phabricator.wikimedia.org/T336041) (owner: 10Brouberol)
[19:37:24] <wikibugs>	 (03PS1) 10Majavah: Set READ_NEW for Wikitech on OATHAuth multiple devices migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959042 (https://phabricator.wikimedia.org/T242031)
[19:37:26] <wikibugs>	 (03PS1) 10Majavah: Set WRITE_NEW for OATHAuth multiple devices on fishbowls/privates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959043 (https://phabricator.wikimedia.org/T242031)
[19:37:27] <taavi>	 jouncebot: nowandnext
[19:37:27] <jouncebot>	 For the next 0 hour(s) and 22 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T1800)
[19:37:27] <jouncebot>	 In 0 hour(s) and 22 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T2000)
[19:38:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:39:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:41:23] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10VRiley-WMF) pc1016 - C 6. U 31. port 30 CableID 3252 is having issues, will recheck cabling
[19:41:48] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[19:43:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[19:44:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:46:22] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence, 10Platform Engineering: Migrate restbase servers to Bullseye - https://phabricator.wikimedia.org/T331713 (10Eevans)
[19:47:54] <wikibugs>	 (03PS1) 10Jdlrobson: Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345664)
[19:48:01] <wikibugs>	 (03Abandoned) 10Jdlrobson: Disable client preferences cog in production [mediawiki-config] - 10https://gerrit.wikimedia.org/r/959040 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson)
[19:48:05] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:959009|Revert "ResourceLoader: Set 'virtualFilePath' for startup.js" (T346800)]] (duration: 40m 46s)
[19:48:10] <wikibugs>	 10SRE, 10Cassandra, 10Data-Persistence: Migrate cassandra-dev to Bullseye - https://phabricator.wikimedia.org/T331711 (10Eevans) 05Open→03Resolved a:03Eevans macro-deployed
[19:48:15] <stashbot>	 T346800: startupregistrystats-testwiki periodic job fails - https://phabricator.wikimedia.org/T346800
[19:48:38] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:50:02] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[19:51:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[19:52:35] <wikibugs>	 (03PS2) 10Jdlrobson: Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345363)
[19:56:00] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[19:56:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-web (canary) at codfw - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-release=canary&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T2000).
[20:00:05] <jouncebot>	 Jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:23] <TheresNoTime>	 (unavailable this evening, sorry)
[20:00:53] <Jdlrobson>	 @brennen around per your phab comment? 2 of these 3 are train blockers
[20:01:45] <brennen>	 Jdlrobson: yep
[20:01:52] <brennen>	 shall we just go in order?
[20:02:04] <urbanecm>	 i can deploy too
[20:02:10] <Jdlrobson>	 brennen: just waiting on CI and a review from a team mate on 959013 so that should be later in the deploy window
[20:02:13] <urbanecm>	 or maybe brennen's on it?
[20:02:16] <brennen>	 (i guess also: can any of these go out together)
[20:02:24] <brennen>	 urbanecm: i can handle this one, doing the train today anyhow
[20:02:32] <urbanecm>	 ack, ty.
[20:02:58] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MobileFrontend/+/959010/ can go first
[20:03:22] <wikibugs>	 (03CR) 10Jdlrobson: "recheck" [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[20:03:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy2002 using scap backport" [extensions/MobileFrontend] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959010 (https://phabricator.wikimedia.org/T342277) (owner: 10Jdlrobson)
[20:08:15] <Jdlrobson>	 https://gerrit.wikimedia.org/r/c/mediawiki/skins/Vector/+/959013 should be ready
[20:10:24] <icinga-wm>	 RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:10:27] <Jdlrobson>	 ^ brennen 
[20:10:38] <brennen>	 ack, waiting on previous patch.
[20:11:03] <Jdlrobson>	 👍
[20:17:28] <wikibugs>	 (03Merged) 10jenkins-bot: Fixes cannot read properties of undefined [extensions/MobileFrontend] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959010 (https://phabricator.wikimedia.org/T342277) (owner: 10Jdlrobson)
[20:18:02] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:959010|Fixes cannot read properties of undefined (T342277)]]
[20:18:08] <stashbot>	 T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277
[20:23:44] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 81, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:24:16] <wikibugs>	 (03CR) 10Brennen Bearnes: [C: 03+2] Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson)
[20:24:36] <brennen>	 (just realized i could +2 that one to get tests moving.)
[20:26:05] <wikibugs>	 (03PS1) 10Fabfur: WIP: add Dockerfile just for build [software/purged] - 10https://gerrit.wikimedia.org/r/959049
[20:26:23] <wikibugs>	 (03PS1) 10Fabfur: allow to specify buffer size for backend, frontend or both [software/purged] - 10https://gerrit.wikimedia.org/r/959050
[20:32:20] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[20:34:35] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudelastic1007-10 - jclark@cumin1001"
[20:35:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt cloudelastic1007-10 - jclark@cumin1001"
[20:35:22] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:36:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cloudelastic1010
[20:36:04] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cloudelastic1010
[20:36:38] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 83, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:37:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED
[20:38:02] <wikibugs>	 (03Merged) 10jenkins-bot: Disable client preferences by default [skins/Vector] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959013 (https://phabricator.wikimedia.org/T345363) (owner: 10Jdlrobson)
[20:38:34] <logmsgbot>	 !log brennen@deploy2002 jdlrobson and brennen: Backport for [[gerrit:959010|Fixes cannot read properties of undefined (T342277)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[20:38:40] <stashbot>	 T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277
[20:39:01] <brennen>	 Jdlrobson: anything to check on this one?
[20:39:47] <Jdlrobson>	 yeh i can verify on test wiki
[20:40:40] <Jdlrobson>	 brennen: ^
[20:41:31] <brennen>	 Jdlrobson: ack, i await your signal.
[20:41:43] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr)
[20:41:54] <Jdlrobson>	 brennen: is it on debug servers? I'm not seeing the changes
[20:42:28] <Jdlrobson>	 brennen: ah now i am :)
[20:42:37] <Jdlrobson>	 the MobileFrontend is good to go
[20:42:39] <logmsgbot>	 !log brennen@deploy2002 jdlrobson and brennen: Continuing with sync
[20:42:42] <brennen>	 cool, goin'
[20:42:47] <Jdlrobson>	 i'm not seeing the Vector one yet
[20:42:50] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:43:02] <Jdlrobson>	 presumably that's not synced yet?
[20:43:15] <brennen>	 yeah, not yet.
[20:44:16] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:44:36] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[20:50:17] <logmsgbot>	 !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@b603e64]: (no justification provided)
[20:50:27] <logmsgbot>	 !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) (duration: 00m 09s)
[20:50:42] <brennen>	 jouncebot nowandnext
[20:50:42] <jouncebot>	 For the next 0 hour(s) and 9 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230919T2000)
[20:50:42] <jouncebot>	 In 9 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230920T0600)
[20:51:40] <logmsgbot>	 !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@b603e64]: (no justification provided)
[20:51:46] <logmsgbot>	 !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) (duration: 00m 05s)
[20:54:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr)
[20:55:41] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:959010|Fixes cannot read properties of undefined (T342277)]] (duration: 37m 39s)
[20:55:48] <stashbot>	 T342277: Minerva font size setting should use new client side preferences - https://phabricator.wikimedia.org/T342277
[20:55:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['pki1002']
[20:57:03] <logmsgbot>	 !log brennen@deploy2002 Started scap: Backport for [[gerrit:959013|Disable client preferences by default (T345363)]]
[20:57:07] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['pki1002']
[20:57:09] <stashbot>	 T345363: Create font size settings interface functionality for vector - https://phabricator.wikimedia.org/T345363
[20:57:39] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (10Jclark-ctr) a:03Jclark-ctr
[20:58:24] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[20:58:55] <brennen>	 this one should be a bit faster since it's already merged.
[20:59:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1007']
[20:59:50] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:01:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudelastic1010.mgmt.eqiad.wmnet with reboot policy FORCED
[21:01:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1009']
[21:03:00] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for ahoelzl - https://phabricator.wikimedia.org/T345959 (10Ahoelzl) Thanks. With help of tech support I claimed my mediawiki.org AHoelzl-WMF account. It wasn't straightforward though ... I was able to link it to Phabricator.  Are yo...
[21:05:10] <brennen>	 Jdlrobson: correct in thinking https://gerrit.wikimedia.org/r/c/mediawiki/extensions/GrowthExperiments/+/959007/ shouldn't block?
[21:05:30] <Jdlrobson>	 brennen: yeh i think urbanecm said he could do this tomorrow
[21:05:35] <Jdlrobson>	 there's a CI issue on it
[21:05:39] <brennen>	 kk
[21:05:51] <Jdlrobson>	 ah urbanecm just got back to me about the CI issue
[21:05:55] <brennen>	 once the vector one finishes, i'll roll train forward.
[21:06:03] <Jdlrobson>	 but yeh I think this will have to wait until tomorrow
[21:06:08] <Jdlrobson>	 sorry urbanecm 
[21:07:15] <urbanecm>	 No worries. Can do it in the morning. 
[21:07:35] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1007']
[21:07:42] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1008']
[21:07:44] <urbanecm>	 But this error can be bypassed tbh. 
[21:11:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1009']
[21:11:31] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['cloudelastic1010']
[21:14:05] <wikibugs>	 (03PS1) 10Urbanecm: build: Update eslint-config-wikimedia to 0.25.1 [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959014 (https://phabricator.wikimedia.org/T346629)
[21:14:21] <wikibugs>	 (03PS2) 10Urbanecm: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[21:14:28] <wikibugs>	 (03PS3) 10Urbanecm: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[21:14:34] <wikibugs>	 (03PS4) 10Urbanecm: Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[21:16:00] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1008']
[21:16:09] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr)
[21:17:16] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudelastic10[07-10].wikimedia.org - https://phabricator.wikimedia.org/T342538 (10Jclark-ctr) @bking @RKemper  Please update when partman recipe in puppet repo  is finished
[21:17:26] <logmsgbot>	 !log brennen@deploy2002 jdlrobson and brennen: Backport for [[gerrit:959013|Disable client preferences by default (T345363)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
[21:17:32] <stashbot>	 T345363: Create font size settings interface functionality for vector - https://phabricator.wikimedia.org/T345363
[21:17:54] <brennen>	 Jdlrobson: ^ vector patch checkable?
[21:20:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['cloudelastic1010']
[21:21:48] <Jdlrobson>	 brennen: yep checking now
[21:22:22] <Jdlrobson>	 brennen: yep all good
[21:22:30] <Jdlrobson>	 please sync and roll forward the train!
[21:24:58] <brennen>	 cool, ty
[21:25:01] <logmsgbot>	 !log brennen@deploy2002 jdlrobson and brennen: Continuing with sync
[21:26:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[21:26:53] <wikibugs>	 (03CR) 10Jdlrobson: [C: 03+1] Change CSS selector for Minerva mobile menu icon [extensions/GrowthExperiments] (wmf/1.41.0-wmf.27) - 10https://gerrit.wikimedia.org/r/959007 (https://phabricator.wikimedia.org/T346459) (owner: 10Jdlrobson)
[21:28:58] <wikibugs>	 (03CR) 10Herron: [V: 03+1 C: 03+2] titan: switch to cfssl [puppet] - 10https://gerrit.wikimedia.org/r/956901 (owner: 10Herron)
[21:29:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt db12[26-33] - jclark@cumin1001"
[21:29:50] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (10Jclark-ctr)
[21:29:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update  mgmt db12[26-33] - jclark@cumin1001"
[21:29:51] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:30:19] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1226
[21:30:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1227
[21:31:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1227
[21:31:26] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1229
[21:31:41] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1226
[21:31:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1230
[21:32:20] <logmsgbot>	 !log cmooney@cumin1001 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
[21:32:24] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[21:32:47] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1229
[21:32:57] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1230
[21:33:44] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:34:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1231
[21:34:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1232
[21:34:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1232
[21:35:09] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1233
[21:36:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1231
[21:36:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1233
[21:36:44] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200): /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:37:04] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1232
[21:37:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host db1232
[21:37:49] <logmsgbot>	 !log brennen@deploy2002 Finished scap: Backport for [[gerrit:959013|Disable client preferences by default (T345363)]] (duration: 40m 45s)
[21:37:53] <Jdlrobson>	 thanks brennen. good luck with the train!
[21:37:54] <stashbot>	 T345363: Create font size settings interface functionality for vector - https://phabricator.wikimedia.org/T345363
[21:38:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[21:39:17] <wikibugs>	 (PS7) AOkoth: vrts: add ticket-cert.crt [puppet] - https://gerrit.wikimedia.org/r/959026
[21:39:44] <wikibugs>	 (PS8) AOkoth: vrts: add ticket-cert.crt [puppet] - https://gerrit.wikimedia.org/r/959026
[21:41:13] <brennen>	 !log train 1.41.0-wmf.27 (T345888): blockers resolved; rolling to group0
[21:41:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:41:25] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[21:41:28] <wikibugs>	 (CR) Cwhite: [C: +1] o11y: complement prometheus alerting rules [alerts] - https://gerrit.wikimedia.org/r/958929 (owner: Filippo Giunchedi)
[21:41:42] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.41.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/959058 (https://phabricator.wikimedia.org/T345888)
[21:41:44] <wikibugs>	 (CR) Cwhite: [C: +1] remove dispatch dns record [dns] - https://gerrit.wikimedia.org/r/957799 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[21:41:46] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.41.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/959058 (https://phabricator.wikimedia.org/T345888) (owner: TrainBranchBot)
[21:42:33] <wikibugs>	 (CR) Cwhite: [C: +1] dispatch: remove puppetization [puppet] - https://gerrit.wikimedia.org/r/957756 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[21:43:06] <wikibugs>	 (CR) Cwhite: [C: +1] dispatch::web: add ensure param and ensure => absent [puppet] - https://gerrit.wikimedia.org/r/957749 (https://phabricator.wikimedia.org/T344937) (owner: Herron)
[21:43:09] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.41.0-wmf.27 [mediawiki-config] - https://gerrit.wikimedia.org/r/959058 (https://phabricator.wikimedia.org/T345888) (owner: TrainBranchBot)
[21:43:44] <wikibugs>	 (PS9) AOkoth: vrts: add ticket-cert.crt [puppet] - https://gerrit.wikimedia.org/r/959026
[21:43:56] <wikibugs>	 (PS10) AOkoth: vrts: add ticket-cert.crt [puppet] - https://gerrit.wikimedia.org/r/959026
[21:45:05] <wikibugs>	 (CR) Cwhite: [C: +2] rsyslog: ingest 'excimer' logs from webperf to Logstash [puppet] - https://gerrit.wikimedia.org/r/937504 (https://phabricator.wikimedia.org/T339137) (owner: Krinkle)
[21:45:39] <wikibugs>	 (PS1) Ebernhardson: Draft: Pull some flink config down into the chart [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315)
[21:45:40] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[21:46:58] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[21:47:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host db1232
[21:48:21] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1232
[21:49:45] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1226.mgmt.eqiad.wmnet with reboot policy FORCED
[21:49:46] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED
[21:49:48] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1230.mgmt.eqiad.wmnet with reboot policy FORCED
[21:49:50] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1228.mgmt.eqiad.wmnet with reboot policy FORCED
[21:49:51] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1227.mgmt.eqiad.wmnet with reboot policy FORCED
[21:50:35] <logmsgbot>	 !log brennen@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.41.0-wmf.27  refs T345888
[21:50:40] <stashbot>	 T345888: 1.41.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T345888
[21:51:34] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:51:51] <wikibugs>	 (CR) Ebernhardson: "This will also need a puppet patch" [deployment-charts] - https://gerrit.wikimedia.org/r/959059 (https://phabricator.wikimedia.org/T346315) (owner: Ebernhardson)
[21:56:34] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (DELETE pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[21:57:19] <wikibugs>	 (PS2) Bking: rdf-streaming-updater: start adding per-env ZK path root [deployment-charts] - https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149)
[21:58:51] <wikibugs>	 (CR) Bking: rdf-streaming-updater: start adding per-env ZK path root (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/957967 (https://phabricator.wikimedia.org/T342149) (owner: Bking)
[22:23:09] <jinxer-wm>	 (JobUnavailable) firing: (6) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:23:32] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=jumbo-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[22:37:04] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1029 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:38:28] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1029 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:40:12] <icinga-wm>	 RECOVERY - Check systemd state on deploy2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:44:30] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: imagecatalog_record.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:45:26] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED
[22:46:12] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:46:18] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED
[22:47:36] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase2018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[22:51:24] <logmsgbot>	 !log bearloga@deploy2002 Started deploy [airflow-dags/analytics_product@b603e64]: (no justification provided)
[22:51:29] <logmsgbot>	 !log bearloga@deploy2002 Finished deploy [airflow-dags/analytics_product@b603e64]: (no justification provided) (duration: 00m 05s)
[22:54:05] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED
[22:56:03] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1226.mgmt.eqiad.wmnet with reboot policy FORCED
[22:56:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1227.mgmt.eqiad.wmnet with reboot policy FORCED
[22:57:18] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1230.mgmt.eqiad.wmnet with reboot policy FORCED
[22:57:23] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1228.mgmt.eqiad.wmnet with reboot policy FORCED
[22:58:22] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1231.mgmt.eqiad.wmnet with reboot policy FORCED
[22:58:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1232.mgmt.eqiad.wmnet with reboot policy FORCED
[22:58:25] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1233.mgmt.eqiad.wmnet with reboot policy FORCED
[23:07:40] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:09:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:12:49] <wikibugs>	 (PS1) Cwhite: Revert "Add the analytics and search-platform teams to flink zk contacts" [puppet] - https://gerrit.wikimedia.org/r/959015
[23:13:36] <wikibugs>	 (CR) Cwhite: [C: +2] Revert "Add the analytics and search-platform teams to flink zk contacts" [puppet] - https://gerrit.wikimedia.org/r/959015 (owner: Cwhite)
[23:16:16] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1023 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:18:39] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1231.mgmt.eqiad.wmnet with reboot policy FORCED
[23:18:43] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1233.mgmt.eqiad.wmnet with reboot policy FORCED
[23:19:06] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:23:16] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr)
[23:24:10] <icinga-wm>	 RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga
[23:26:20] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1232.mgmt.eqiad.wmnet with reboot policy FORCED
[23:26:44] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr)
[23:27:22] <wikibugs>	 SRE, AQS2.0, Cassandra, serviceops, Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (Eevans) Apologies for the amount of time that has passed, I only just noticed this ticket.  I took a quick scan of the repo and hav...
[23:29:28] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[23:30:49] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[23:31:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.provision for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED
[23:35:31] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/feed/announcements (Retrieve announcements) is CRITICAL: Test Retrieve announcements returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:36:39] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:40:17] <wikibugs>	 SRE, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install db12[26-33] - https://phabricator.wikimedia.org/T342176 (Jclark-ctr)
[23:51:02] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1229.mgmt.eqiad.wmnet with reboot policy FORCED
[23:51:43] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1226']
[23:51:47] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1227']
[23:52:15] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1032 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[23:52:49] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1228']
[23:52:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1229']
[23:52:57] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db1230']
[23:55:03] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1032 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase