[01:34:01] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[01:38:03] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[02:07:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:45] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:13:46] (PS1) Superpes15: [tawiki] Add Draft and Draft_talk namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/890198 (https://phabricator.wikimedia.org/T329248)
[03:15:06] SRE, DNS, Traffic, Chinese-Sites: Let all requests from mainland China will be processed to codfw/esams/drmrs - https://phabricator.wikimedia.org/T330024 (Shizhao)
[03:49:06] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:22] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:36:22] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[04:36:30] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[04:36:58] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:04:12] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:04:12] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:04:16] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:04:46] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:28:16] PROBLEM - BFD status on cr1-drmrs is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:28:18] PROBLEM - OSPF status on cr1-drmrs is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:28:20] PROBLEM - BFD status on cr1-eqiad is CRITICAL: CRIT: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:28:50] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:30:40] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:31:58] RECOVERY - BFD status on cr1-drmrs is OK: OK: UP: 3 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:31:58] RECOVERY - OSPF status on cr1-drmrs is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:32:02] RECOVERY - BFD status on cr1-eqiad is OK: OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[05:38:46] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[05:42:48] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[06:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:40:47] I am restarting both Gerrit instances this morning
[06:40:55] !log Restarting Gerrit
[06:40:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:34] (PS3) Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838)
[07:22:51] (CR) Elukey: [C: +2] sre.k8s.upgrade-cluster: simplify code and extend downtimes [cookbooks] - https://gerrit.wikimedia.org/r/889962 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[07:23:27] (CR) Elukey: [C: +2] Add istio and kserve settings for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[07:24:31] SRE, Language-Team: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (santhosh)
[07:26:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:26:10] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:27:19] ops-codfw, Infrastructure-Foundations, netops: asw-a-codfw management interface unreachable - https://phabricator.wikimedia.org/T330048 (ayounsi) p:Triage→High
[07:27:40] !log running migrateTagTemplate.php on all wikis (T329766)
[07:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:45] T329766: Run the maintenance script linter extension migrateTagTemplate.php on all wikis - https://phabricator.wikimedia.org/T329766
[07:28:45] (Merged) jenkins-bot: Add istio and kserve settings for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/889760 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[07:30:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2007:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:31:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:31:42] PROBLEM - Check systemd state on kubestagemaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: kube-controller-manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:35:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2007:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:36:22] (CR) Elukey: [C: +2] ml-services: update docker images for outlink and revscoring [deployment-charts] - https://gerrit.wikimedia.org/r/889773 (https://phabricator.wikimedia.org/T328576) (owner: Elukey)
[07:38:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (2) Blazegraph instance wdqs2003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:39:18] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:39:26] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:39:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:40:00] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:40:35] (PS1) Slyngshede: P:ldap:bitu add missing aux ldap group. [puppet] - https://gerrit.wikimedia.org/r/890229
[07:40:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:40:53] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:41:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[07:41:11] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[07:43:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (2) Blazegraph instance wdqs2003:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:44:06] !log imported jenkins 2.375.3 to thirdparty/ci T330045
[07:44:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:44:10] T330045: Upgrade Jenkins to latest LTS 2.375.3 - https://phabricator.wikimedia.org/T330045
[07:44:48] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/890229 (owner: Slyngshede)
[07:44:55] (CR) Slyngshede: [C: +2] P:ldap:bitu add missing aux ldap group. [puppet] - https://gerrit.wikimedia.org/r/890229 (owner: Slyngshede)
[07:47:37] (PS1) Elukey: role::ml_k8s::staging: set istio-cni to 1.15.x [puppet] - https://gerrit.wikimedia.org/r/890230 (https://phabricator.wikimedia.org/T327767)
[07:48:31] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs1013:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:48:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (3) Blazegraph instance wdqs2002:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:49:27] (CR) Elukey: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39724/console" [puppet] - https://gerrit.wikimedia.org/r/890230 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[07:53:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (3) Blazegraph instance wdqs1013:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:58:46] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: (5) Blazegraph instance wdqs1012:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[07:59:31] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: (4) Blazegraph instance wdqs1012:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T0800)
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:01:48] RECOVERY - Check systemd state on ms-be1069 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:04:34] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39725/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:06:32] (CR) Elukey: [V: +1 C: +2] role::ml_k8s::staging: set istio-cni to 1.15.x [puppet] - https://gerrit.wikimedia.org/r/890230 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[08:07:29] (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39726/console" [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[08:08:48] !log updating openjdk-11 on elastic* servers T329957
[08:08:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:53] T329957: Restart Elastic services to pick up JRE updates - https://phabricator.wikimedia.org/T329957
[08:09:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs2005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:10:21] these BlazegraphFreeAllocatorsDecreasingRapidly alerts are all false positives...
[08:14:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs2005:9194 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[08:14:27] (PS1) DCausse: team-search-platform: relax BlazegraphFreeAllocatorsDecreasingRapidly [alerts] - https://gerrit.wikimedia.org/r/890232
[08:16:30] (CR) Nicolas Fraison: admin: create nfraison user and add it to analytics_privatedata_users (2 comments) [puppet] - https://gerrit.wikimedia.org/r/886890 (https://phabricator.wikimedia.org/T328915) (owner: Nicolas Fraison)
[08:20:23] @Amir , urbanecm If I add a patch on the list, can anyone of you deploy? :)
[08:21:00] *@Amir1
[08:21:08] I'm around
[08:21:22] Superpes: but wait for a bit, I'm about to make some large changes to IS.php
[08:22:19] SRE, Language-Team: Hosting machine request for machine translation - https://phabricator.wikimedia.org/T329971 (elukey) Thanks a lot for the write-up @santhosh! Since the service is written in Python, I am wondering if we could host it on Lift Wing (the new ML infra that should replace ORES). Lift Wing...
[08:25:54] Superpes: what are your changes?
[08:26:48] Np :D I noticed the table was empty, that's why I said so, but if you're busy don't worry :)
[08:26:56] Amir1 It's https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/890198/
[08:27:23] I'm about to move all of NS-related pieces to a dedicated file :D
[08:27:33] T308932
[08:27:33] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[08:27:53] LMAO
[08:28:29] (PS3) Nicolas Fraison: resuse-zookeeper-data: add reuse partman conf for zk data [puppet] - https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362)
[08:32:36] (CR) Jelto: [C: -1] "comment in line" [puppet] - https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[08:34:08] (PS1) Ladsgroup: Move all of NS-related config out of IS.php to a dedicated file [mediawiki-config] - https://gerrit.wikimedia.org/r/890234 (https://phabricator.wikimedia.org/T308932)
[08:36:00] (CR) Ladsgroup: [C: +2] Move all of NS-related config out of IS.php to a dedicated file [mediawiki-config] - https://gerrit.wikimedia.org/r/890234 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[08:36:38] (Merged) jenkins-bot: Move all of NS-related config out of IS.php to a dedicated file [mediawiki-config] - https://gerrit.wikimedia.org/r/890234 (https://phabricator.wikimedia.org/T308932) (owner: Ladsgroup)
[08:38:37] (PS1) Elukey: Update istio specs for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/890235 (https://phabricator.wikimedia.org/T327767)
[08:39:40] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:39:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:40:10] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:40:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:40:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:40:44] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:40:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:40:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:41:02] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:41:15] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:41:21] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:41:24] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:41:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:41:55] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Mepps out of all services on: 1067 hosts
[08:41:55] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:42:01] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[08:42:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Mepps out of all services on: 1067 hosts
[08:43:44] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:43:49] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Mepps out of all services on: 946 hosts
[08:44:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Mepps out of all services on: 946 hosts
[08:45:44] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[08:47:10] (PS1) RLazarus: mediawiki-cache-warmup: Rewrite in Python [puppet] - https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867)
[08:49:51] (CR) CI reject: [V: -1] mediawiki-cache-warmup: Rewrite in Python [puppet] - https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867) (owner: RLazarus)
[08:50:08] (CR) Ayounsi: [C: +2] Remove "old" VRRP support [software/homer/deploy] - https://gerrit.wikimedia.org/r/838171 (https://phabricator.wikimedia.org/T260363) (owner: Ayounsi)
[08:51:19] (PS2) RLazarus: mediawiki-cache-warmup: Rewrite in Python [puppet] - https://gerrit.wikimedia.org/r/890299 (https://phabricator.wikimedia.org/T288867)
[08:52:13] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[08:52:30] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update wmf-plugin - ayounsi@cumin1001
[08:54:07] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update wmf-plugin - ayounsi@cumin1001
[08:54:32] !log ladsgroup@deploy1002 Synchronized wmf-config/core-Namespaces.php: Move all of NS-related config out of IS.php to a dedicated file, part I (T308932) (duration: 16m 10s)
[08:54:36] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[08:56:38] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:56:50] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[08:58:11] these are due to ml-staging not being fully bootstrapped --^
[08:58:13] working on it
[08:58:52] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:01:51] !log ladsgroup@deploy1002 Synchronized multiversion/MWConfigCacheGenerator.php: Move all of NS-related config out of IS.php to a dedicated file, part II (T308932) (duration: 06m 47s)
[09:01:55] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[09:02:12] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:02:22] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:03:11] (CR) Jaime Nuche: jenkins: remove hardcoded values from sudo rule (1 comment) [puppet] - https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[09:03:23] (Abandoned) Jaime Nuche: jenkins: remove hardcoded values from sudo rule [puppet] - https://gerrit.wikimedia.org/r/886373 (https://phabricator.wikimedia.org/T323909) (owner: Jaime Nuche)
[09:05:36] (CR) Ayounsi: [C: +1] "```" [puppet] - https://gerrit.wikimedia.org/r/889113 (owner: Volans)
[09:05:53] (CR) Nikerabbit: "This change affects all Wikimedia wikis using the Translate extension. I don't it's possible to get "consensus" because it's difficult to " [mediawiki-config] - https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) (owner: Winston Sung)
[09:06:38] !log delete old group ID custom field from Netbox - https://netbox.wikimedia.org/extras/custom-fields/6/ - T260363
[09:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:42] T260363: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363
[09:09:03] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:09:26] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Move all of NS-related config out of IS.php to a dedicated file, part III (T308932) (duration: 06m 24s)
[09:09:30] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[09:10:12] (CR) David Caro: [C: +2] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/881711 (owner: Majavah)
[09:13:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:13:26] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:20:00] SRE, DBA, Data-Persistence, Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (Jelto)
[09:22:35] (PS1) Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/890346 (https://phabricator.wikimedia.org/T330056)
[09:23:57] SRE, Infrastructure-Foundations, netops: Standardize VRRP group IDs - https://phabricator.wikimedia.org/T260363 (ayounsi) Open→Declined We're slowly moving away from VRRP. The benefits of renumbering them all is not worth the time, especially as we removed the custom field in favor or {T311218}.
[09:25:18] (CR) Winston Sung: Update $wgTranslateDisabledTargetLanguages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) (owner: Winston Sung)
[09:26:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330056
[09:26:55] T330056: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T330056
[09:27:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 34 hosts with reason: Primary switchover s8 T330056
[09:27:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db2161 with weight 0 T330056', diff saved to https://phabricator.wikimedia.org/P44686 and previous config saved to /var/cache/conftool/dbconfig/20230220-092727-ladsgroup.json
[09:32:08] (CR) Superpes15: "This change is ready for review." [mediawiki-config] - https://gerrit.wikimedia.org/r/890198 (https://phabricator.wikimedia.org/T329248) (owner: Superpes15)
[09:33:10] RECOVERY - Disk space on dumpsdata1001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1001&var-datasource=eqiad+prometheus/ops
[09:33:54] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[09:35:22] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:35:34] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:37:19] (CR) Winston Sung: Update $wgTranslateDisabledTargetLanguages (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838) (owner: Winston Sung)
[09:37:56] (PS4) Winston Sung: Update $wgTranslateDisabledTargetLanguages [mediawiki-config] - https://gerrit.wikimedia.org/r/889616 (https://phabricator.wikimedia.org/T328838)
[09:38:46] RECOVERY - Disk space on dumpsdata1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=dumpsdata1003&var-datasource=eqiad+prometheus/ops
[09:39:01] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:43:02] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[09:43:37] (CR) Ayounsi: [C: +2] Remove pfw BFD special case [puppet] - https://gerrit.wikimedia.org/r/889062 (https://phabricator.wikimedia.org/T329272) (owner: Ayounsi)
[09:43:43] (PS2) Ayounsi: Remove pfw BFD special case [puppet] - https://gerrit.wikimedia.org/r/889062 (https://phabricator.wikimedia.org/T329272)
[09:44:05] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[09:44:13] (PS13) Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193)
[09:44:20] (PS8) Clément Goubert: sre.discovery.datacenter: Add 'all' to status [cookbooks] - https://gerrit.wikimedia.org/r/889539
[09:45:13] (PS2) Elukey: Update istio specs for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/890235 (https://phabricator.wikimedia.org/T327767)
[09:45:15] (PS1) Elukey: knative-serving: remove env variables for k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/890345 (https://phabricator.wikimedia.org/T327767)
[09:45:22] (PS1) Volans: python_deploy: fix path for post-deploy [puppet] - https://gerrit.wikimedia.org/r/890386
[09:46:09] (CR) Ayounsi: [C: +1] python_deploy: fix path for post-deploy [puppet] - https://gerrit.wikimedia.org/r/890386 (owner: Volans)
[09:47:25] (CR) Elukey: [C: +1] resuse-zookeeper-data: add reuse partman conf for zk data [puppet] - https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362) (owner: Nicolas Fraison)
[09:48:14] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:48:28] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:48:33] !log Point out risk of MW train failing on Feb 21st in https://wikitech.wikimedia.org/wiki/Deployments#Tuesday,_February_21 due to WikiKube codfw upgrade
[09:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:39] (PS2) Ladsgroup: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/890346 (https://phabricator.wikimedia.org/T330056) (owner: Gerrit maintenance bot)
[09:48:50] (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/890346 (https://phabricator.wikimedia.org/T330056) (owner: Gerrit maintenance bot)
[09:49:49] (PS2) Elukey: knative-serving,kserve: remove env variables for k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/890345 (https://phabricator.wikimedia.org/T327767)
[09:49:53] (PS3) Elukey: Update istio specs for ml-staging-codfw [deployment-charts] - https://gerrit.wikimedia.org/r/890235 (https://phabricator.wikimedia.org/T327767)
[09:50:49] XioNoX: hey, you're holding the puppet-merge lock, is that intentional? locking process tree: systemd---sshd---sshd---sshd(ayounsi)---bash---sudo(root)---puppet-merge---python3(gitpuppet)
[09:51:05] Amir1: see -sre
[09:51:10] oh oka
[09:51:32] thanks!
[09:52:31] !log Starting s8 codfw failover from db2165 to db2161 - T330056
[09:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:52:35] T330056: Switchover s8 master (db2165 -> db2161) - https://phabricator.wikimedia.org/T330056
[09:53:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db2161 to s8 primary T330056', diff saved to https://phabricator.wikimedia.org/P44687 and previous config saved to /var/cache/conftool/dbconfig/20230220-095308-ladsgroup.json
[09:55:10] (CR) Elukey: [C: +2] knative-serving,kserve: remove env variables for k8s api [deployment-charts] - https://gerrit.wikimedia.org/r/890345 (https://phabricator.wikimedia.org/T327767) (owner: Elukey)
[09:55:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db2165 T330056', diff saved to https://phabricator.wikimedia.org/P44688 and previous config saved to /var/cache/conftool/dbconfig/20230220-095526-ladsgroup.json
[09:57:46] (CR) Slyngshede: [V: +1 C: +2] P:idp add IDM OIDC profile [puppet] - https://gerrit.wikimedia.org/r/889974 (owner: Slyngshede)
[09:57:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[09:58:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[10:00:41] (PS31) Slyngshede: P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753
[10:00:52] (CR) Slyngshede: P:idm configure production IDM (7 comments) [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[10:01:09] (CR) CI reject: [V: -1] P:idm configure production IDM [puppet] - https://gerrit.wikimedia.org/r/889753 (owner: Slyngshede)
[10:02:37] (CR) Peter Fischer: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/890232 (owner: DCausse)
[10:03:32] (CR) DCausse: [C: +2] team-search-platform: relax BlazegraphFreeAllocatorsDecreasingRapidly [alerts] - https://gerrit.wikimedia.org/r/890232 (owner: DCausse)
[10:04:46] (Merged) jenkins-bot: team-search-platform: relax BlazegraphFreeAllocatorsDecreasingRapidly [alerts] - https://gerrit.wikimedia.org/r/890232 (owner: DCausse)
[10:05:23] (CR) Volans: [C: +2] python_deploy: fix path for post-deploy [puppet] - https://gerrit.wikimedia.org/r/890386 (owner: Volans)
[10:05:58] RECOVERY - Check systemd state on kubestagemaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:32] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:10:14] RECOVERY - Check systemd state on kubestagemaster1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:04] !log ayounsi@cumin1001 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update wmf-plugin - ayounsi@cumin1001 [10:11:19] jouncebot: nowandnext [10:11:20] No deployments scheduled for the next 0 hour(s) and 48 minute(s) [10:11:20] In 0 hour(s) and 48 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1100) [10:12:39] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1001.eqiad.wmnet with reason: Update wmf-plugin - ayounsi@cumin1001 [10:12:42] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:13:46] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:14:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 2711 [10:15:13] (03CR) 10Muehlenhoff: [C: 03+2] Tweak scalability of KDC requests [puppet] - 10https://gerrit.wikimedia.org/r/889971 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [10:15:50] (03CR) 10Hashar: jenkins: fix directory in sudo rule (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/886911 (https://phabricator.wikimedia.org/T319406) (owner: 10Jelto) [10:16:21] !log ayounsi@cumin1001 END (FAIL) - Cookbook sre.network.peering (exit_code=99) with action 'email' for AS: 2711 [10:17:23] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgLexemeEnableNewAlpha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890387 (https://phabricator.wikimedia.org/T307866) [10:18:40] PROBLEM - Check systemd state on ms-be1069 is CRITICAL: CRITICAL - 
degraded: The following units failed: swift_rclone_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:00] (03CR) 10Jbond: [C: 03+1] spicerack: get authdns servers from config file [software/spicerack] - 10https://gerrit.wikimedia.org/r/889601 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [10:19:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [10:20:09] (03PS1) 10Muehlenhoff: Revert "Tweak scalability of KDC requests" [puppet] - 10https://gerrit.wikimedia.org/r/890388 (https://phabricator.wikimedia.org/T329831) [10:21:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 38565 [10:21:36] !log zabe@deploy1002 Synchronized private/PrivateSettings.php: (no justification provided) (duration: 06m 59s) [10:21:44] !log deployed updated mitigations for T326691 [10:21:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:32] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Tweak scalability of KDC requests" [puppet] - 10https://gerrit.wikimedia.org/r/890388 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [10:22:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:23:18] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:23:43] (03CR) 10Ayounsi: "Now that there is progress on custom validation it's probably better to have this as a validator and decom the report once the validator i" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/812376 (https://phabricator.wikimedia.org/T303529) 
(owner: 10Cathal Mooney) [10:24:42] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [10:25:33] (03CR) 10Nicolas Fraison: [C: 03+2] resuse-zookeeper-data: add reuse partman conf for zk data [puppet] - 10https://gerrit.wikimedia.org/r/889954 (https://phabricator.wikimedia.org/T329362) (owner: 10Nicolas Fraison) [10:26:58] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:27:49] (03CR) 10Ayounsi: "Cleaning up my Gerrit dashboard, don't hesitate to re-add me as reviewer when needed." [puppet] - 10https://gerrit.wikimedia.org/r/634572 (owner: 10Jbond) [10:29:00] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:40] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:31:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [10:32:11] (03PS1) 10Muehlenhoff: Tweak scalability of KDC requests (v2) [puppet] - 10https://gerrit.wikimedia.org/r/890389 (https://phabricator.wikimedia.org/T329831) [10:34:48] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/890389 
(https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [10:37:51] (03CR) 10Muehlenhoff: [C: 03+2] Tweak scalability of KDC requests (v2) [puppet] - 10https://gerrit.wikimedia.org/r/890389 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [10:37:58] (03CR) 10Slyngshede: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [10:41:21] (03PS1) 10JMeybohm: wikiube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T329664) [10:42:12] (03PS1) 10Majavah: P:ssl: remove unused deployment-prep certificates [puppet] - 10https://gerrit.wikimedia.org/r/890391 [10:42:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Move management routers ssh port - https://phabricator.wikimedia.org/T277438 (10ayounsi) Before we merge/deploy any of those changes, Rancid and [[ https://github.com/wikimedia/operations-software-homer/blob/de281c32054862799dbf8102ed627d7d... 
[10:42:45] (03CR) 10Jbond: [C: 03+1] "lgtm some minor optional nits" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [10:43:27] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39727/console" [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T329664) (owner: 10JMeybohm) [10:43:51] (03CR) 10Volans: [C: 03+2] spicerack: get authdns servers from config file [software/spicerack] - 10https://gerrit.wikimedia.org/r/889601 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [10:44:13] (03PS25) 10Jbond: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [10:44:48] (03CR) 10Jbond: [C: 03+2] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [10:44:50] PROBLEM - Kerberos KDC daemon on krb2001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [10:46:28] PROBLEM - Kerberos KDC daemon on krb1001 is CRITICAL: PROCS CRITICAL: 9 processes with args /usr/sbin/krb5kdc https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos%23Daemons_and_their_roles [10:47:53] (03Merged) 10jenkins-bot: spicerack: get authdns servers from config file [software/spicerack] - 10https://gerrit.wikimedia.org/r/889601 (https://phabricator.wikimedia.org/T329773) (owner: 10Clément Goubert) [10:48:22] ^ that's a monitoring glitch, the KDCs are fine, will make a new patch to address this [10:49:31] (03PS1) 10JMeybohm: admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T329664) [10:51:41] (03PS2) 10JMeybohm: wikikube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 
(https://phabricator.wikimedia.org/T329664) [10:52:39] (03PS4) 10Jbond: redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 [10:52:45] (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:52:47] (03PS1) 10Muehlenhoff: Adapt KDC monitoring to dynamic KDC worker count [puppet] - 10https://gerrit.wikimedia.org/r/890393 (https://phabricator.wikimedia.org/T329831) [10:52:50] (03CR) 10Jbond: [C: 03+2] redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 (owner: 10Jbond) [10:52:59] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [10:54:32] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:55:56] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/890393 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [10:56:05] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39729/console" [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T329664) (owner: 10JMeybohm) [10:56:20] (03Merged) 10jenkins-bot: redfish: allow for refreshing the manager info [software/spicerack] - 10https://gerrit.wikimedia.org/r/885873 (owner: 10Jbond) [10:58:22] (03CR) 10Jbond: redfish: add upload/update methods (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [10:58:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [10:59:04] (03PS26) 10Jbond: redfish: 
add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1100) [11:01:16] (03CR) 10Volans: [C: 03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [11:01:29] (03CR) 10Jbond: [C: 03+1] Adapt KDC monitoring to dynamic KDC worker count [puppet] - 10https://gerrit.wikimedia.org/r/890393 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [11:02:47] (03PS1) 10Ayounsi: Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) [11:03:05] (03CR) 10Nicolas Fraison: [C: 03+1] Adapt KDC monitoring to dynamic KDC worker count [puppet] - 10https://gerrit.wikimedia.org/r/890393 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [11:03:07] (03CR) 10Jbond: [C: 03+2] redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [11:03:09] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[11:03:55] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 38565 [11:04:03] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 30781 [11:04:10] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39730/console" [puppet] - 10https://gerrit.wikimedia.org/r/824202 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro) [11:04:38] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 30781 [11:05:43] (03CR) 10Muehlenhoff: [C: 03+2] Adapt KDC monitoring to dynamic KDC worker count [puppet] - 10https://gerrit.wikimedia.org/r/890393 (https://phabricator.wikimedia.org/T329831) (owner: 10Muehlenhoff) [11:06:29] (03Merged) 10jenkins-bot: redfish: add upload/update methods [software/spicerack] - 10https://gerrit.wikimedia.org/r/884989 (owner: 10Jbond) [11:06:49] (03CR) 10CI reject: [V: 04-1] Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:06:54] (03CR) 10Nicolas Fraison: [C: 03+2] perf(presto): remove some configuration tuning [puppet] - 10https://gerrit.wikimedia.org/r/889581 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [11:07:04] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:07:50] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:08:32] (03CR) 10Jbond: [C: 03+1] sre.k8s.wipe-cluster: add extra ask_confirmation for etcd [cookbooks] - 
10https://gerrit.wikimedia.org/r/889997 (owner: 10Elukey) [11:09:12] (03CR) 10Elukey: [C: 03+2] sre.k8s.wipe-cluster: add extra ask_confirmation for etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889997 (owner: 10Elukey) [11:09:16] (03PS2) 10Elukey: sre.k8s.wipe-cluster: add extra ask_confirmation for etcd [cookbooks] - 10https://gerrit.wikimedia.org/r/889997 [11:09:34] (03CR) 10Ladsgroup: "I'm not sure how this can be deployed without making things explode. Is it already cherry-picked in beta's puppetmaster? If not, try it an" [puppet] - 10https://gerrit.wikimedia.org/r/833130 (owner: 10Zabe) [11:09:36] (03PS1) 10Hnowlan: Add service records for device-analytics using ingress. [dns] - 10https://gerrit.wikimedia.org/r/890398 (https://phabricator.wikimedia.org/T320967) [11:10:40] (03CR) 10Zabe: Don't create a second disk through lvm (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/833130 (owner: 10Zabe) [11:11:00] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:11:04] (03PS7) 10Nicolas Fraison: perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) [11:11:13] (03PS8) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [11:11:15] (03PS2) 10Nicolas Fraison: presto: add 5 nodes to the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889994 (https://phabricator.wikimedia.org/T329525) [11:11:17] (03CR) 10Jbond: [C: 03+1] "lgtm the ci issue is unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:11:27] (03PS5) 10Ladsgroup: Don't create a second disk through lvm [puppet] - 10https://gerrit.wikimedia.org/r/833130 (owner: 10Zabe) [11:11:34] (03CR) 10Ladsgroup: [C: 03+2] "beta only" [puppet] - 
10https://gerrit.wikimedia.org/r/833130 (owner: 10Zabe) [11:11:46] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Don't create a second disk through lvm [puppet] - 10https://gerrit.wikimedia.org/r/833130 (owner: 10Zabe) [11:12:45] (JobUnavailable) firing: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:17:17] (03CR) 10Jbond: [C: 04-1] "lgtm apart from the missing comma" [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:17:26] (03PS2) 10Nicolas Fraison: presto: add last 5 nodes to prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889995 (https://phabricator.wikimedia.org/T329525) [11:17:29] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] "Noop in prod: ladsgroup@puppetdb1002:~$ curl -G localhost:8080/pdb/query/v4/resources --data-urlencode 'query=["and",["=","type","Class"]," [puppet] - 10https://gerrit.wikimedia.org/r/833130 (owner: 10Zabe) [11:17:49] (03PS12) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [11:18:02] (03CR) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. (034 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [11:18:26] (03PS13) 10Slyngshede: icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) [11:20:10] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:48] (03PS2) 10Ayounsi: Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) [11:22:24] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:22:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:25:28] (03CR) 10CI reject: [V: 04-1] Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [11:27:35] (03PS2) 10JMeybohm: admin_ng: Update wikikube-codfw settings to k8s 1.23 [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) [11:29:31] (03CR) 10JMeybohm: admin_ng: Update wikikube-codfw settings to k8s 1.23 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [11:29:41] (03CR) 10Muehlenhoff: [C: 03+2] gitlab: Remove net.core.somaxconn sysctl [puppet] - 10https://gerrit.wikimedia.org/r/889976 (owner: 10Muehlenhoff) [11:30:48] (03CR) 10Jbond: [C: 03+1] icinga: allow wait_for_optimal to ignore ack'ed alerts. 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [11:37:31] (03PS3) 10JMeybohm: wikikube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) [11:39:02] (03PS1) 10Ilias Sarantopoulos: ml-services: upgrade revertrisk staging to debian bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/890401 (https://phabricator.wikimedia.org/T328439) [11:39:36] (03CR) 10Jbond: [C: 03+1] Add SPDX headers to additional DE profiles [puppet] - 10https://gerrit.wikimedia.org/r/890000 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [11:39:43] 10SRE, 10Infrastructure-Foundations: KDC performance tuning for TCP requests - https://phabricator.wikimedia.org/T329831 (10MoritzMuehlenhoff) 05Open→03Resolved The amount of KDC workers is now configurable via the new profile::kerberos::kdc::workers Hiera setting and has been raised to 8. In the addition... [11:42:51] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39731/console" [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [11:44:47] (03CR) 10Volans: [C: 03+1] "LGTM, thanks for all the fixes!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [11:47:09] (03CR) 10Stevemunene: [C: 03+1] fix(presto): fix typo from node.enviroment to node.environment [puppet] - 10https://gerrit.wikimedia.org/r/889807 (owner: 10Nicolas Fraison) [11:47:24] (03CR) 10Stevemunene: [C: 03+1] "Looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [11:53:59] 10SRE, 10DBA, 10Data-Persistence, 10Infrastructure-Foundations, and 9 others: codfw row B switches upgrade - https://phabricator.wikimedia.org/T327991 (10Clement_Goubert) [11:58:11] (03CR) 10Slyngshede: [C: 03+2] icinga: allow wait_for_optimal to ignore ack'ed alerts. [software/spicerack] - 10https://gerrit.wikimedia.org/r/840128 (https://phabricator.wikimedia.org/T319277) (owner: 10Slyngshede) [12:00:35] (03CR) 10Klausman: [C: 03+1] ml-services: upgrade revertrisk staging to debian bullseye [deployment-charts] - 10https://gerrit.wikimedia.org/r/890401 (https://phabricator.wikimedia.org/T328439) (owner: 10Ilias Sarantopoulos) [12:02:02] (03PS1) 10Ayounsi: Rancid: use port 2222 for mgmt routers [puppet] - 10https://gerrit.wikimedia.org/r/890402 (https://phabricator.wikimedia.org/T277438) [12:08:19] !log upload openjdk-8 8u362-ga-4~deb11u1 to component/jdk8 for wikimedia-bullseye (forward port of latest Java 8 security fixes) [12:08:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:36] 10SRE, 10SRE-Access-Requests: Requesting access to deploymentt for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10Sgs) [12:12:23] !log installing Java 8 security updates on Bullseye [12:12:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:32] 10SRE, 10SRE-Access-Requests: Requesting access to deploymentt for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10DMburugu) I approve access to deployment for @Sgs [12:15:10] (03CR) 10Nicolas Fraison: [C: 03+2] perf(presto): add join-distribution-type to config [puppet] - 10https://gerrit.wikimedia.org/r/889582 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [12:15:59] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp2002.codfw.wmnet with OS bullseye [12:17:55] (03CR) 
10Nicolas Fraison: [C: 03+2] chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [12:18:04] (03PS9) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 [12:18:39] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/890409 (owner: 10L10n-bot) [12:19:04] !log cgoubert@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on 32 hosts with reason: In setup [12:19:09] (03PS1) 10Alexandros Kosiaris: KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 [12:19:18] 10ops-eqiad, 10DC-Ops: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10jijiki) [12:19:27] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on 32 hosts with reason: In setup [12:19:31] 10SRE, 10serviceops: mw2420-mw2451 service implementation tracking - https://phabricator.wikimedia.org/T326363 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=7c189d79-c66e-4544-923a-2145f8cedf2f) set by cgoubert@cumin1001 for 14 days, 0:00:00 on 32 host(s) and their services with reason: I... 
[12:20:21] (03CR) 10CI reject: [V: 04-1] KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 (owner: 10Alexandros Kosiaris) [12:20:27] (03CR) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [12:21:07] (03PS3) 10Ayounsi: Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) [12:21:46] (03PS32) 10Slyngshede: P:idm configure production IDM [puppet] - 10https://gerrit.wikimedia.org/r/889753 [12:22:23] (03CR) 10Nicolas Fraison: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 3 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39732/console" [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [12:22:51] (03CR) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [12:23:14] 10ops-eqiad, 10DC-Ops: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10jijiki) [12:24:10] 10ops-eqiad, 10DC-Ops, 10serviceops: Reset management module of mc1039 - https://phabricator.wikimedia.org/T330072 (10jijiki) [12:24:45] (03PS1) 10Jbond: netbox: force database connections to use TLS. 
[puppet] - 10https://gerrit.wikimedia.org/r/890421 (https://phabricator.wikimedia.org/T296452) [12:25:16] (03Abandoned) 10Nicolas Fraison: chore(presto): remove kerberos config on analytics_test_cluster role [puppet] - 10https://gerrit.wikimedia.org/r/889583 (owner: 10Nicolas Fraison) [12:25:18] (03PS1) 10Volans: alertmanager: add parent Alertmanager class [software/spicerack] - 10https://gerrit.wikimedia.org/r/890422 [12:26:23] (03CR) 10CI reject: [V: 04-1] Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:26:25] (03PS2) 10Alexandros Kosiaris: KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 [12:26:37] (03PS3) 10Nicolas Fraison: presto: add 5 nodes to the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889994 (https://phabricator.wikimedia.org/T329525) [12:26:39] (03PS3) 10Nicolas Fraison: presto: add last 5 nodes to prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889995 (https://phabricator.wikimedia.org/T329525) [12:27:43] (03CR) 10CI reject: [V: 04-1] KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 (owner: 10Alexandros Kosiaris) [12:27:45] (03CR) 10Jbond: [C: 03+1] "lgtm thx" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:29:56] (03CR) 10Volans: [C: 03+1] "LGTM, can also be easily reverted if needed." 
[puppet] - 10https://gerrit.wikimedia.org/r/890421 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:30:56] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2002.codfw.wmnet with reason: host reimage [12:33:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/spicerack] - 10https://gerrit.wikimedia.org/r/890422 (owner: 10Volans) [12:33:55] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp2002.codfw.wmnet with reason: host reimage [12:36:48] (03CR) 10Jbond: [C: 03+2] netbox: force database connections to use TLS. [puppet] - 10https://gerrit.wikimedia.org/r/890421 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:37:57] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/889753 (owner: 10Slyngshede) [12:43:00] (03PS4) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) [12:44:08] (03CR) 10Nicolas Fraison: [C: 03+2] presto: add 5 nodes to the prod cluster [puppet] - 10https://gerrit.wikimedia.org/r/889994 (https://phabricator.wikimedia.org/T329525) (owner: 10Nicolas Fraison) [12:45:41] jouncebot: nowandnext [12:45:41] No deployments scheduled for the next 1 hour(s) and 14 minute(s) [12:45:41] In 1 hour(s) and 14 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1400) [12:47:28] (03PS3) 10Jbond: netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) [12:49:15] !log switch netbox to active/active [12:49:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:21] (03CR) 10Jbond: [C: 03+2] netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/808199 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:50:15] (03CR) 
10Ayounsi: "I think the CI errors are due to a change of behavior in prospector." [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [12:50:58] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2002.codfw.wmnet with OS bullseye [12:51:19] (03CR) 10Jbond: [C: 03+2] netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/808198 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [12:57:02] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for Sergio Gimeno - https://phabricator.wikimedia.org/T330070 (10Reedy) [12:57:45] (JobUnavailable) resolved: Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:02:09] (03PS1) 10Jbond: Revert "netbox: update netbox service to active/active" [puppet] - 10https://gerrit.wikimedia.org/r/890372 [13:02:33] (03PS1) 10Jbond: Revert "netbox: update netbox so that its active/active" [dns] - 10https://gerrit.wikimedia.org/r/890373 [13:02:49] (03CR) 10Jbond: [C: 03+2] Revert "netbox: update netbox service to active/active" [puppet] - 10https://gerrit.wikimedia.org/r/890372 (owner: 10Jbond) [13:03:18] (03CR) 10Jbond: [C: 03+2] Revert "netbox: update netbox so that its active/active" [dns] - 10https://gerrit.wikimedia.org/r/890373 (owner: 10Jbond) [13:03:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] Revert "netbox: update netbox so that its active/active" [dns] - 10https://gerrit.wikimedia.org/r/890373 (owner: 10Jbond) [13:05:49] 10SRE, 10serviceops: Connecting to https://api.svc.codfw.wmnet/ does not work - https://phabricator.wikimedia.org/T285517 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert This has been resolved in the meantime. 
` cgoubert@cumin1001:~/cookbooks$ curl https://api.svc.eqiad.wmnet/ -vI 2>&1 | grep... [13:05:52] 10SRE, 10serviceops, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Clement_Goubert) [13:06:39] !log switch netbox to active/passive (had issues with active/active config) [13:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:10:25] 10SRE, 10serviceops, 10Datacenter-Switchover: Siteinfo timeout during switch datacenter - https://phabricator.wikimedia.org/T266618 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert This has been fixed in https://gerrit.wikimedia.org/r/c/operations/software/spicerack/+/779841 [13:11:42] last couple of deploys I've noticed `mwlog1002` being noticeably slow (especially when running `logspam-watch`). I'm not sure if `3.34, 3.09, 3.02` is expected load for the server? Should I log a ticket, or is that normal as far as anyone knows? [13:12:43] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1001.eqiad.wmnet with OS bullseye [13:16:49] TheresNoTime: There seems to have been a jump in disk IO since 13/02 https://grafana.wikimedia.org/goto/9Zs8IlJVz?orgId=1 [13:17:34] Not sure if that's normal, I'd ask #wikimedia-observability [13:18:54] ack thanks :) annoyingly no `iotop` on there.. [13:21:10] 10SRE, 10Observability-Logging, 10serviceops, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Clement_Goubert) @fgiunchedi Is this still relevant? Are there some specific steps to be taken for {T327920} ? [13:21:47] (03PS42) 10Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) [13:22:25] TheresNoTime: We can probably add it. 
iostat show 0% iowait despite the high disk usage [13:22:46] But it is regularly showing over 70% wrqm [13:25:47] (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39733/console" [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [13:26:01] based on htop DISK R/W it's demux.py [13:29:19] (looking too) [13:29:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:30:01] I'll leave you to it, I think it's mostly disk contention but there's no iowait [13:30:15] (thanks both!) [13:30:42] yeah I'm wondering what caused this [13:30:43] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&var-server=mwlog1002&var-datasource=thanos&var-cluster=misc&from=1676133530280&to=1676354890864 [13:34:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:43:35] TheresNoTime claime I think it is the md raid background check [13:43:41] [================>....] check = 83.6% (13070869632/15627202560) finish=2029.5min speed=20992K/sec [13:43:47] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [13:45:03] huh, odd [13:45:20] (or is it odd, I'm not sure what that is supposed to run.) 
[13:45:28] s/what/when [13:46:29] (03PS1) 10Muehlenhoff: Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) [13:46:52] (03CR) 10CI reject: [V: 04-1] Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [13:47:48] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [13:49:04] (03PS2) 10Muehlenhoff: Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) [13:50:37] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp2003.codfw.wmnet with OS bullseye [13:51:04] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1001.eqiad.wmnet with OS bullseye [13:51:30] (03CR) 10CI reject: [V: 04-1] Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [13:53:46] TheresNoTime: offhand me neither, do you mind opening a task to observability-logging tag so we don't lose track ? 
I'll followup there [13:53:54] Sure :) [13:54:17] thank you [13:55:40] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1001.eqiad.wmnet with OS bullseye [13:57:36] (03PS1) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [13:58:49] T330081 [13:58:50] T330081: High load/disk IO on mwlog1002 - https://phabricator.wikimedia.org/T330081 [14:00:04] (03PS1) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/890380 (https://phabricator.wikimedia.org/T296452) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1400). [14:00:05] Lucas_WMDE and cirno: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:08] Unrelated, I can deploy :D [14:00:11] (03PS1) 10Jbond: netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/890381 (https://phabricator.wikimedia.org/T296452) [14:00:17] hi [14:00:20] o/ [14:00:34] shall we start with yours cirno? 
[14:00:45] I'll need a few more minutes :) (sent from phone) [14:01:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890187 (https://phabricator.wikimedia.org/T330026) (owner: 10Stang) [14:01:00] TheresNoTime, ok [14:01:36] (03Merged) 10jenkins-bot: zhwiki(books|quote): Enable block feature for AbuseFilter [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890187 (https://phabricator.wikimedia.org/T330026) (owner: 10Stang) [14:01:44] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39735/console" [puppet] - 10https://gerrit.wikimedia.org/r/890381 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:01:54] !log samtar@deploy1002 Started scap: Backport for [[gerrit:890187|zhwiki(books|quote): Enable block feature for AbuseFilter (T330026)]] [14:01:58] T330026: Enable block feature for AbuseFilter on zhwiki(books|quote) - https://phabricator.wikimedia.org/T330026 [14:02:11] (03CR) 10Jbond: [V: 03+1 C: 03+2] netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/890381 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:02:55] (03PS3) 10Muehlenhoff: Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) [14:03:26] TheresNoTime: cheers [14:03:41] !log samtar@deploy1002 samtar and stang: Backport for [[gerrit:890187|zhwiki(books|quote): Enable block feature for AbuseFilter (T330026)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:03:44] cirno: live on mwdebug, can you test? 
[14:03:49] looking [14:04:11] (03CR) 10Jbond: [C: 03+2] netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/890380 (https://phabricator.wikimedia.org/T296452) (owner: 10Jbond) [14:04:16] TheresNoTime, tested on both sites and works as expected [14:04:21] syncing [14:04:28] * Lucas_WMDE here [14:04:56] Lucas_WMDE: did you want to self-serve after cir/no's is finished syncing or shall I do it? :) [14:05:24] either way is fine for me :) [14:05:29] (03CR) 10CI reject: [V: 04-1] Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [14:05:30] my change is a no-op anyway, just config cleanup [14:05:31] (03PS2) 10Samtar: Remove unused $wgLexemeEnableNewAlpha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890387 (https://phabricator.wikimedia.org/T307866) (owner: 10Lucas Werkmeister (WMDE)) [14:05:37] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp2003.codfw.wmnet with reason: host reimage [14:05:49] I'll put it through after then ^ [14:05:55] cool, thanks [14:05:55] * ^^ [14:06:27] PROBLEM - gdnsd checkconf on dns6002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:06:35] PROBLEM - gdnsd checkconf on dns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:07:39] RECOVERY - gdnsd checkconf on dns6002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:07:49] RECOVERY - gdnsd checkconf on dns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:07:49] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01136 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:08:39] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on mc-gp2003.codfw.wmnet with reason: host reimage [14:10:55] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:890187|zhwiki(books|quote): Enable block feature for AbuseFilter (T330026)]] (duration: 09m 00s) [14:10:58] T330026: Enable block feature for AbuseFilter on zhwiki(books|quote) - https://phabricator.wikimedia.org/T330026 [14:11:02] cirno: live :) [14:11:07] (03PS4) 10Muehlenhoff: Extend profile::nginx with support for new Nginx packaging layout [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) [14:11:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890387 (https://phabricator.wikimedia.org/T307866) (owner: 10Lucas Werkmeister (WMDE)) [14:11:38] (03PS1) 10Jbond: Revert "netbox: update netbox service to active/active" [puppet] - 10https://gerrit.wikimedia.org/r/890382 [14:11:56] (03PS1) 10Jbond: Revert "netbox: update netbox so that its active/active" [dns] - 10https://gerrit.wikimedia.org/r/890383 [14:12:11] (03Merged) 10jenkins-bot: Remove unused $wgLexemeEnableNewAlpha [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890387 (https://phabricator.wikimedia.org/T307866) (owner: 10Lucas Werkmeister (WMDE)) [14:12:25] (03CR) 10Jbond: [C: 03+2] Revert "netbox: update netbox service to active/active" [puppet] - 10https://gerrit.wikimedia.org/r/890382 (owner: 10Jbond) [14:12:26] !log samtar@deploy1002 Started scap: Backport for [[gerrit:890387|Remove unused $wgLexemeEnableNewAlpha (T307866)]] [14:12:30] T307866: Replace existing Special:NewLexeme page with the new one - https://phabricator.wikimedia.org/T307866 [14:12:43] (03PS2) 10Elukey: Replace underscores with hyphens in ml-serve's etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/889661 (https://phabricator.wikimedia.org/T324542) [14:13:27] (03CR) 10Jbond: [C: 03+2] Revert "netbox: update netbox so that its 
active/active" [dns] - 10https://gerrit.wikimedia.org/r/890383 (owner: 10Jbond) [14:14:06] (03PS1) 10Volans: setup.py: remove support for Python 3.7 and 3.8 [software/homer] - 10https://gerrit.wikimedia.org/r/890435 [14:14:07] !log samtar@deploy1002 lucaswerkmeister-wmde and samtar: Backport for [[gerrit:890387|Remove unused $wgLexemeEnableNewAlpha (T307866)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:14:10] Lucas_WMDE: nothing you need to check, that variable is unused yeah? Will sync [14:14:15] yeah [14:14:25] syncing [14:15:08] oops, grep -rlF finds one more use – in IS-labs [14:15:12] probably shoulda removed that at the same time eh [14:15:17] (03CR) 10Volans: Allow different port than default 22 (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [14:15:20] * Lucas_WMDE goes to make another patch [14:15:34] :D [14:15:44] (03PS1) 10Matthias Mullie: [SearchVue] Add wgQuickViewDataRepositoryApiBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890436 (https://phabricator.wikimedia.org/T307085) [14:16:05] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused $wgLexemeEnableNewAlpha from IS-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890437 [14:16:07] ^ [14:16:25] (I left out the Bug: line for this one, no need to spam the task again) [14:16:42] PROBLEM - gdnsd checkconf on dns6002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:16:52] PROBLEM - gdnsd checkconf on authdns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:16:52] PROBLEM - gdnsd checkconf on dns4004 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:16:54] PROBLEM - gdnsd checkconf on dns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure 
https://wikitech.wikimedia.org/wiki/gdnsd [14:17:08] PROBLEM - gdnsd checkconf on dns5003 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:17:36] PROBLEM - gdnsd checkconf on dns1002 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:17:36] PROBLEM - gdnsd checkconf on dns3001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:17:36] PROBLEM - gdnsd checkconf on dns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure https://wikitech.wikimedia.org/wiki/gdnsd [14:17:58] RECOVERY - gdnsd checkconf on dns6002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:08] PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.01235 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:18:08] RECOVERY - gdnsd checkconf on authdns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:08] RECOVERY - gdnsd checkconf on dns4004 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:10] RECOVERY - gdnsd checkconf on dns1001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:24] RECOVERY - gdnsd checkconf on dns5003 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:50] RECOVERY - gdnsd checkconf on dns3001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:50] RECOVERY - gdnsd checkconf on dns1002 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:18:50] RECOVERY - gdnsd checkconf on dns2001 is OK: OK: gdnsd -S checkconf success https://wikitech.wikimedia.org/wiki/gdnsd [14:20:11] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:890387|Remove unused $wgLexemeEnableNewAlpha (T307866)]] (duration: 07m 44s) [14:20:15] 
T307866: Replace existing Special:NewLexeme page with the new one - https://phabricator.wikimedia.org/T307866 [14:20:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890437 (owner: 10Lucas Werkmeister (WMDE)) [14:20:32] thanks ^^ [14:20:43] jbond: I take it the gdnsd / puppet failures were due to you change and the puppet alert will recover ? [14:20:58] (03Merged) 10jenkins-bot: Remove unused $wgLexemeEnableNewAlpha from IS-labs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890437 (owner: 10Lucas Werkmeister (WMDE)) [14:21:23] Lucas_WMDE: all done :) [14:21:29] yay [14:21:37] oh right, it skips the sync for that doesn’t it [14:21:43] I might have one more change to deploy [14:21:51] (currently being discussed if it should be deployed yet or not ^^) [14:21:55] I can also do it myself later [14:21:59] sure thing, I [14:22:06] *I'm around for another 20 minutes or so [14:22:09] ok [14:22:55] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netbox: Issues converting services from active/passive to active/active - https://phabricator.wikimedia.org/T330084 (10jbond) [14:22:59] (03PS1) 10Jbond: netbox: update netbox so that its active/active [dns] - 10https://gerrit.wikimedia.org/r/890384 (https://phabricator.wikimedia.org/T296452) [14:23:20] (03CR) 10Elukey: [C: 03+1] "Nice split I like it!" [software/spicerack] - 10https://gerrit.wikimedia.org/r/890422 (owner: 10Volans) [14:23:23] Nothing currently being deployed AAUI - mind if I merge https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/890436 (beta-only config) real quick? 
[14:23:33] (03PS1) 10Jbond: netbox: update netbox service to active/active [puppet] - 10https://gerrit.wikimedia.org/r/890385 (https://phabricator.wikimedia.org/T296452) [14:23:34] matthiasmullie: go ahead [14:23:35] (03PS2) 10Matthias Mullie: [SearchVue] Add wgQuickViewDataRepositoryApiBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890436 (https://phabricator.wikimedia.org/T307085) [14:23:41] matthiasmullie: fine as far as I’m concerned :) [14:24:02] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890436 (https://phabricator.wikimedia.org/T307085) (owner: 10Matthias Mullie) [14:24:38] (03Merged) 10jenkins-bot: [SearchVue] Add wgQuickViewDataRepositoryApiBaseUri [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890436 (https://phabricator.wikimedia.org/T307085) (owner: 10Matthias Mullie) [14:24:59] Done, thanks! [14:25:10] RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005432 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [14:25:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp2003.codfw.wmnet with OS bullseye [14:25:31] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/890432 (https://phabricator.wikimedia.org/T329529) (owner: 10Muehlenhoff) [14:25:42] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netbox, 10Patch-For-Review: Issues converting services from active/passive to active/active - https://phabricator.wikimedia.org/T330084 (10jbond) p:05Triage→03High [14:26:13] (03PS3) 10Elukey: Replace underscores with hyphens in ml-serve's etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/889661 (https://phabricator.wikimedia.org/T324542) [14:27:47] alright, i have another config change to deploy after all [14:27:49] just added it to the calendar: 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/885422/ [14:27:54] TheresNoTime: should I deploy or do you want to? [14:28:08] Lucas_WMDE: I can, already logged in :) [14:28:17] I’m logged in too :P [14:28:25] but ok, thanks :) [14:28:25] Lucas_WMDE: go for it :) [14:28:28] ok! [14:28:36] (my grep was for all of /srv/mediawiki-staging, to cover the extensions too) [14:28:38] hah, yes, you deploy :) [14:28:44] (03PS4) 10Lucas Werkmeister (WMDE): Enable WIP Wikibase REST API routes on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [14:28:46] (03PS1) 10Volans: sre.hosts.provision: add PowerEdge R650xs [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) [14:29:01] 10SRE, 10Observability-Logging, 10serviceops, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10fgiunchedi) >>! In T261274#8629518, @Clement_Goubert wrote: > @fgiunchedi Is this still relevant? Are there some specific steps to be taken for {... [14:29:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [14:29:23] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, and 2 others: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Volans) >>! In T326342#8625780, @Papaul wrote: > @Volans fyi the 3 db nodes above are R650xs just receives those. We worked already on 1 R650 in the pass. On the 650xs provisi... 
[14:29:49] (03Merged) 10jenkins-bot: Enable WIP Wikibase REST API routes on beta wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/885422 (https://phabricator.wikimedia.org/T326313) (owner: 10Ollie Shotton) [14:30:04] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:885422|Enable WIP Wikibase REST API routes on beta wikidata (T326313)]] [14:30:08] T326313: Enable in-progress/in-development Wikibase REST API routes on Beta Wikidata - https://phabricator.wikimedia.org/T326313 [14:30:40] (03CR) 10Klausman: [C: 03+1] Replace underscores with hyphens in ml-serve's etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/889661 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [14:31:16] (03CR) 10Volans: [C: 03+2] alertmanager: add parent Alertmanager class [software/spicerack] - 10https://gerrit.wikimedia.org/r/890422 (owner: 10Volans) [14:31:42] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and ollieshotton: Backport for [[gerrit:885422|Enable WIP Wikibase REST API routes on beta wikidata (T326313)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [14:32:12] https://www.wikidata.org/w/rest.php/wikibase/v0/entities/items/Q42/labels is still a non-existing route on mwdebug, so that’s good [14:32:20] (it should become available on beta once the next update happens there) [14:32:29] syncing [14:34:51] (03Merged) 10jenkins-bot: alertmanager: add parent Alertmanager class [software/spicerack] - 10https://gerrit.wikimedia.org/r/890422 (owner: 10Volans) [14:35:11] (03CR) 10Klausman: [C: 03+2] Replace underscores with hyphens in ml-serve's etcd SRV records [dns] - 10https://gerrit.wikimedia.org/r/889661 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [14:35:47] (03PS2) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [14:36:14] 
(03CR) 10Klausman: [C: 03+2] role::etcd::v3::ml_etcd: use PKI TLS certs [puppet] - 10https://gerrit.wikimedia.org/r/889663 (https://phabricator.wikimedia.org/T324542) (owner: 10Elukey) [14:36:43] (03CR) 10Elukey: [C: 03+2] Update istio specs for ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/890235 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [14:38:16] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:885422|Enable WIP Wikibase REST API routes on beta wikidata (T326313)]] (duration: 08m 12s) [14:38:21] T326313: Enable in-progress/in-development Wikibase REST API routes on Beta Wikidata - https://phabricator.wikimedia.org/T326313 [14:38:32] I think that’s it from me :) [14:38:37] anything else to deploy? [14:40:37] (03PS1) 10Muehlenhoff: Switch role::puppetdb to Nginx custom flavour [puppet] - 10https://gerrit.wikimedia.org/r/890439 (https://phabricator.wikimedia.org/T321783) [14:40:47] Lucas_WMDE: nothing here, but if you have a sec — any idea why this `specialPageAliases` isn't working? https://gerrit.wikimedia.org/r/plugins/gitiles/mediawiki/extensions/PageAssessments/+/refs/heads/wmf/1.40.0-wmf.23/PageAssessments.i18n.alias.php#27 ref T328224 [14:40:48] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [14:40:58] * Lucas_WMDE looks [14:41:02] `'PageAssessents' => [ 'पृष्ठ_मूल्याङ्कन' ],` [14:41:06] missing *m*? [14:41:07] assessents [14:41:13] -.- [14:41:18] (: [14:41:22] * TheresNoTime screams [14:41:33] well I'm going to fix that and backport, so gimme a sec.. 
[14:41:38] sure [14:41:47] * Lucas_WMDE refrains from logging the done message [14:43:30] (03PS3) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [14:43:58] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:44:15] Lucas_WMDE: can I get a quick +1 on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageAssessments/+/890440 ? [14:44:19] * Lucas_WMDE is now imagining pippin, a page of denethor, assessing ents like treebeard and quickbeam [14:44:38] I made it a +2 since that’s the master change and not the backport ^^ [14:44:54] o7 [14:45:58] I can't believe I didn't see that :) [14:46:21] 10SRE, 10Infrastructure-Foundations, 10Traffic, 10netbox, 10Patch-For-Review: Issues converting services from active/passive to active/active - https://phabricator.wikimedia.org/T330084 (10jbond) p:05High→03Medium lowering priority @Vgutierrez confirmed there are no immediate issues with dns. They a... [14:46:56] (03PS1) 10Samtar: PageAssessments.i18n.alias.php: Fix spelling mistake [extensions/PageAssessments] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890447 (https://phabricator.wikimedia.org/T328224) [14:48:03] Lucas_WMDE: +1 on https://gerrit.wikimedia.org/r/c/mediawiki/extensions/PageAssessments/+/890447 just for safety and I'll self-serve the backport :) [14:48:17] 10SRE, 10Observability-Logging, 10serviceops, 10Datacenter-Switchover: Figure out switchover steps for mwlog hosts - https://phabricator.wikimedia.org/T261274 (10Clement_Goubert) >>! In T261274#8629767, @fgiunchedi wrote: >>>! In T261274#8629518, @Clement_Goubert wrote: >> @fgiunchedi Is this still relevan... 
[14:48:20] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] PageAssessments.i18n.alias.php: Fix spelling mistake [extensions/PageAssessments] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890447 (https://phabricator.wikimedia.org/T328224) (owner: 10Samtar) [14:48:29] done [14:48:48] ta [14:48:57] oh nice, gate-and-submit doesn’t take ages in this extension [14:48:58] (KubernetesAPILatency) resolved: (12) High Kubernetes API latency (GET configmaps) on k8s-mlserve@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:49:03] the main change already merged \o/ [14:49:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy1002 using scap backport" [extensions/PageAssessments] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890447 (https://phabricator.wikimedia.org/T328224) (owner: 10Samtar) [14:49:21] (03PS4) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [14:49:42] (03CR) 10Jbond: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) (owner: 10Volans) [14:51:03] (03Merged) 10jenkins-bot: PageAssessments.i18n.alias.php: Fix spelling mistake [extensions/PageAssessments] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890447 (https://phabricator.wikimedia.org/T328224) (owner: 10Samtar) [14:51:21] !log samtar@deploy1002 Started scap: Backport for [[gerrit:890447|PageAssessments.i18n.alias.php: Fix spelling mistake (T328224)]] [14:51:26] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [14:51:54] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1001.eqiad.wmnet with OS bullseye [14:52:10] (03CR) 10Ayounsi: [C: 03+1] setup.py: remove support for 
Python 3.7 and 3.8 [software/homer] - 10https://gerrit.wikimedia.org/r/890435 (owner: 10Volans) [14:52:44] (03CR) 10Volans: [C: 03+2] setup.py: remove support for Python 3.7 and 3.8 [software/homer] - 10https://gerrit.wikimedia.org/r/890435 (owner: 10Volans) [14:54:35] (03Merged) 10jenkins-bot: setup.py: remove support for Python 3.7 and 3.8 [software/homer] - 10https://gerrit.wikimedia.org/r/890435 (owner: 10Volans) [14:55:15] oof `build-and-push-container-images` is taking a mo.. [14:57:24] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/890441 [14:58:34] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1001.eqiad.wmnet with OS bullseye [15:03:22] !log UTC afternoon backport window overrunning [15:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:32] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/890441 (owner: 10Volans) [15:03:37] oof [15:04:24] !log samtar@deploy1002 samtar: Backport for [[gerrit:890447|PageAssessments.i18n.alias.php: Fix spelling mistake (T328224)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [15:04:28] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [15:04:28] testing [15:04:51] works, syncing [15:05:17] `Finished sync-testservers (duration: 02m 26s)` is a lil' slow o.O [15:05:40] :| [15:05:53] it was just 00m 006s for me earlier [15:06:40] yeah same for me earlier (: [15:07:02] * TheresNoTime blames Lucas_WMDE [15:07:06] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.2.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/890441 (owner: 10Volans) [15:07:48] :( [15:08:05] * Lucas_WMDE runs `w` [15:08:13] there’s plenty of other people on the server, blame them instead! 
[15:08:17] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:08:22] >:D [15:08:33] so many tmuxen [15:09:26] haha, the `who` output is funnier because it doesn’t truncate the user name [15:09:41] everyone else fits within eight characters, and then there’s this `lucaswerkmeister-wmde` asshole [15:09:59] that's your shell username?? [15:10:03] oooooooof. [15:10:04] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:10:34] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:11:03] (03PS5) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [15:11:17] >:D [15:11:35] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/890409 (owner: 10L10n-bot) [15:11:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.269 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:11:50] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:13:12] Lucas_WMDE: "Please provide your login22" [15:13:25] !log samtar@deploy1002 Finished scap: Backport for [[gerrit:890447|PageAssessments.i18n.alias.php: Fix spelling mistake (T328224)]] (duration: 22m 03s) [15:13:29] T328224: Deploy PageAssessments to Nepali Wikipedia - https://phabricator.wikimedia.org/T328224 [15:13:37] phew, done, checked live [15:13:49] !log closing UTC afternoon backport window [15:13:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:41] (03PS6) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 
(https://phabricator.wikimedia.org/T329930) [15:15:05] yay [15:16:16] !log elukey@cumin1001 START - Cookbook sre.hosts.reboot-single for host ml-staging2001.codfw.wmnet [15:16:49] (03PS1) 10Volans: Upstream release v6.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/890444 [15:18:27] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:19:09] (03PS7) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [15:19:38] (03PS1) 10Alexandros Kosiaris: mathoid: Switch to using node instead of nodejs [deployment-charts] - 10https://gerrit.wikimedia.org/r/890466 (https://phabricator.wikimedia.org/T311620) [15:22:01] (03PS8) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [15:22:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:22:45] (JobUnavailable) firing: Reduced availability for job k8s-pods in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:22:54] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:23:30] (03CR) 10Volans: [C: 03+2] Upstream release v6.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/890444 (owner: 10Volans) [15:24:31] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[15:25:44] (03PS2) 10Clément Goubert: sre.switchdc.mediawiki: Add mw-on-k8s services [cookbooks] - 10https://gerrit.wikimedia.org/r/889801 (https://phabricator.wikimedia.org/T327924) [15:26:05] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ml-staging2001.codfw.wmnet [15:26:07] (03PS14) 10Clément Goubert: sre.switchdc.services: import service exclusions [cookbooks] - 10https://gerrit.wikimedia.org/r/888213 (https://phabricator.wikimedia.org/T329193) [15:26:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] mathoid: Switch to using node instead of nodejs [deployment-charts] - 10https://gerrit.wikimedia.org/r/890466 (https://phabricator.wikimedia.org/T311620) (owner: 10Alexandros Kosiaris) [15:27:41] (03Merged) 10jenkins-bot: Upstream release v6.2.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/890444 (owner: 10Volans) [15:27:45] (JobUnavailable) resolved: Reduced availability for job k8s-pods in k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:27:50] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:28:07] (03PS1) 10Ilias Sarantopoulos: ml-services: outlink model upgrade debian and python [deployment-charts] - 10https://gerrit.wikimedia.org/r/890471 (https://phabricator.wikimedia.org/T328438) [15:28:20] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:29:41] (03PS4) 10Ayounsi: Allow different port than default 22 [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) [15:32:16] (03Merged) 10jenkins-bot: mathoid: Switch to using node instead of nodejs [deployment-charts] - 10https://gerrit.wikimedia.org/r/890466 
(https://phabricator.wikimedia.org/T311620) (owner: 10Alexandros Kosiaris) [15:33:43] (03CR) 10Jbond: gitlab: allow rsync between replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: 10Jelto) [15:34:41] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [15:34:54] (03CR) 10Ayounsi: Allow different port than default 22 (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/890394 (https://phabricator.wikimedia.org/T277438) (owner: 10Ayounsi) [15:36:21] (03PS9) 10Jelto: gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) [15:38:13] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39743/console" [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: 10Jelto) [15:38:40] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:38:46] (03CR) 10Volans: "reply inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) (owner: 10Volans) [15:39:12] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:39:54] (03PS1) 10Elukey: sre.k8s.upgrade-cluster: fix alertmanager downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/890475 (https://phabricator.wikimedia.org/T327767) [15:41:15] 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Juniper structured logs - https://phabricator.wikimedia.org/T250703 (10ayounsi) 05Open→03Declined The need never was very strong, and that 
would be a pain to integrate with ECS. > The good news is that at least 1 person in the... [15:41:30] (03PS2) 10Elukey: sre.k8s.{upgrade-cluster,wipe-cluster}: fix alertmanager downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/890475 (https://phabricator.wikimedia.org/T327767) [15:41:33] (03CR) 10Volans: [C: 03+1] "LGTM, I'll let you know once the new spicerack is deployed." [cookbooks] - 10https://gerrit.wikimedia.org/r/890475 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [15:41:36] (03CR) 10Atieno: [V: 03+2 C: 03+2] Update README file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/885997 (owner: 10Hokwelum) [15:43:22] (03CR) 10Jelto: [V: 03+1] gitlab: allow rsync between replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: 10Jelto) [15:43:29] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [15:44:10] (03PS4) 10Hashar: contint: regroup common firewalling rules [puppet] - 10https://gerrit.wikimedia.org/r/887738 (https://phabricator.wikimedia.org/T329056) [15:45:21] (03CR) 10Jelto: "what about insetup::serviceopscollab?" [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [15:45:30] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10WMDE_Norman) [15:45:52] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:46:24] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [15:50:03] (03CR) 10Clément Goubert: site: differentiate between both serviceops teams for insetup roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890014 (owner: 10Dzahn) [15:53:38] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[15:53:39] (03PS1) 10Jbond: dnsquery: Add dnsquery module to vendor_modules [puppet] - 10https://gerrit.wikimedia.org/r/890476 [15:53:41] (03PS1) 10Jbond: wmflib::dns_lookup: switch to dnsquery::lookup [puppet] - 10https://gerrit.wikimedia.org/r/890477 [15:53:43] (03PS1) 10Jbond: pre-commit: update hook [puppet] - 10https://gerrit.wikimedia.org/r/890478 [15:53:57] (03PS2) 10Jbond: pre-commit: update hook [puppet] - 10https://gerrit.wikimedia.org/r/890478 [15:54:04] (03PS1) 10Volans: tests: revert removal of mocked DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/890479 [15:54:48] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1001.eqiad.wmnet with OS bullseye [15:55:27] (03CR) 10Jbond: [C: 03+1] tests: revert removal of mocked DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/890479 (owner: 10Volans) [15:56:14] (03CR) 10EoghanGaffney: [C: 03+2] gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: 10Jelto) [15:57:12] !log akosiaris@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: sync [15:57:37] !log akosiaris@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: sync [15:58:30] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:58:33] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: allow rsync between replicas [puppet] - 10https://gerrit.wikimedia.org/r/890434 (https://phabricator.wikimedia.org/T329930) (owner: 10Jelto) [15:58:58] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [15:59:31] (03CR) 10Volans: [C: 03+2] tests: revert removal 
of mocked DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/890479 (owner: 10Volans) [16:01:43] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde and ldap/nda for Schwirz - https://phabricator.wikimedia.org/T330095 (10WMDE_Norman) [16:03:11] (03Merged) 10jenkins-bot: tests: revert removal of mocked DNS resolver [software/spicerack] - 10https://gerrit.wikimedia.org/r/890479 (owner: 10Volans) [16:05:42] (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/890480 [16:06:03] (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/890480 (owner: 10Volans) [16:08:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:09:24] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v6.2.1 [software/spicerack] - 10https://gerrit.wikimedia.org/r/890480 (owner: 10Volans) [16:09:41] !log nfraison@cumin1001 START - Cookbook sre.hosts.reimage for host an-presto1001.eqiad.wmnet with OS bullseye [16:11:19] (03PS1) 10Volans: Upstream release v6.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/890481 [16:11:27] (03CR) 10Volans: [C: 03+2] Upstream release v6.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/890481 (owner: 10Volans) [16:11:38] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:12:25] (03PS1) 10KartikMistry: Section Translation: Fix language code for Cantonese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890482 (https://phabricator.wikimedia.org/T304865) [16:13:00] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:14:45] (03CR) 10Sushrith Bogi: Reduce height of the article toolbar (031 comment) [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/890140 (https://phabricator.wikimedia.org/T316950) (owner: 10Sushrith Bogi) [16:14:54] (03Merged) 10jenkins-bot: Upstream release v6.2.1 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/890481 (owner: 10Volans) [16:16:06] (03PS1) 10Jbond: apereo_cas: Add missing docs and fix lint issues [puppet] - 10https://gerrit.wikimedia.org/r/890483 [16:16:08] (03PS1) 10Jbond: apereo_cas: update to use dnsquery functions for lookups [puppet] - 10https://gerrit.wikimedia.org/r/890484 [16:16:29] (03CR) 10Jbond: [C: 03+2] pre-commit: update hook [puppet] - 10https://gerrit.wikimedia.org/r/890478 (owner: 10Jbond) [16:17:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39744/console" [puppet] - 10https://gerrit.wikimedia.org/r/890484 (owner: 10Jbond) [16:18:06] (03CR) 10Volans: [C: 03+2] sre.hosts.provision: add PowerEdge R650xs [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) (owner: 10Volans) [16:18:10] (03PS2) 10Volans: sre.hosts.provision: add PowerEdge R650xs [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) [16:18:53] 10SRE, 10ops-codfw, 10DBA, 10DC-Ops, and 2 others: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (10Volans) I've merged the above patch. @Papaul could you please re-run the provision on those hosts and see if that works and for the new hosts if that fixes the issue? Thanks. [16:18:57] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[16:20:42] !log uploaded spicerack_6.2.1 to apt.wikimedia.org bullseye-wikimedia [16:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:41] (03PS2) 10Jbond: apereo_cas: update to use dnsquery functions for lookups [puppet] - 10https://gerrit.wikimedia.org/r/890484 [16:22:42] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39745/console" [puppet] - 10https://gerrit.wikimedia.org/r/890484 (owner: 10Jbond) [16:23:18] (03PS1) 10Elukey: knative-serving: add missing labels to the autoscaler's deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/890485 (https://phabricator.wikimedia.org/T327767) [16:23:52] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:24:20] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:25:00] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v6.2.1 [16:25:15] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on cumin2002.codfw.wmnet with reason: test spicerack v6.2.1 [16:28:45] !log nfraison@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-presto1001.eqiad.wmnet with OS bullseye [16:28:56] (03CR) 10Elukey: "Ip ranges look good, I left a note about PKI and etcd." 
[puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [16:29:30] (03CR) 10Elukey: [C: 03+2] knative-serving: add missing labels to the autoscaler's deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/890485 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:30:05] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1630). [16:31:35] (03PS1) 10Muehlenhoff: Remove access for cmacholan [puppet] - 10https://gerrit.wikimedia.org/r/890487 [16:31:39] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:33:22] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:34:40] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [16:35:13] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for cmacholan [puppet] - 10https://gerrit.wikimedia.org/r/890487 (owner: 10Muehlenhoff) [16:36:18] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Christina Macholan out of all services on: 1069 hosts [16:36:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Christina Macholan out of all services on: 1069 hosts [16:38:04] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Christina Macholan out of all services on: 943 hosts [16:38:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Christina Macholan out of all services on: 943 hosts [16:39:15] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: wait_for_optimal() should ignore acked alerts - https://phabricator.wikimedia.org/T319277 (10Volans) Spicerack v6.2.1 ships with this new feature, and it works as expected. 
One small improvements that we should add below. I think we should add support f... [16:40:49] !log upgraded spicerack to v6.2.1 to the cumin hosts [16:40:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:48] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:42:58] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: spicerack dnsdisc.Discovery attempts to query depooled/disabled dns auth servers - https://phabricator.wikimedia.org/T329773 (10Volans) 05Open→03Resolved p:05Triage→03Medium a:03Volans Spicerack v6.2.1 was deployed with the above fix (see [[ h... [16:43:45] (03CR) 10Volans: [C: 03+1] "LGTM, spicearck v6.2.1 has been deployed to the cumin hosts. LMK if all works as expected." [cookbooks] - 10https://gerrit.wikimedia.org/r/890475 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [16:45:36] ^ nitpick [16:46:04] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:46:38] question_mark: lol [16:47:24] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:48:55] !log akosiaris@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: sync [16:49:22] !log akosiaris@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: sync [16:49:41] !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: sync [16:50:13] !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: sync [16:51:58] (03PS1) 10Nicolas Fraison: kafka-jumbo: reduce min size of root partition [puppet] - 10https://gerrit.wikimedia.org/r/890488 [16:52:15] (03PS1) 10Majavah: alerts: Allow customizing the git repository info [puppet] - 
10https://gerrit.wikimedia.org/r/890489 (https://phabricator.wikimedia.org/T304716) [16:52:18] (03PS1) 10Majavah: P:toolforge::prometheus: deploy alert rules from GitLab [puppet] - 10https://gerrit.wikimedia.org/r/890490 (https://phabricator.wikimedia.org/T304716) [16:54:51] (03PS1) 10PipelineBot: mathoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/890350 [16:57:09] (03PS2) 10Majavah: P:toolforge::prometheus: deploy alert rules from GitLab [puppet] - 10https://gerrit.wikimedia.org/r/890490 (https://phabricator.wikimedia.org/T284860) [16:57:35] (03PS2) 10Nicolas Fraison: kafka-jumbo: reduce min size of root partition [puppet] - 10https://gerrit.wikimedia.org/r/890488 [17:00:08] (03PS3) 10Nicolas Fraison: kafka-jumbo: reduce min size of root partition [puppet] - 10https://gerrit.wikimedia.org/r/890488 (https://phabricator.wikimedia.org/T329361) [17:16:12] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [17:18:14] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:18:20] (03CR) 10Elukey: [C: 03+2] sre.k8s.{upgrade-cluster,wipe-cluster}: fix alertmanager downtimes [cookbooks] - 10https://gerrit.wikimedia.org/r/890475 (https://phabricator.wikimedia.org/T327767) (owner: 10Elukey) [17:18:42] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:23:53] (03PS3) 10Volans: sre.hosts.provision: add PowerEdge R650xs [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) [17:25:31] (03CR) 10CI reject: [V: 04-1] sre.hosts.provision: add PowerEdge R650xs [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) (owner: 10Volans) [17:26:22] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[17:27:56] (03PS4) 10JMeybohm: wikikube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) [17:28:46] (03CR) 10JMeybohm: wikikube: Update cluster settings for k8s 1.23 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [17:29:03] (03CR) 10Elukey: wikikube: Update cluster settings for k8s 1.23 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [17:29:06] (03PS3) 10Volans: sre.switchdc.mediawiki: Remove ACTIVE_ACTIVE_SECTIONS [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [17:29:22] (03CR) 10Elukey: [C: 03+1] wikikube: Update cluster settings for k8s 1.23 [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [17:30:33] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39746/console" [puppet] - 10https://gerrit.wikimedia.org/r/890390 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [17:30:52] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:31:22] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference-staging_30443: Servers ml-staging2002.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [17:31:23] (03CR) 10Elukey: [C: 03+1] admin_ng: Update wikikube-codfw settings to k8s 1.23 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/890392 (https://phabricator.wikimedia.org/T326617) (owner: 10JMeybohm) [17:31:32] (03CR) 10Volans: [C: 03+2] "Merging as with 
the newest spicerack this fails CI for other cookbooks CRs." [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [17:33:20] (03Merged) 10jenkins-bot: sre.switchdc.mediawiki: Remove ACTIVE_ACTIVE_SECTIONS [cookbooks] - 10https://gerrit.wikimedia.org/r/889777 (https://phabricator.wikimedia.org/T329533) (owner: 10Clément Goubert) [17:33:36] (03PS4) 10Volans: sre.hosts.provision: add PowerEdge R650xs [cookbooks] - 10https://gerrit.wikimedia.org/r/890438 (https://phabricator.wikimedia.org/T326342) [17:44:02] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [17:46:28] (03PS3) 10Alexandros Kosiaris: KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 [17:47:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 (owner: 10Alexandros Kosiaris) [17:48:03] 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (10phaultfinder) [17:48:52] (03Merged) 10jenkins-bot: KubeletOperationalLatency: Bump to 1s [alerts] - 10https://gerrit.wikimedia.org/r/890410 (owner: 10Alexandros Kosiaris) [17:55:29] (03PS2) 10Jbond: admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [17:55:54] (03CR) 10Majavah: admin: add a test to prevent duplicates in users/ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1800) [18:00:05] ryankemper: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T1800). 
[18:01:12] (03CR) 10Jbond: [C: 03+2] admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:02:10] (03CR) 10Majavah: [C: 04-1] admin: add a test to prevent duplicates in users/ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:09:50] (03PS1) 10Arlolra: Remove wgLinterSubmitterWhitelist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890496 (https://phabricator.wikimedia.org/T329992) [18:11:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P44690 and previous config saved to /var/cache/conftool/dbconfig/20230220-181144-root.json [18:22:36] (03PS3) 10Jbond: admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:22:38] (03PS1) 10Jbond: admin: remove users mmarble/marble [puppet] - 10https://gerrit.wikimedia.org/r/890497 [18:23:21] (03CR) 10CI reject: [V: 04-1] admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P44691 and previous config saved to /var/cache/conftool/dbconfig/20230220-182649-root.json [18:28:25] (03CR) 10Jbond: "thanks see response inline" [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:28:35] (03PS4) 10Jbond: admin: add a test to prevent duplicates in users/ldap_only_users [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:35:19] (03CR) 10Majavah: [C: 03+1] admin: add a test to prevent duplicates in users/ldap_only_users (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:41:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'db2165 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P44692 and previous config saved to /var/cache/conftool/dbconfig/20230220-184154-root.json [18:46:01] (03CR) 10Jbond: [C: 03+2] "great and as always thanks for the contribution :) <3" [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:47:03] anyone around who'd like to start a maintenance script for me? (if not, i'll schedule it properly) [18:47:35] (same as https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230215T1400 but for itwiki) [18:48:34] (03CR) 10Jbond: "on second thoughts will wait until tomorrow before merging this want to ping moritz view on the preceding change" [puppet] - 10https://gerrit.wikimedia.org/r/879421 (owner: 10Majavah) [18:49:01] MatmaRex: hi, sure [18:49:13] yay, thanks [18:50:17] !log taavi@mwmaint1002:~$ mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki itwiki --current --all | tee T315510-itwiki.log # T315510 [18:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:21] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510 [18:50:26] it's running [18:50:42] thanks taavi [18:56:49] MatmaRex: it logged `Error 1062: Duplicate entry '3929747-119259718' for key 'itr_itemid_id_revision_id'`, but seems to be continuing after that [18:56:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db2165 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P44693 and previous config saved to /var/cache/conftool/dbconfig/20230220-185659-root.json [18:58:01] taavi: interesting, but it shouldn't be a big deal [18:58:47] taavi: the script ran on the wiki previously and was terminated, it should be able to "resume", but maybe there's some edge case with incomplete data [18:59:24] 10SRE, 10serviceops, 10CommRel-Specialists-Support (Jan-Mar-2023), 10Datacenter-Switchover: 
CommRel support for March 2023 Datacenter Switchover - https://phabricator.wikimedia.org/T328287 (10RZamora-WMF) [19:24:37] (03PS1) 10Majavah: kubeadm: drop wmcs-k8s-secret-for-cert [puppet] - 10https://gerrit.wikimedia.org/r/890501 (https://phabricator.wikimedia.org/T292238) [19:24:39] (03PS1) 10Majavah: kubeadm: update wmcs-k8s-get-cert for certificates/v1 [puppet] - 10https://gerrit.wikimedia.org/r/890502 (https://phabricator.wikimedia.org/T292238) [20:23:00] (03CR) 10Majavah: [C: 04-1] "Holding this until jobs-api has been updated" [puppet] - 10https://gerrit.wikimedia.org/r/890502 (https://phabricator.wikimedia.org/T292238) (owner: 10Majavah) [20:25:33] 10SRE-swift-storage, 10Commons, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-File-management, and 2 others: MediaWiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (10Krinkle) [20:26:56] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc-gp1002.eqiad.wmnet with OS bullseye [20:30:00] 10SRE-swift-storage, 10Commons, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-File-management, 10Performance-Team: MediaWiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (10Krinkle) 05Open→03Declined I'm triaging this as as task in #Media... [20:38:47] 10SRE-swift-storage, 10Commons, 10MediaWiki-Core-HTTP-Cache, 10MediaWiki-File-management, 10Performance-Team: MediaWiki sometimes displays old image revision despite purge and hard refresh - https://phabricator.wikimedia.org/T317481 (10Novem_Linguae) action=purge didn't work for me at the time. I'll boo... 
[20:39:20] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc-gp1002.eqiad.wmnet with reason: host reimage [20:39:48] (03PS1) 10Bartosz Dziewoński: Revert "Try to prevent selections inside ref/template nodes on Firefox" [VisualEditor/VisualEditor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890456 (https://phabricator.wikimedia.org/T329983) [20:41:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc-gp1002.eqiad.wmnet with reason: host reimage [20:42:08] (03PS5) 10Superpes15: [tawiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890198 (https://phabricator.wikimedia.org/T329248) [20:42:23] (03PS6) 10Superpes15: [tawiki] Add Draft and Draft_talk namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/890198 (https://phabricator.wikimedia.org/T329248) [20:43:04] (03PS1) 10Bartosz Dziewoński: Update VE core submodule to f2528875026a [extensions/VisualEditor] (wmf/1.40.0-wmf.23) - 10https://gerrit.wikimedia.org/r/890504 (https://phabricator.wikimedia.org/T329983) [20:58:22] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc-gp1002.eqiad.wmnet with OS bullseye [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T2100). [21:00:05] Superpes and MatmaRex: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:12] Hi :)
[21:00:20] I can deploy
[21:00:29] hello
[21:00:31] zabe: go for it :)
[21:01:06] (CR) Zabe: [C: +2] Revert "Try to prevent selections inside ref/template nodes on Firefox" [VisualEditor/VisualEditor] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/890456 (https://phabricator.wikimedia.org/T329983) (owner: Bartosz Dziewoński)
[21:03:57] I am not sure if https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/890504/ works this way, don't we need to wait until we have a commit id?
[21:03:59] (Merged) jenkins-bot: Revert "Try to prevent selections inside ref/template nodes on Firefox" [VisualEditor/VisualEditor] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/890456 (https://phabricator.wikimedia.org/T329983) (owner: Bartosz Dziewoński)
[21:05:06] (CR) TrainBranchBot: [C: +2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/890198 (https://phabricator.wikimedia.org/T329248) (owner: Superpes15)
[21:05:43] (CR) Zabe: [C: +2] Update VE core submodule to f2528875026a [extensions/VisualEditor] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/890504 (https://phabricator.wikimedia.org/T329983) (owner: Bartosz Dziewoński)
[21:05:45] (Merged) jenkins-bot: [tawiki] Add Draft and Draft_talk namespaces [mediawiki-config] - https://gerrit.wikimedia.org/r/890198 (https://phabricator.wikimedia.org/T329248) (owner: Superpes15)
[21:05:58] !log zabe@deploy1002 Started scap: Backport for [[gerrit:890198|[tawiki] Add Draft and Draft_talk namespaces (T329248)]]
[21:06:02] T329248: Enable Draft namespace on Tamil Wikipedia - https://phabricator.wikimedia.org/T329248
[21:06:09] (I just saw that it actually uses the correct commit id)
[21:07:43] !log zabe@deploy1002 superpes and zabe: Backport for [[gerrit:890198|[tawiki] Add Draft and Draft_talk namespaces (T329248)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[21:08:12] Superpes: please test :)
[21:08:28] Yep checked everything is fine :D Thanks! zabe
[21:08:41] Works as expected :P
[21:08:54] cool syncing
[21:08:57] zabe: i made sure that the commit in the submodule repo is rebased on top of the master branch, so that when it's merged, it's a fast-forward merge and no merge commit is created. because of that i could create the commit in the parent repo pointing to what the top of the master branch will be. otherwise you're right, we'd have to wait.
[21:09:46] (well, we wouldn't *have* to, technically it can point to any commit sha1 that exists in the other repo. but we try to always update to the top of the master branch, so that we don't lose our minds trying to keep track of dependencies.)
[21:10:08] or… top of the whatever branch. "wmf/1.40.0-wmf.23" here, not "master"
[21:11:00] submodules are the worst :)
[21:13:16] :)
[21:14:51] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:890198|[tawiki] Add Draft and Draft_talk namespaces (T329248)]] (duration: 08m 52s)
[21:14:56] T329248: Enable Draft namespace on Tamil Wikipedia - https://phabricator.wikimedia.org/T329248
[21:15:50] Many thanks for you time zabe :)
[21:16:20] !log zabe@mwmaint1002:~$ mwscript namespaceDupes.php tawiki --fix # T329248
[21:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:16:24] yw
[21:19:27] (Merged) jenkins-bot: Update VE core submodule to f2528875026a [extensions/VisualEditor] (wmf/1.40.0-wmf.23) - https://gerrit.wikimedia.org/r/890504 (https://phabricator.wikimedia.org/T329983) (owner: Bartosz Dziewoński)
[21:21:49] !log zabe@deploy1002 Started scap: T329983 T330104
[21:21:54] T329983: VisualEditor "double click to edit cell" stopped working (in Firefox) - https://phabricator.wikimedia.org/T329983
[21:21:55] T330104: Text selection doesn't work in Firefox when doing translation - https://phabricator.wikimedia.org/T330104
[21:23:59] !log zabe@deploy1002 zabe: T329983 T330104 synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet
[21:24:17] MatmaRex: please test
[21:24:27] looking
[21:25:01] oh, i don't have the extension on firefox :o
[21:25:10] oops. one minute
[21:27:22] zabe: looks good
[21:27:30] syncing
[21:33:40] !log zabe@deploy1002 Finished scap: T329983 T330104 (duration: 11m 51s)
[21:33:42] MatmaRex: should be live
[21:33:46] T329983: VisualEditor "double click to edit cell" stopped working (in Firefox) - https://phabricator.wikimedia.org/T329983
[21:33:46] T330104: Text selection doesn't work in Firefox when doing translation - https://phabricator.wikimedia.org/T330104
[21:33:50] thanks zabe
[21:34:36] ye
[21:34:39] s/ye/yw
[21:35:36] !log close UTC late backport window
[21:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:46] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[21:52:48] ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T328832 (phaultfinder)
[22:00:04] Reedy, sbassett, Maryum, and manfredi: My dear minions, it's time we take the moon! Just kidding. Time for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230220T2200).
[22:49:45] (PS1) Nray: Add static "Cleopatra" page to facilitate synthetic testing of 885362 [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147)
[22:56:25] (PS2) Nray: Add static "Cleopatra" page to facilitate synthetic testing of 885362 [mediawiki-config] - https://gerrit.wikimedia.org/r/890509 (https://phabricator.wikimedia.org/T326147)
[23:18:27] (PS43) Stevemunene: Update analytics_test conf compatibility with airflow 2.3.4 connect postgresql [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580)
[23:22:13] (CR) Stevemunene: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39747/console" [puppet] - https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: Stevemunene)
[23:48:22] SRE, ops-codfw, DBA, DC-Ops, Data-Persistence: Q3:rack/setup/install db218[567] - https://phabricator.wikimedia.org/T326342 (Papaul) @Volans thank you I did already the changes manually on 2 hosts but i will run it on the one that I haven't setup yet and let you know. Also it looks like we h...
[23:59:08] (CR) Aklapper: "Please see https://www.mediawiki.org/wiki/How_to_become_a_MediaWiki_hacker and abandon this patch, as it does not change code at the right" [mediawiki-config] - https://gerrit.wikimedia.org/r/890140 (https://phabricator.wikimedia.org/T316950) (owner: Sushrith Bogi)